RE: switching to different parser in Pig

Olga Natkovich Tue, 25 Aug 2009 12:52:19 -0700

We don't need to package it - we only use it at compile time. There are other 
Apache projects, such as Lucene, that use JFlex.


Olga

-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@cloudera.com] 
Sent: Tuesday, August 25, 2009 11:58 AM
To: pig-dev@hadoop.apache.org
Cc: pi.so...@gmail.com
Subject: Re: switching to different parser in Pig

Santhosh,
Am I missing something about JFlex licensing? I thought that, since it
is GPL, we can't package it with Apache-licensed software, which
prevents it from being a viable option (regardless of technical merits).

-Dmitriy

On Tue, Aug 25, 2009 at 1:58 PM, Santhosh Srinivasan<s...@yahoo-inc.com> wrote:
> It's been 6 months since this topic was discussed, but we don't have
> closure on it.
> For SQL on top of Pig, we are using JFlex and CUP
> (https://issues.apache.org/jira/browse/PIG-824). If we have decided on
> the right parser, can we have a plan to move the other parsers in Pig to
> the same technology?
>
> Thanks,
> Santhosh
>
> PS: I am assuming we are not moving to Antlr.
>
>
> -----Original Message-----
> From: Alan Gates [mailto:ga...@yahoo-inc.com]
> Sent: Tuesday, February 24, 2009 10:17 AM
> To: pig-dev@hadoop.apache.org; pi.so...@gmail.com
> Subject: Re: switching to different parser in Pig
>
> Sorry, after I sent that email yesterday I realized I was not very
> clear.  I did not mean to imply that antlr didn't have good
> documentation or good error handling.  What I wanted to say was we
> want all three of those things, and it didn't appear that antlr
> provided all three, since it doesn't separate out scanner and parser.
> Also, from my viewpoint, I prefer bottom up LALR(1) parsers like yacc
> to top down parsers like javacc.  My understanding is that antlr is
> top down like javacc.  My reasoning for this preference is that parser
> books and classes have used those for decades, so there are a large
> number of engineers out there (including me :) ) who know how to work
> with them.  But maybe antlr is close enough to what we need.  I'll
> take a deeper look at it before I vote officially on which way we
> should go.
>
> As for loops and branches, I'm not saying we need those in Pig Latin.
> We need them somehow.  Whether it's better to put them in Pig Latin or
> embed Pig in an existing scripting language is an ongoing debate.  I don't
> want to make a decision now that effectively ends that debate without
> buy-in from those who feel strongly that Pig Latin should include
> those constructs.
>
> I agree with you that we should modify the logical plan to support
> this rather than add another layer.  As for active development, the
> only thing I'm aware of is we hope to start working on a more robust
> optimizer for pig soon, and that will require some additional
> functionality out of the logical operators, but it shouldn't cause any
> fundamental architectural changes.
>
> Alan.
>
>
> On Feb 24, 2009, at 1:27 AM, pi song wrote:
>
>> (1) Lack of good documentation, which makes it hard and time-consuming
>> to learn javacc and make changes to the Pig grammar
>> <== ANTLR is very very well documented.
>> http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
>> http://media.pragprog.com/titles/tpantlr/toc.pdf
>> http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home
>>
>> (2) No easy way to customize error handling and error messages
>> <== ANTLR has very extensive error handling support
>> http://media.pragprog.com/titles/tpantlr/errors.pdf
>>
>> (3) Single path that performs both tokenizing and parsing
>> <== What is the advantage of decoupling tokenizing and parsing?
>>
>> In addition, "Composite Grammar" is very useful for keeping the parser
>> modular. Things that can be treated as sub-languages, such as the bag
>> schema definition, can be written and unit tested separately.
>>
>> ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar
>> development very efficient. Think of an IDE that helps you debug your
>> code (which here is the grammar).
>>
>> One question: is there any use case for branching and loops? The
>> current Pig is more like a query (declarative) language. I don't really
>> see how loop constructs would fit. I think what Ted mentioned is more
>> about embedding Pig in other languages and using those languages to do
>> loops.
>>
>> We should think about how the logical plan layer can be made simpler
>> for external use so we don't have to introduce a new layer. Is there
>> any major active development on it? Currently I have more spare time
>> and should be able to help out. (BTW, I'm slow because this is just my
>> hobby. I don't want to drag you guys.)
>>
>> Pi Song
>>
>> On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia
>> <niteshbhatia...@gmail.com> wrote:
>>
>>> Hi
>>> I got this info from javacc mailing lists. This may prove helpful:
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>> -----Original Message-----
>>> From: Ken Beesley [mailto:ken....@xrce.xerox.com]
>>> Sent: Wednesday, August 18, 2004 2:56 PM
>>> To: javacc
>>> Subject: [JavaCC] Alternatives to JavaCC (was Hello All)
>>>
>>> Vicas wrote:
>>>
>>> Hello All
>>>
>>> Kindly let me know of other parsers available that do the same job as
>>> javacc.
>>>
>>> It would be very nice of you if you can send me some documentation
>>> related to this.
>>>
>>> Thanks Vikas
>>>
>>> (Correction and clarifications to the following would be _very_
>>> welcome. I'm very likely out of date.)
>>>
>>> Of course, no two software tools are likely to do _exactly_ the same
>>> job. Someone already pointed you to ANTLR, which is probably the
>>> best-known alternative to JavaCC. Another possibility is SableCC.
>>> http://sablecc.org
>>>
>>> The criteria include stability, documentation, language of the parser
>>> generated, and abstract-syntax-tree building.
>>>
>>> When I last looked (a couple of years ago) at ANTLR, SableCC and
>>> JavaCC, I chose JavaCC for the following reasons:
>>>
>>> 1. ANTLR could not handle Unicode input. Things change, of course, so
>>> ANTLR might now be more Unicode-friendly. Unicode was important to me,
>>> so this was a big factor in my decision.
>>>
>>> On the plus side for ANTLR, it has better abstract-syntax-tree
>>> building capabilities (in my opinion) than JJTree/JavaCC. You can
>>> learn to use JJTree commands, but it's not easy for most people.
>>>
>>> And ANTLR can generate either a Java or a C++ parser. JavaCC generates
>>> only Java parsers.
>>>
>>> Another concern about ANTLR was that it was reputed to change a lot as
>>> the guru, Terence Parr, experimented with new syntax and
>>> functionality. JavaCC, at least at the time, was reputed to be more
>>> stable, perhaps stable to a fault. I wanted stability and reliability.
>>>
>>> 2. SableCC is much like JavaCC; it generates a Java parser from a
>>> grammar description; but it had, in my opinion, less flexible
>>> abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I
>>> looked at it), the AST it built was always a direct reflection of your
>>> grammar, generating one tree node for each grammar expansion involved
>>> in a parse, much like using JavaCC with Java Tree Builder (JTB
>>> http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the
>>> alternative to using JJTree.
>>>
>>> Using SableCC, or the combination JavaCC/JTB, should be _very_ similar
>>> indeed.
>>>
>>> In my opinion, SableCC and JavaCC/JTB have made a conscious choice to
>>> simplify AST building--you get trees that reflect the expansions in
>>> your grammar. Period. But often these default trees will be big, full
>>> of extraneous nodes that reflect precedence hierarchies in the
>>> recursive-descent parsing. If you want to have more control over AST
>>> building, to get more compact and tailored ASTs, you need to pay the
>>> price of learning JJTree.
>>>
>>> Assuming that you need to build ASTs, with JavaCC you have the choice
>>> between JJTree and JTB. With SableCC, when I last looked at it, you
>>> only get the JTB-like option.
>>>
>>> *******
>>>
>>> (Again, corrections and expansions would be much appreciated.)
>>>
>>> Ken Beesley
>>>
>>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> On Mon, Feb 23, 2009 at 10:06 PM, Alan Gates <ga...@yahoo-inc.com>
>>> wrote:
>>>> We looked into antlr.  It appears to be very similar to javacc, with
>>>> the added feature that the java code it generates is humanly
>>>> readable.  That isn't why we want to switch off of javacc.  Olga
>>>> listed the 3 things we want out of a parser that javacc isn't giving
>>>> us (good docs, easy customization of error handling, decoupling of
>>>> scanning and parsing).  So antlr doesn't look viable.
>>>>
>>>> In response to Pi's suggestion that we could use the logical plan, I
>>>> hope we could use something close to it.  Whatever we choose we want
>>>> it to be flexible enough to represent richer language constructs
>>>> (like branch and loop).  I'm not sure our current logical plan can do
>>>> that.  At the same time, we don't need another layer of translation
>>>> (we already have logical -> physical -> mapreduce).  I would like to
>>>> find a representation that could handle expressing the syntax and
>>>> what is currently the logical plan.
>>>>
>>>> Alan.
>>>>
>>>> On Feb 20, 2009, at 5:15 PM, pi song wrote:
>>>>
>>>>> Should be pretty close but we may need to clean up the interface a
>>>>> bit. Then the new parser module can be switched in easily.
>>>>> BTW, have we already got the solution for the new parser generator?
>>>>>
>>>>> Pi
>>>>>
>>>>>
>>>>>> On Fri, Feb 20, 2009 at 9:03 PM, Ted Dunning <ted.dunn...@gmail.com>
>>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Probably nearly the same effect as you suggest.  Are the concepts
>>>>>> at the logical plan layer similar to those expressed in pig latin?
>>>>>> Or has a significant transformation occurred by then?
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 20, 2009 at 1:59 AM, pi song <pi.so...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sounds good but how about exposing the logical plan layer instead?
>>>>>>> Wouldn't that yield the same effect?  From python for example you
>>>>>>> can still construct a logical plan and give it to Pig to execute.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Nitesh Bhatia
>>> Dhirubhai Ambani Institute of Information & Communication Technology
>>> Gandhinagar
>>> Gujarat
>>>
>>> "Life is never perfect. It just depends where you draw the line."
>>>
>>> visit:
>>> http://www.awaaaz.com - connecting through music
>>> http://www.volstreet.com - lets volunteer for better tomorrow
>>> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
>>>
>
>
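
On the question raised in the thread about what is gained by decoupling
tokenizing from parsing, here is a minimal hand-written sketch in Python.
It is not the output of JFlex, CUP, javacc or antlr, and it is not Pig's
grammar; the toy arithmetic language and every name in it (Token, scan,
parse, ScanError, ParseError) are assumptions made up for illustration.
It only aims to show that, with a separate scanner, the token stream can
be unit tested on its own and lexical and syntax errors can carry their
own messages.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # NUM, PLUS, STAR, LPAREN, RPAREN or EOF
    text: str
    pos: int   # character offset, kept only for error messages


class ScanError(Exception):
    """Raised by the scanner for characters outside the toy language."""


class ParseError(Exception):
    """Raised by the parser when the token sequence is not well formed."""


_TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("PLUS", r"\+"),
    ("STAR", r"\*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),
    ("BAD", r"."),
]
_TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in _TOKEN_SPEC))


def scan(source: str) -> List[Token]:
    """Scanner stage: characters in, tokens out. Knows nothing of grammar."""
    tokens = []
    for m in _TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "BAD":
            raise ScanError(
                f"unexpected character {m.group()!r} at offset {m.start()}")
        tokens.append(Token(kind, m.group(), m.start()))
    tokens.append(Token("EOF", "", len(source)))
    return tokens


def parse(tokens: List[Token]) -> int:
    """Parser stage: evaluates
    expr := term ('+' term)* ; term := factor ('*' factor)* ;
    factor := NUM | '(' expr ')'
    """
    pos = 0

    def peek() -> Token:
        return tokens[pos]

    def expect(kind: str) -> Token:
        nonlocal pos
        tok = tokens[pos]
        if tok.kind != kind:
            raise ParseError(
                f"expected {kind} but found {tok.kind} at offset {tok.pos}")
        pos += 1
        return tok

    def factor() -> int:
        if peek().kind == "LPAREN":
            expect("LPAREN")
            value = expr()
            expect("RPAREN")
            return value
        return int(expect("NUM").text)

    def term() -> int:
        value = factor()
        while peek().kind == "STAR":
            expect("STAR")
            value *= factor()
        return value

    def expr() -> int:
        value = term()
        while peek().kind == "PLUS":
            expect("PLUS")
            value += term()
        return value

    result = expr()
    expect("EOF")
    return result
```

Usage would look like parse(scan("2 + 3 * (4 + 1)")). Note that the
parser here is top down (recursive descent); a bottom-up LALR(1) parser
such as one generated by CUP would consume the same token list, which is
the point of keeping the two stages apart.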
