I don't think it is unreasonable to require parser developers to understand 
JavaCC's LOOKAHEAD hints, in order to keep the parser efficient and free of 
ambiguities.
Firstly, it is not that difficult to learn. With the global LOOKAHEAD set to 1, 
JavaCC performs ambiguity checking automatically during the Maven build 
lifecycle. If someone changes the parser, all they need to do is add enough 
LOOKAHEAD hints by following the warning messages; that is not difficult for a 
developer, I think.
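For reference, the global setting is controlled by the javacc-maven-plugin's 
lookAhead parameter. A minimal, hypothetical sketch of such a pom.xml fragment 
(plugin version and surrounding executions omitted; Calcite's actual build 
wiring may differ) could look like:

```
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>javacc-maven-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>javacc</goal>
      </goals>
      <configuration>
        <!-- LL(1) globally; local LOOKAHEAD hints cover the exceptions -->
        <lookAhead>1</lookAhead>
      </configuration>
    </execution>
  </executions>
</plugin>
```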
And not that many LOOKAHEAD hints need to be added. In the current version of 
Calcite there are already 60 LOOKAHEAD hints in Parser.jj[1], and I added 37 
new ones[2] in the PR to make things work. Parser.jj is now about 6800 lines of 
code, which works out to roughly one LOOKAHEAD hint per 70 lines of grammar. 
This is not much.
Also, the smaller the LOOKAHEAD is, the more readable the ParseException can 
be. You can see some "fixme"s in SqlParserTest about ParseException error 
messages get resolved by the change[3].
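To illustrate what such a hint looks like, here is a toy production (made up 
for this mail, not taken from Parser.jj) where a local syntactic LOOKAHEAD 
resolves a choice conflict so the global LOOKAHEAD can stay at 1:

```
void Statement() :
{}
{
    // Both alternatives can start with the same token, so with a global
    // LOOKAHEAD of 1 JavaCC reports a choice conflict here. A local
    // LOOKAHEAD(2) on the first alternative resolves it without raising
    // the lookahead anywhere else in the grammar.
    LOOKAHEAD(2) SelectIntoStatement()
|
    SelectStatement()
}
```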

Best,
Hongze


[1] 
https://github.com/apache/calcite/blob/master/core/src/main/codegen/templates/Parser.jj
[2] 
https://github.com/apache/calcite/blob/656608339f901fd5a6b919ee79a37ba25c16b0d2/core/src/main/codegen/templates/Parser.jj
[3] 
https://github.com/apache/calcite/pull/1041/files#diff-ef277e7d229cfc747255eb61ce6fd46dL1386



From: Julian Hyde
Date: 2019-02-14 04:12
To: dev
Subject: Re: Global LOOKAHEAD of the SQL parser
The performance improvements are impressive - basically the parser got 10x 
faster.
 
However, the parser is now more difficult to develop. Is this worth the speedup?
 
Julian
 
 
> On Feb 13, 2019, at 5:30 AM, Hongze Zhang <notify...@126.com> wrote:
> 
> Thank you very much for the quick response, Julian!
> 
> 
> The parser will be broken if we simply set LOOKAHEAD to 1. There will be a 
> lot of lookahead warnings, either newly produced or previously hidden 
> (JavaCC does not perform the LA check when the global LOOKAHEAD is larger 
> than 1[1]), and many test cases will start to fail. I could still run the 
> benchmark because the SQL it generates does not touch the broken parts of 
> the parser. To solve the problem properly I have done some tentative work 
> and opened a JIRA issue[2] with a PR.
> And I didn't change the global LOOKAHEAD for Babel in the PR. This is 
> because Babel turns a lot of reserved keywords into non-reserved ones[3]; if 
> we set the LOOKAHEAD to 1, JavaCC generates more choice conflicts for Babel 
> than for the default parser[4]. To deal with this we would have to add extra 
> lookahead hints to the default parser's code, and the default parser would 
> lose performance because of those hints. I would suggest accepting that 
> "Babel is powerful but slower" and not optimizing Babel. But if there is a 
> way to easily solve the conflicts without touching the default parser, I 
> would be a fan of optimizing Babel too. What do you think?
> 
> 
> Best,
> Hongze
> 
> 
> 
> 
> [1] https://javacc.org/javaccgrm
> [2] https://issues.apache.org/jira/browse/CALCITE-2847
> [3] 
> https://github.com/apache/calcite/blob/2102f1f5442fa271c258b7754da8df07d65847ec/babel/src/main/codegen/config.fmpp#L338
> [4] https://www.dropbox.com/s/wpap95dko4wnlcc/babel_warnings.log?dl=0
> At 2019-02-13 04:11:33, "Julian Hyde" <jh...@apache.org> wrote:
>> Does the parser produce correct results if you set LOOKAHEAD to 1? If so, we 
>> should use that.
>> 
>> Do the extension parsers (e.g. Babel) also work with that setting?
>> 
>> We thought we were using as little lookahead as we could get away with, but 
>> maybe we were wrong.
>> 
>>> On Feb 12, 2019, at 8:24 AM, Hongze Zhang <notify...@126.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> 
>>> Recently I have spent some time playing with Calcite's built-in SQL 
>>> parsers, and I am now interested in the reason why the global LOOKAHEAD is 
>>> set to 2[1] by default.
>>> I ran a comparative benchmark using the ParserBenchmark util class, and 
>>> the output log shows a visible parsing performance improvement after 
>>> setting the global LOOKAHEAD to 1. For example, the metric 
>>> "ParserBenchmark.parseCached" dropped from 1693.821 ± 1921.925 us/op[2] to 
>>> 655.452 ± 181.100 us/op[3].
>>> 
>>> 
>>> JavaCC generates Java methods like jj_2_**(...) for LL(k) (k > 1) 
>>> grammars[4], and I am almost sure this is less efficient than the code 
>>> generated for an LL(1) grammar. It would be great if we could somehow take 
>>> advantage of JavaCC's LL(1) mode.
>>> And of course I can see there were trade-off considerations behind not 
>>> using LL(1) by default, so I searched the dev list and the JIRA cases but 
>>> found nothing. Does anyone have information about that?
>>> 
>>> 
>>> Best,
>>> Hongze
>>> 
>>> 
>>> [1] 
>>> https://github.com/apache/calcite/blob/883666929478aabe07ee5b9e572c43a6f1a703e2/core/pom.xml#L304
>>> [2] https://www.dropbox.com/s/il6nodc44dzo0rz/bench_la2.log?dl=0
>>> [3] https://www.dropbox.com/s/4rrou71siskdhhm/bench_la1.log?dl=0
>>> [4] 
>>> http://www.cs.tau.ac.il/~msagiv/courses/lab/Shai2/tools/javacc/examples/JavaGrammars/OPTIMIZING
>>> 
