[
https://issues.apache.org/jira/browse/GEODE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jinwoo Hwang updated GEODE-10508:
---------------------------------
Fix Version/s: 2.1.0
> Remediation of ANTLR Nondeterminism Warnings in OQL Grammar
> -----------------------------------------------------------
>
> Key: GEODE-10508
> URL: https://issues.apache.org/jira/browse/GEODE-10508
> Project: Geode
> Issue Type: Improvement
> Reporter: Jinwoo Hwang
> Assignee: Jinwoo Hwang
> Priority: Major
> Fix For: 2.1.0
>
>
> h2. Summary
> Fix nondeterminism warnings generated by ANTLR during compilation of the OQL
> (Object Query Language) grammar file. The warnings indicate ambiguity in the
> parser that prevents it from deterministically choosing between alternative
> production rules.
> h2. Problem Statement
> h3. Current Behavior
> During the {{generateGrammarSource}} task, ANTLR produces the following
> nondeterminism warnings:
> {noformat}
> > Task :geode-core:generateGrammarSource
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:574:40:
>
> warning:nondeterminism between alts 1 and 2 of block upon
> k==1:"sum","avg","min","max","count"
> k==2:TOK_LPAREN
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:578:13:
>
> warning:nondeterminism between alts 1 and 2 of block upon
> k==1:"sum","avg","min","max","count"
> k==2:TOK_LPAREN
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:961:26:
>
> warning:nondeterminism between alts 1 and 2 of block upon
> k==1:"distinct"
> k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...
> /home/runner/work/geode/geode/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g:979:17:
>
> warning:nondeterminism between alts 1 and 2 of block upon
> k==1:"distinct"
> k==2:TOK_LPAREN,TOK_PLUS,TOK_MINUS,TOK_DOLLAR,...
> {noformat}
> h3. Root Cause Analysis
> *Lines 574 & 578 (projection rule):*
> The parser cannot distinguish between {{aggregateExpr}} and {{expr}}
> alternatives when it encounters aggregate function keywords ({{{}sum{}}},
> {{{}avg{}}}, {{{}min{}}}, {{{}max{}}}, {{{}count{}}}) because:
> * These keywords can start an aggregate expression like {{sum(field)}}
> * The same keywords can also be used as identifiers in regular expressions
> * Without lookahead, the parser cannot determine which production rule to
> apply
> *Lines 961 & 979 (aggregateExpr rule):*
> The optional {{distinct}} keyword creates ambiguity:
> * The parser cannot decide whether to match the optional {{distinct}}
> keyword or skip it and proceed directly to the expression
> * This is because both alternatives are valid, and the default behavior
> doesn't specify preference
> h2. Proposed Solution
> h3. 1. Add Syntactic Predicates (Lines 574 & 578)
> Use ANTLR syntactic predicates to perform lookahead and resolve ambiguity in
> the {{projection}} rule:
> {code}
> Unable to find source-code formatter for language: antlr. Available languages
> are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go,
> groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc,
> perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml,
> yaml// Before:
> ( tok1:aggregateExpr{node = #tok1;} | tok2:expr{node = #tok2;})
> // After:
> ( (("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=> tok1:aggregateExpr{node =
> #tok1;} | tok2:expr{node = #tok2;})
> {code}
> The predicate {{(("sum"|"avg"|"min"|"max"|"count") TOK_LPAREN)=>}} tells the
> parser:
> * Look ahead to check if an aggregate keyword is followed by a left
> parenthesis
> * If true, choose the {{aggregateExpr}} alternative
> * Otherwise, choose the {{expr}} alternative
> h3. 2. Add Greedy Option (Lines 961 & 979)
> Use the {{greedy}} option for optional {{distinct}} keywords in the
> {{aggregateExpr}} rule:
> {code}
> Unable to find source-code formatter for language: antlr. Available languages
> are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go,
> groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc,
> perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml,
> yaml// Before:
> TOK_LPAREN ("distinct"! {distinctOnly = true;} ) ? tokExpr1:expr TOK_RPAREN
> // After:
> TOK_LPAREN (options {greedy=true;}: "distinct"! {distinctOnly = true;} ) ?
> tokExpr1:expr TOK_RPAREN
> {code}
> The {{greedy}} option instructs the parser to match the {{distinct}} keyword
> whenever it appears, removing ambiguity.
> h3. 3. Fix Test Compatibility Issue
> The grammar changes will alter token numbering in the generated lexer. Update
> {{AbstractCompiledValueTestJUnitTest.java}} to use constants instead of
> hardcoded token values:
> {code:java}
> // Before:
> new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2}, 89)
> // After:
> import org.apache.geode.cache.query.internal.parse.OQLLexerTokenTypes;
> ...
> new CompiledJunction(new CompiledValue[] {compiledValue1, compiledValue2},
> OQLLexerTokenTypes.LITERAL_or)
> {code}
> *Rationale:* Adding syntactic predicates changes token numbering (e.g.,
> {{LITERAL_or}} changes from 89 to 94). Using the constant ensures tests
> remain correct regardless of future grammar modifications.
> h2. Implementation Plan
> h3. Phase 1: Grammar Modifications
> # Update
> {{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}}
> # Add syntactic predicates to the {{projection}} rule (2 locations)
> # Add {{greedy}} option to {{aggregateExpr}} rule (2 locations)
> # Add inline comments explaining the reasoning for each change
> h3. Phase 2: Test Updates
> # Update {{AbstractCompiledValueTestJUnitTest.java}}
> # Add import for {{OQLLexerTokenTypes}}
> # Replace hardcoded value {{89}} with {{OQLLexerTokenTypes.LITERAL_or}}
> # Add comments explaining why the constant is used
> h3. Phase 3: Validation
> # Run {{./gradlew :geode-core:generateGrammarSource}} to verify no warnings
> # Run {{./gradlew :geode-core:test --tests
> "org.apache.geode.cache.query.internal.parse.*"}} to verify parser tests pass
> # Run {{./gradlew :geode-core:test --tests
> "org.apache.geode.cache.query.internal.Abstract*"}} to verify updated test
> passes
> # Run full {{geode-core}} test suite to ensure no regressions
> h2. Expected Outcomes
> h3. Success Criteria
> (/) Zero nondeterminism warnings from ANTLR grammar generation
> (/) All existing query parsing tests pass
> (/) {{AbstractCompiledValueTestJUnitTest}} passes with updated token usage
> (/) No behavior changes to OQL query parsing or execution
> (/) Token numbering is stable and predictable
> h3. Impact Assessment
> *Risk Level:* Low
> * Changes only affect parser generation, not runtime behavior
> * Syntactic predicates and greedy options are standard ANTLR features
> * All test coverage validates correct parsing behavior
> *Performance Impact:* None
> * Grammar changes only affect compile-time parser generation
> * No runtime performance impact
> *Compatibility:* Maintained
> * No changes to OQL syntax or semantics
> * Fully backward compatible with existing queries
> h2. Files to be Modified
> #
> {{geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g}}
> (Grammar file)
> #
> {{geode-core/src/test/java/org/apache/geode/cache/query/internal/AbstractCompiledValueTestJUnitTest.java}}
> (Test file)
> h2. References
> * [ANTLR 2 Documentation - Syntactic
> Predicates|https://www.antlr2.org/doc/predicates.html]
> * [ANTLR 2 Documentation - Greedy
> Subrules|https://www.antlr2.org/doc/options.html]
> * Apache Geode OQL Documentation
> h2. Testing Strategy
> h3. Unit Tests
> * Verify all query parser tests pass
> * Verify aggregate function parsing works correctly
> * Verify distinct keyword handling in aggregate functions
> h3. Integration Tests
> * Run OQL query tests with aggregate functions
> * Test queries with projection containing aggregate expressions
> * Test queries mixing aggregate and non-aggregate projections
> h3. Regression Tests
> * Run full geode-core test suite
> * Verify no changes to generated parser behavior
> * Confirm token numbering stability
--
This message was sent by Atlassian Jira
(v8.20.10#820010)