[PR] Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 [lucenenet]

via GitHub Fri, 25 Oct 2024 09:36:49 -0700


paulirwin opened a new pull request, #996:
URL: https://github.com/apache/lucenenet/pull/996


   - [X] You've read the [Contributor 
Guide](https://github.com/apache/lucenenet/blob/main/CONTRIBUTING.md) and [Code 
of Conduct](https://www.apache.org/foundation/policies/conduct.html).
   - [X] You've included unit or integration tests for your change, where 
applicable.
   - [ ] You've included inline docs for your change, where applicable.
   - [X] There's an open issue for the PR that you are making. If you'd like to 
propose a change, please [open an 
issue](https://github.com/apache/lucenenet/issues/new/choose) to discuss the 
change or find an existing issue.
   
   Migrate ANTLRv3 to v4, and automate lexer/parser generation.
   
   Fixes #977
   
   ## Description
   
   DRAFT NOTE: This is currently a draft PR because the tests fail to run, but I 
wanted to get this out there in case anyone has feedback or, if you have IL 
experience, can spot the bug.
   
   TODO:
   - [ ] Solve unit test failures
   - [ ] Remove old ANTLRv3 mentions
   - [ ] Remove old ANTLRv3 dependency overrides in csproj
   - [ ] Determine if we need to add an error listener to match Lucene 4.8's 
error-handling logic in the grammar inline Java code
   - [ ] Test to make sure the Antlr4BuildTasks work in CI on GitHub and Azure 
DevOps
   
   The ANTLRv3 NuGet package we depend on has not been maintained and targets 
.NET Standard 1.6, which is reporting some vulnerabilities. Additionally, our 
current codebase has what appears to be hand-ported lexer and parser code from 
the original Java, which in some cases has been updated manually as well. The 
original ANTLRv3 grammar is not in our repo. Ideally, we would generate the 
lexer and parser using ANTLR from the grammar.
   
   This adds the 4.8 grammar to the repo, but updates its syntax to use the v4 
format. Notably, ANTLRv4 removes the old v3 AST generation and instead lets you 
walk the parse tree however you see fit. This means that the v3 AST operators in 
the old grammar (`^` for root nodes, `!` for excluded nodes) are no longer 
necessary or supported, so walking the syntax tree requires a different approach. 
Additionally, empty lexer tokens like `AT_CALL` are no longer supported, so 
that usage was replaced with a new `call` rule, with a corresponding 
`LUCENENET-specific` callout. Error handling is now done differently, so no 
inline C# needs to be added to the grammar. Other than these changes, the 
grammar is highly similar to the upstream 4.8 v3 grammar. I decided to stick 
with the 4.8 grammar instead of updating to the most recent grammar, which is 
already in v4 format, to keep it as close as possible to the 4.8 code.
   
   The added Antlr4BuildTasks NuGet package (MIT-licensed) generates the lexer 
and parser code at compile time now, which reduces manual error and shrinks the 
size of the code we have to maintain. By removing the lexer and parser from the 
codebase, this PR results in a net reduction of about 3k lines of code. The 
package is also configured to generate a Listener base class, which is the v4 approach 
that is closest to the existing 4.8 code (as opposed to a Visitor that returns 
nodes, as the latest Lucene code uses).
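   
   For reviewers unfamiliar with Antlr4BuildTasks, here is a minimal sketch of 
how this kind of wiring might look in a csproj. It is illustrative only: the 
grammar path, the namespace, and both package versions are assumptions rather 
than values copied from this PR, and the `Listener`/`Visitor`/`Package` 
metadata names follow the Antlr4BuildTasks README as I understand it.
   
   ```xml
   <!-- Sketch only: the path, namespace, and versions below are assumed for illustration. -->
   <ItemGroup>
     <!-- ANTLR v4 runtime that the generated lexer/parser link against -->
     <PackageReference Include="Antlr4.Runtime.Standard" Version="4.13.1" />
     <!-- Build-time only: runs the ANTLR tool during compilation -->
     <PackageReference Include="Antlr4BuildTasks" Version="12.8" PrivateAssets="all" />
   </ItemGroup>
   
   <ItemGroup>
     <!-- Hypothetical grammar location; generated code lands in the build output, not in source control -->
     <Antlr4 Include="JS\Javascript.g4">
       <Listener>true</Listener>   <!-- emit the Listener base class -->
       <Visitor>false</Visitor>    <!-- skip the Visitor base class -->
       <Package>Lucene.Net.Expressions.JS</Package> <!-- namespace for the generated types (assumed) -->
     </Antlr4>
   </ItemGroup>
   ```
   
   With a setup along these lines, the lexer, parser, and listener base class 
are regenerated from the grammar on each build, so none of them need to be 
checked in.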
   
   The JavascriptCompiler C# code now looks a bit different as a result of 
implementing the listener, but this seemed like a cleaner and more maintainable 
approach than recursively walking every parse tree child. The `Context` classes 
that are generated for each rule are now strongly-typed, so you get the benefit 
of not having to compare rule/token indices. Another goal of this PR is to 
upgrade this dependency and change the approach in code without changing any 
existing tests or public API surface. As a result, the listener implementation 
is a private nested class instead of public, and this also helps keep the logic 
mostly in line with where it was before (just a little out of order).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
