paulirwin opened a new pull request, #996: URL: https://github.com/apache/lucenenet/pull/996
- [X] You've read the [Contributor Guide](https://github.com/apache/lucenenet/blob/main/CONTRIBUTING.md) and [Code of Conduct](https://www.apache.org/foundation/policies/conduct.html).
- [X] You've included unit or integration tests for your change, where applicable.
- [ ] You've included inline docs for your change, where applicable.
- [X] There's an open issue for the PR that you are making. If you'd like to propose a change, please [open an issue](https://github.com/apache/lucenenet/issues/new/choose) to discuss the change or find an existing issue.

Migrate ANTLRv3 to v4, and automate lexer/parser generation.

Fixes #977

## Description

DRAFT NOTE: This is currently a draft PR because the tests fail to run, but wanted to get this out there in case anyone has any feedback or, if you have IL experience, can spot the bug.

TODO:
- [ ] Solve unit test failures
- [ ] Remove old ANTLRv3 mentions
- [ ] Remove old ANTLRv3 dependency overrides in csproj
- [ ] Determine if we need to add an error listener to match Lucene 4.8's error-handling logic in the grammar inline Java code
- [ ] Test to make sure the Antlr4BuildTasks work in CI on GitHub and Azure DevOps

The ANTLRv3 NuGet package we depend on has not been maintained, and targets .NET Standard 1.6 which is reporting some vulnerabilities. Additionally, our current codebase has what appears to be hand-ported lexer and parser code from the original Java, which in some cases has been updated manually as well. The original ANTLRv3 grammar is not in our repo. Ideally, we would generate the lexer and parser using ANTLR from the grammar.

This adds the 4.8 grammar to the repo, but updates its syntax to use the v4 format. Notably, ANTLRv4 removes the old v3 AST generation and instead lets you walk the parse tree however you see fit. This means that things like "root nodes" (indicated by `!`) in the old grammar are no longer necessary or supported, and thus require a different approach to walking the syntax tree.
Additionally, empty lexer tokens like `AT_CALL` are no longer supported, so that use was changed into a new `call` rule instead, with a corresponding `LUCENENET-specific` callout. Error handling is done another way now, so there is no need to add inline C# to the grammar. Other than these changes, the grammar is highly similar to the upstream 4.8 v3 grammar. I decided to stick with the 4.8 grammar instead of updating to the most recent grammar, which is already in v4 format, to keep it as close as possible to the 4.8 code.

The added Antlr4BuildTasks NuGet package (MIT-licensed) now generates the lexer and parser code at compile time, which reduces manual error and shrinks the amount of code we have to maintain. By removing the lexer and parser from the codebase, this PR results in a net reduction of about 3k lines of code.

The build is also configured to generate a Listener base class, which is the v4 approach that is closest to the existing 4.8 code (as opposed to a Visitor that returns nodes, as the latest Lucene code uses). The JavascriptCompiler C# code now looks a bit different as a result of implementing the listener, but this seemed like a cleaner and more maintainable approach than recursively walking every parse tree child. The `Context` classes that are generated for each rule are now strongly-typed, so you get the benefit of not having to compare rule/token indices.

Another goal of this PR is to upgrade this dependency and change the approach in code without changing any existing tests or public API surface. As a result, the listener implementation is a private nested class instead of public, and this also helps keep the logic mostly in line with where it was before (just a little out of order).
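For reference, the build-time generation described above could be wired up in the csproj roughly as follows. This is a sketch only, not the configuration from this PR: the grammar path is a hypothetical placeholder, and the `Listener` and `Visitor` item metadata names are assumed from the Antlr4BuildTasks documentation.

```xml
<!-- Sketch only; the grammar path below is a hypothetical placeholder. -->
<!-- Assumes the project already references the Antlr4BuildTasks package and an ANTLR v4 runtime package. -->
<ItemGroup>
  <!-- Each Antlr4 item names a grammar to compile into lexer/parser code at build time. -->
  <Antlr4 Include="Path/To/Grammar.g4">
    <!-- Assumed metadata: emit a listener base class, skip the visitor. -->
    <Listener>true</Listener>
    <Visitor>false</Visitor>
  </Antlr4>
</ItemGroup>
```

With a setup along these lines, the generated lexer, parser, and listener base class would live in the build output rather than in source control, which is where the line-count reduction comes from.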