dan-robertson opened a new issue, #15182:
URL: https://github.com/apache/lucene/issues/15182

   ### Description
   
   Consider a regex of the form `a(b|)c`. This should match `ac` and `abc`. 
Some regexes of this form seem to behave incorrectly. For a slightly more 
realistic example, one might want to search some logs for something like 
`[a-z-]*(|-prod|-main) `.
   
   I cobbled together some tests against a recent version of the lucene repo. I 
have three docs which respectively contain:
   - `foo-bar-baz`
   - `foo--baz`
   - `foo-test-baz`
   
   I then wrote a regex intended to match only the first two docs in several 
different ways:
   
   1. `.*foo-(bar|)-baz.*` - IllegalArgumentException: expected ')' at position 18
   2. `.*foo-(|bar)-baz.*` - 0 matches but 2 expected
   3. `.*(foo-(bar|)-baz).*` - IllegalArgumentException: expected ')' at position 20
   4. `.*(foo-(|bar)-baz).*` - 0 matches but 2 expected
   5. `.*foo-(bar|())-baz.*` - 2 matches
   6. `.*foo-(bar|()?)-baz.*` - 2 matches
   7. `.*foo-(bar|#?)-baz.*` - 2 matches
   8. `.*(foo-(bar|())-baz).*` - 2 matches
   9. `.*(foo-(bar|()?)-baz).*` - 2 matches
   10. `.*(foo-(bar|#?)-baz).*` - 2 matches
   11. `.*foo-(bar)?-baz.*` - 2 matches
   
   The first four cases seem incorrect to me.
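
   As a rough cross-check of the expected match counts, here is a minimal 
sketch that runs the first four patterns (plus number 11) through 
`java.util.regex`. This is a different engine from Lucene's `RegExp` and its 
syntax is not identical, so it only illustrates what conventional regex 
semantics give for an empty alternation branch. The class name 
`EmptyAlternationBaseline` and the helper `countMatches` are made up for this 
illustration and are not part of Lucene or of the tests mentioned here.

```java
import java.util.List;
import java.util.regex.Pattern;

public class EmptyAlternationBaseline {
    // The three document values from the report.
    static final List<String> DOCS = List.of("foo-bar-baz", "foo--baz", "foo-test-baz");

    // Counts the docs that the pattern matches in full.
    // Matcher.matches() requires the whole input to match, which is
    // assumed here to mirror how a regexp query matches a whole term.
    static long countMatches(String regex) {
        Pattern p = Pattern.compile(regex);
        return DOCS.stream().filter(d -> p.matcher(d).matches()).count();
    }

    public static void main(String[] args) {
        String[] patterns = {
            ".*foo-(bar|)-baz.*",    // case 1: empty branch last
            ".*foo-(|bar)-baz.*",    // case 2: empty branch first
            ".*(foo-(bar|)-baz).*",  // case 3: case 1 wrapped in a group
            ".*(foo-(|bar)-baz).*",  // case 4: case 2 wrapped in a group
            ".*foo-(bar)?-baz.*"     // case 11: optional group, for comparison
        };
        for (String regex : patterns) {
            System.out.println(regex + " -> " + countMatches(regex) + " matches");
        }
    }
}
```

   With `java.util.regex`, each of these patterns matches `foo-bar-baz` and 
`foo--baz` but not `foo-test-baz`, i.e. 2 matches each. The `#?` variants are 
left out because `#` has a different meaning in Lucene's syntax.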
   
   I came to this investigation after some problems with elasticsearch 
(v8.12.2, using lucene 9.9.2) where regexes following pattern number 5 also 
failed. Maybe that is some useful context.
   
   ### Version and environment details
   
   I added tests by modifying 
`lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java` with a 
base revision of `cd1a4ecc9ead8e06b08f3bc2016297525f65b37c`.
   
   Here are other details in case they are relevant:
   OS: linux (el8 with a 6.12.39 kernel)
   Java: openjdk version "24.0.1" 2025-04-15
   This is on an x86_64 box.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

