[ https://issues.apache.org/jira/browse/LUCENE-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732060#action_12732060 ]
Steven Rowe edited comment on LUCENE-1683 at 7/16/09 11:12 AM:
---------------------------------------------------------------

bq. ... why is RegexQuery treating the trailing "." as a ".*" instead?

JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which anchors the match at the start of the input but not at the end. That is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use Matcher.matches() instead of lookingAt().

      was (Author: steve_rowe):
    bq. ... why is RegexQuery treating the trailing "." as a ".*" instead?

JavaUtilRegexCapabilities.match() is implemented as j.u.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".

The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

The fix is to switch JavaUtilRegexCapabilities.match() to use j.u.Matcher.matches() instead of lookingAt().

> RegexQuery matches terms the input regex doesn't actually match
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1683
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1683
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.3.2
>            Reporter: Trejkaz
>
> I was writing some unit tests for our own wrapper around the Lucene regex
> classes, and got tripped up by something interesting.
> The regex "cat." will match "cats" but also anything with "cat" and 1+
> following letters (e.g. "cathy", "catcher", ...)  It is as if there is an
> implicit .* always added to the end of the regex.
> Here's a unit test for the behaviour I would expect myself:
>     @Test
>     public void testNecessity() throws Exception {
>         File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
>         IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
>         try {
>             Document doc = new Document();
>             doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
>             writer.addDocument(doc);
>         } finally {
>             writer.close();
>         }
>         IndexReader reader = IndexReader.open(dir);
>         try {
>             TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
>             assertEquals("Wrong term", "cats", terms.term());
>             assertFalse("Should have only been one term", terms.next());
>         } finally {
>             reader.close();
>         }
>     }
> This test fails on the term check with terms.term() equal to "cathy".
> Our workaround is to mangle the query like this:
>     String fixed = String.format("(?:%s)$", original);
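
The difference between lookingAt() and matches() discussed in the comment can be reproduced with java.util.regex alone. The following is a minimal sketch, not the contrib code itself: the class RegexMatchDemo and its helper methods are hypothetical names chosen for illustration, and no Lucene classes are involved.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene's contrib regex package.
final class RegexMatchDemo {

    // Mirrors the behaviour described for lookingAt(): the match is
    // anchored at the start of the term but may stop before its end.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    // Proposed behaviour: the whole term must match the pattern.
    static boolean fullMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    // The reporter's workaround: wrap the pattern in a non-capturing
    // group and append "$", so that even lookingAt() has to reach the end.
    static boolean anchoredPrefixMatch(String regex, String term) {
        String fixed = String.format("(?:%s)$", regex);
        return Pattern.compile(fixed).matcher(term).lookingAt();
    }

    public static void main(String[] args) {
        String regex = "cat.";
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                    + " lookingAt=" + prefixMatch(regex, term)
                    + " matches=" + fullMatch(regex, term)
                    + " workaround=" + anchoredPrefixMatch(regex, term));
        }
    }
}
```

With the pattern "cat.", lookingAt() accepts both "cats" and "cathy", while matches() and the "(?:%s)$" workaround accept only "cats". All three reject "cat", since the "." still needs one character to consume.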