regex queries

Erik Hatcher Sat, 12 Nov 2005 01:28:51 -0800

For a consulting engagement, the client needed the ability to query using regular expressions. I was given permission to contribute it to Lucene and have just committed it to the trunk. This is not revolutionary at all, and is implemented in the same manner that WildcardQuery is implemented, except that the matching uses regular expressions instead of just basic */? matching. This implementation carries with it the same performance caveats that WildcardQuery has, and is possibly even slower than WildcardQuery for the same types of expressions (I haven't benchmarked it - it isn't meant to replace WildcardQuery, as the pattern syntax is different anyway).


Here is a test case showing RegexQuery in action:


public class TestRegexQuery extends TestCase {
  public void testRegex() throws Exception {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("field", "the quick brown fox jumps over the lazy dog",
                      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory);
    Query query = new RegexQuery(new Term("field", "q.[aeiou]c.*"));
    Hits hits = searcher.search(query);
    assertEquals(1, hits.length());
  }
}

The standard Java 1.4 built-in regex Pattern matching is used under the covers.
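
To see how the pattern from the test behaves on its own, here is a small standalone sketch using only java.util.regex. It assumes each indexed term must match the pattern in full (Matcher.matches()), which is my reading of the test above rather than something stated in this message. The class and method names are made up for illustration, and it is written in current Java syntax rather than 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexTermDemo {
    /** Returns the terms that the regex matches in full (hypothetical helper). */
    static List<String> matchingTerms(String regex, String[] terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            // matches() requires the whole term to match, not just a substring
            if (pattern.matcher(term).matches()) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Roughly what a simple analyzer would produce for the test sentence
        String[] terms = "the quick brown fox jumps over the lazy dog".split(" ");
        System.out.println(matchingTerms("q.[aeiou]c.*", terms));
    }
}
```

Only "quick" satisfies q.[aeiou]c.* among those tokens, which is consistent with the single hit the test expects.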

Beyond the basic RegexQuery, there is also a SpanRegexQuery allowing for sophisticated expressions using spans and regular expression matching all together, as shown in this test case:


public class TestSpanRegexQuery extends TestCase {
  public void testSpanRegex() throws Exception {
    RAMDirectory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);

    Document doc = new Document();
    doc.add(new Field("field", "the quick brown fox jumps over the lazy dog",
                      Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();

    IndexSearcher searcher = new IndexSearcher(directory);
    SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "q.[aeiou]c.*"));
    SpanTermQuery stq = new SpanTermQuery(new Term("field", "dog"));
    SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {srq, stq}, 6, true);

    Hits hits = searcher.search(query);
    assertEquals(1, hits.length());
  }
}

There is one fiddly improvement that is needed under the covers. For this type of query, as with WildcardQuery, it is vastly more efficient to find the longest prefix of the regular expression that does not contain any special regex characters, in order to narrow down the terms enumerated for consideration. The current logic is a bit too simplistic and error-prone - it simply looks for the first occurrence of any of these characters: "*[?.", such that a query for "abc.*" would start the term enumeration at "abc" in the term dictionary rather than scanning all terms. A pattern such as "abc\*" currently breaks this logic and starts the term enumeration at "abc\" which is, of course, incorrect. If anyone would like to contribute code to handle this better, I welcome it!
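
As a starting point for that, here is a rough sketch of a more careful prefix extractor. It is not the code that was committed; the class name LiteralPrefix and its method are hypothetical. It assumes java.util.regex syntax and whole-term matching, and it deliberately errs on the side of a shorter prefix (an empty prefix is always safe, just slow). It treats a backslash followed by a non-alphanumeric character as an escaped literal, drops the last literal when the next character is a quantifier that allows zero occurrences, and gives up entirely if the pattern contains an unescaped "|".

```java
public class LiteralPrefix {
    // Characters with special meaning in java.util.regex (backslash handled separately)
    private static final String META = ".[]{}()*+?^$|";

    /** Returns a literal prefix that every full match of the regex must start with. */
    static String of(String regex) {
        if (hasUnescapedAlternation(regex)) {
            return ""; // "abc|xyz" has no usable common prefix; stay conservative
        }
        StringBuilder prefix = new StringBuilder();
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\') {
                if (i + 1 >= regex.length()) break;          // dangling backslash
                char next = regex.charAt(i + 1);
                if (Character.isLetterOrDigit(next)) break;  // \d, \w, \Q, \1, ...
                prefix.append(next);                         // escaped literal such as \* or \.
                i += 2;
            } else if (META.indexOf(c) >= 0) {
                break;
            } else {
                prefix.append(c);
                i++;
            }
        }
        // "abc*" also matches "ab", so the literal before *, ? or { is not guaranteed
        if (i < regex.length() && prefix.length() > 0) {
            char stop = regex.charAt(i);
            if (stop == '*' || stop == '?' || stop == '{') {
                prefix.setLength(prefix.length() - 1);
            }
        }
        return prefix.toString();
    }

    private static boolean hasUnescapedAlternation(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '|') {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, "abc.*" yields "abc", "abc\*" yields "abc*" (the escaped star is kept as a literal), and "abc*" yields "ab".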

Further on the use of regular expression searching - while this query supports doing this on a standard index, if pattern (wildcard, regex) querying is crucial to your application, consider using term rotation during indexing and clever query creation to optimize further, as it can greatly improve query performance.
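
One way to read "term rotation" is the permuterm technique: index every rotation of each term plus an end marker, then turn a single-wildcard pattern into a prefix lookup. The sketch below illustrates that idea only. The class name, the '$' marker, and the restriction to one '*' are assumptions made for the example, and it covers simple wildcards rather than full regular expressions.

```java
import java.util.ArrayList;
import java.util.List;

public class TermRotation {
    // End-of-term marker; assumed never to occur inside a real term
    static final char END = '$';

    /** All rotations of term + END, e.g. "dog" gives dog$, og$d, g$do, $dog. */
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    /** Rewrites a pattern "X*Y" with exactly one '*' into the prefix "Y$X". */
    static String rotatedPrefix(String pattern) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }
}
```

A query for "qu*k" becomes a prefix search for "k$qu", which matches the rotation "k$quic" of "quick" without scanning unrelated terms. The trade-off is a much larger term dictionary, since each term is indexed once per rotation.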


    Erik

