regex-based query contribution

Erik Hatcher Wed, 12 Oct 2005 16:45:45 -0700

I've developed normal and span-based Query implementations that use regex to match index terms rather than the simplified WildcardQuery. This allows for queries like "abc[0-9]xyz" that would match abc1xyz, but not abc12xyz, for example.
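To illustrate the matching semantics only, here is a minimal sketch that uses plain java.util.regex rather than the Query classes themselves. The class name RegexTermFilter and the method matchingTerms are hypothetical, and the sketch assumes the regex must match a whole term rather than a substring of it, which is what the abc12xyz example implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the contributed Query implementation.
class RegexTermFilter {
    // Returns the terms that the regex matches in full (not merely as a substring).
    static List<String> matchingTerms(String regex, List<String> terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("abc1xyz", "abc12xyz", "abc9xyz", "abcxyz");
        // prints [abc1xyz, abc9xyz]
        System.out.println(matchingTerms("abc[0-9]xyz", terms));
    }
}
```

A real query would walk the index's term enumeration instead of an in-memory list, but the per-term test would look much the same.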

I've seen a lot of interest lately in being able to do a phrase query with a nested wildcard term inside, such as "the q.*k brown f.x". I turn a query like that into a SpanNearQuery of SpanTermQuery("the"), SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery("f.x") with a slop of 0.
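As a rough sketch of what a slop of 0 means for such a query, the following plain-Java simulation checks consecutive tokens against a list of clauses. It does not use the Lucene span classes; PhrasePatternSketch, firstMatch, term, regex, and theQkBrownFx are hypothetical names, and the sketch assumes in-order, adjacent matching of whole tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical simulation of an in-order, slop-0 span match over a token list.
class PhrasePatternSketch {
    // Literal clauses are quoted so they behave like exact term matches.
    static Pattern term(String text) {
        return Pattern.compile(Pattern.quote(text));
    }

    static Pattern regex(String expr) {
        return Pattern.compile(expr);
    }

    // Clauses for the example phrase "the q.*k brown f.x".
    static List<Pattern> theQkBrownFx() {
        return Arrays.asList(term("the"), regex("q.*k"), term("brown"), regex("f.x"));
    }

    // Returns the start position of the first run of adjacent tokens where the
    // i-th token fully matches the i-th clause, or -1 if there is no such run.
    static int firstMatch(List<String> tokens, List<Pattern> clauses) {
        for (int start = 0; start + clauses.size() <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < clauses.size() && ok; i++) {
                ok = clauses.get(i).matcher(tokens.get(start + i)).matches();
            }
            if (ok) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("see", "the", "quack", "brown", "fix");
        // prints 1
        System.out.println(firstMatch(doc, theQkBrownFx()));
    }
}
```

Any extra token between clauses (for example "the quick red brown fox") breaks the match, which is the effect of slop 0.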

The code is fairly minimal thanks to the wonderful infrastructure already provided. I'm ready to contribute it to Lucene. The question is, where? Should this be part of the core? Or should it reside in a contrib area? If in contrib, shall it be a new area called "regex" perhaps, or "regex-query"?

I'm inclined to put it in the core, so if I don't hear otherwise I'll start with it there.

The main negative to this query, just like with WildcardQuery and FuzzyQuery, is the possible performance issue. However, just like WildcardQuery, this really depends on how clever the indexing side of things is and on matching that cleverness with an appropriate regex. My actual use of these queries involves doing overlapped rotated term indexing and also rotating the query term to have the best possible prefix for term enumeration. Naive use of this query using ".*foo" will of course have the same impact as WildcardQuery using *foo, and will perhaps be slightly slower with regex matching involved.
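One common form of rotated indexing is the permuterm technique, sketched below in plain Java. Whether it corresponds to the "overlapped rotated term indexing" mentioned above is an assumption, and the names RotatedTermSketch, rotations, rotatedPrefix, buildIndex, and lookup are hypothetical. The sketch handles only queries with exactly one '*' and assumes terms never contain the '$' end marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical permuterm-style sketch; not the indexing scheme from the post.
class RotatedTermSketch {
    static final char END = '$'; // end-of-term marker, assumed absent from terms

    // All rotations of term + '$', e.g. "foo" -> foo$, oo$f, o$fo, $foo.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rotates a single-'*' query so the wildcard lands at the end, and returns
    // the literal prefix before it, e.g. "*foo" -> "foo$".
    static String rotatedPrefix(String wildcardQuery) {
        int star = wildcardQuery.indexOf('*');
        if (star < 0 || star != wildcardQuery.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*'");
        }
        String marked = wildcardQuery + END;
        return marked.substring(star + 1) + marked.substring(0, star);
    }

    // Maps every rotation back to its original term, sorted by rotation.
    static TreeMap<String, String> buildIndex(List<String> terms) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String term : terms) {
            for (String rotation : rotations(term)) {
                index.put(rotation, term);
            }
        }
        return index;
    }

    // Visits only the keys that start with the rotated prefix.
    static Set<String> lookup(TreeMap<String, String> index, String wildcardQuery) {
        String prefix = rotatedPrefix(wildcardQuery);
        String upperBound = prefix + Character.MAX_VALUE;
        return new TreeSet<>(index.subMap(prefix, upperBound).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = buildIndex(Arrays.asList("foobar", "barfoo", "foo"));
        // prints [barfoo, foo]
        System.out.println(lookup(index, "*foo"));
    }
}
```

The trade-off is index size: each term is stored once per rotation, in exchange for turning a leading-wildcard scan into a prefix range scan.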

Overall, I think it is a good addition and will allow users to be more expressive than the lower-level MultiPhraseQuery (aka PhrasePrefixQuery).


Thoughts?

    Erik

