I've developed normal and span-based Query implementations that use regex to match index terms rather than the simpler WildcardQuery syntax. This allows queries like "abc[0-9]xyz", which matches abc1xyz but not abc12xyz, for example.
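
To illustrate only the matching semantics (the whole term must match the pattern, not just a substring of it), here is a minimal sketch using java.util.regex over a plain list of terms. It is not the Query code itself; the class and method names are hypothetical, and a real implementation would enumerate terms from the index rather than from a List.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not part of Lucene.
class RegexTermFilter {
    // Returns the terms that the regex matches in full, in input order.
    static List<String> matchingTerms(String regex, List<String> terms) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            // matches() requires the entire term to match, unlike find().
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With the pattern "abc[0-9]xyz", a term list containing abc1xyz and abc12xyz yields only abc1xyz, since the character class consumes exactly one digit.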

I've seen a lot of interest lately in being able to do a phrase query with nested wildcard terms inside, such as "the q.*k brown f.x". I turn a query like that into a SpanNearQuery of SpanTermQuery("the"), SpanPatternQuery("q.*k"), SpanTermQuery("brown"), and SpanPatternQuery("f.x") with a slop of 0.
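
As a rough picture of what a slop of 0 means here, the sketch below checks a token sequence for consecutive positions where each clause matches its token. It uses only java.util.regex and does not touch the span classes; the class and method names are hypothetical, and for simplicity every clause is compiled as a regex (a literal word such as "the" simply matches itself).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not part of Lucene.
class PatternPhraseSketch {
    // Returns the first token position at which all clauses match
    // consecutive tokens in order (no gaps allowed), or -1 if none does.
    static int firstMatch(List<String> tokens, List<String> clauses) {
        List<Pattern> patterns = new ArrayList<>();
        for (String clause : clauses) {
            patterns.add(Pattern.compile(clause));
        }
        for (int start = 0; start + patterns.size() <= tokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < patterns.size(); i++) {
                // Each clause must match the whole token at its position.
                if (!patterns.get(i).matcher(tokens.get(start + i)).matches()) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return start;
            }
        }
        return -1;
    }
}
```

For the clauses "the", "q.*k", "brown", "f.x", the tokens "the quick brown fox" match at position 0, while "the quick red brown fox" does not match because "red" breaks the adjacency.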

The code is fairly minimal thanks to the wonderful infrastructure already provided. I'm ready to contribute it to Lucene. The question is, where? Should this be part of the core? Or should it reside in a contrib area? If in contrib, shall it be a new area called "regex" perhaps, or "regex-query"?

I'm inclined to put it in the core, so if I don't hear otherwise I'll start with it there.

The main drawback of this query, as with WildcardQuery and FuzzyQuery, is the potential performance cost. However, as with WildcardQuery, that depends on how clever the indexing side is and on pairing that cleverness with an appropriate regex. My actual use of these queries involves overlapped, rotated term indexing, along with rotating the query term to get the best possible prefix for term enumeration. Naive use with a pattern such as ".*foo" will of course have the same impact as WildcardQuery with *foo, and may be slightly slower because of the regex matching involved.
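
To show why the prefix matters, here is a minimal sketch that uses a TreeSet as a stand-in for the sorted term dictionary. It covers only the prefix part, not the rotated indexing. The class, its methods, and the deliberately conservative prefix extraction are all assumptions made for illustration: the literal prefix stops at the first regex metacharacter, drops the preceding character when that metacharacter is an optional quantifier, and is treated as empty when the pattern contains alternation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only: not part of Lucene.
class PrefixedRegexScan {
    private static final String META = ".[]{}()*+?^$|\\";

    // Leading literal characters that every matching term must start with.
    static String literalPrefix(String regex) {
        if (regex.indexOf('|') >= 0) {
            return ""; // alternation: no single prefix is safe to assume
        }
        int i = 0;
        while (i < regex.length() && META.indexOf(regex.charAt(i)) < 0) {
            i++;
        }
        // "ab*c" can match "ac", so the char before *, ? or { is optional.
        if (i > 0 && i < regex.length() && "*?{".indexOf(regex.charAt(i)) >= 0) {
            i--;
        }
        return regex.substring(0, i);
    }

    // Terms that must be examined: the contiguous run sharing the prefix.
    static List<String> candidates(TreeSet<String> terms, String regex) {
        String prefix = literalPrefix(regex);
        List<String> out = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can share the prefix
            }
            out.add(term);
        }
        return out;
    }

    // Runs the regex only against the candidate terms.
    static List<String> scan(TreeSet<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String term : candidates(terms, regex)) {
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With "abc[0-9]xyz" only the terms starting with "abc" are examined, whereas ".*foo" has an empty literal prefix and so every term in the set is tested against the regex.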

Overall, I think it is a good addition and will allow users to be more expressive than the lower-level MultiPhraseQuery (aka PhrasePrefixQuery).

Thoughts?

    Erik

