On Oct 13, 2005, at 7:36 AM, Mikko Noromaa wrote:
Hi,
It would be possible to do a PatternQuery("*") that would
enumerate every term.
Does this work differently than the current logic where wildcard queries
are constructed as BooleanQueries with many terms OR'ed together? I think
this would be a good change.
No - it works identically to WildcardQuery, with the only difference
being how it matches. The added bonus though is that there is a
SpanPatternQuery to go along with this, allowing for "foo bar*"
phrase queries.
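
To make the "works identically, only the matching differs" point concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class PatternExpansionSketch, the method expand, and the sample terms are all made up for illustration, and java.util.regex stands in for whatever regex engine the query uses. It walks every term and keeps the ones the pattern matches; each kept term corresponds to one OR'ed clause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PatternExpansionSketch {
    // Hypothetical helper: scan every term and keep those the regex matches in full.
    static List<String> expand(List<String> indexTerms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> clauses = new ArrayList<>();
        for (String term : indexTerms) {
            if (pattern.matcher(term).matches()) {
                clauses.add(term); // each match would become one OR'ed clause
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Made-up sample terms standing in for an index's term list.
        List<String> terms = Arrays.asList("bar", "barn", "baz", "foo");
        System.out.println(expand(terms, "bar.*")); // [bar, barn]
        System.out.println(expand(terms, ".*"));    // [bar, barn, baz, foo]
    }
}
```

The match-everything pattern returns every term, which is why the number of clauses can grow as large as the term list itself.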
I have always thought that it is quite cumbersome to expand wildcards to
many boolean clauses. I think that keeping the wildcard (or regex in this
case) in the query object would be much better. On the other hand, it
might not make any difference in performance, since Lucene would still
have to go through all the terms. But at least it would avoid the
BooleanQuery$TooManyClauses exception even with thousands of different
terms. Right?
At this point, the possibility of that exception still exists, so
increasing the maximum number of clauses is necessary to avoid it.
I know I can increase the limit of the boolean queries, but there is
still a limit. In my application, I index Finnish text which has lots of
different suffixes for the same word. With compound words included, I
could easily imagine that the same base word may have hundreds or
thousands of terms in the index.
Hundreds is still under the built-in limit of 1024 clauses for
BooleanQuery. Thousands is doable by increasing the limit and
having sufficient RAM.
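
A rough model of how the cap behaves, again in plain Java rather than Lucene itself. ClauseLimitSketch, its maxClauseCount field, its TooManyClauses class, and the generated terms are hypothetical stand-ins; only the 1024 default comes from the discussion above. In the Lucene releases I'm aware of from this period the real cap is raised with the static BooleanQuery.setMaxClauseCount(int), but check the javadoc for your version.

```java
import java.util.ArrayList;
import java.util.List;

public class ClauseLimitSketch {
    // Hypothetical stand-in for the clause cap; 1024 is the default mentioned above.
    static int maxClauseCount = 1024;

    // Hypothetical stand-in for BooleanQuery$TooManyClauses.
    static class TooManyClauses extends RuntimeException {}

    // OR the matching terms together, refusing to grow past the cap.
    static List<String> orTogether(List<String> matchingTerms) {
        List<String> clauses = new ArrayList<>();
        for (String term : matchingTerms) {
            if (clauses.size() >= maxClauseCount) {
                throw new TooManyClauses();
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Generate n made-up terms so the sketch has something to expand.
    static List<String> fakeTerms(int n) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            terms.add("term" + i);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(orTogether(fakeTerms(500)).size()); // hundreds: fine at the default
        try {
            orTogether(fakeTerms(3000)); // thousands: exceeds the default cap
        } catch (TooManyClauses e) {
            System.out.println("too many clauses at the default limit");
        }
        maxClauseCount = 4096; // raise the cap, at the cost of more memory per query
        System.out.println(orTogether(fakeTerms(3000)).size());
    }
}
```

Raising the cap only moves the ceiling: every clause is still held in memory, which is where the RAM requirement comes from.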
For suffix-wildcards, there really is no difference between my
PatternQuery and WildcardQuery. WildcardQuery may even be faster if
its matching is quicker than regex matching, though tests would need
to be performed to confirm that; I'd guess the performance difference
isn't all that much.
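
As a small illustration of why the two select the same terms for a trailing wildcard, here is a sketch in plain Java. It does not reflect how WildcardQuery or PatternQuery are implemented; TrailingWildcardSketch, both helper methods, and the sample Finnish terms are assumptions made for the example. A pattern like "talo*" reduces to a prefix test, and the equivalent regex is the quoted prefix followed by ".*".

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TrailingWildcardSketch {
    // A trailing wildcard "prefix*" reduces to a simple prefix test.
    static boolean wildcardMatches(String prefix, String term) {
        return term.startsWith(prefix);
    }

    // The equivalent regex: the literal prefix followed by any characters.
    static boolean regexMatches(String prefix, String term) {
        Pattern pattern = Pattern.compile(Pattern.quote(prefix) + ".*");
        return pattern.matcher(term).matches();
    }

    public static void main(String[] args) {
        // Made-up sample terms: inflected forms of "talo" plus an unrelated word.
        List<String> terms = Arrays.asList("talo", "talon", "talossa", "talvi");
        for (String term : terms) {
            System.out.println(term
                    + " wildcard=" + wildcardMatches("talo", term)
                    + " regex=" + regexMatches("talo", term));
        }
    }
}
```

Both checks agree on every term, so any difference between the two query types here would come down to the cost of the per-term match, not to which terms are selected.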
Erik