On Jul 13, 2005, at 6:21 AM, MariLuz Elola wrote:
I have been reading about "Too many clauses"... If the
max were set too high, the inefficiency would make the search unusable.
I am testing the performance of Lucene, and the time Lucene spends
searching is too high. Moreover, I have gotten OutOfMemoryError
several times.
I am talking about an index with roughly 250,000 documents,
but in the future an index with millions of documents will be
necessary.
These are the kinds of queries:
1. Greater than or lower than request
RangeQuery with Integer.MAX_VALUE for greater than or
Integer.MIN_VALUE for lower than
2. RangeQuery
Example:
Field:[minValue TO maxValue]
Keep in mind that dealing with numeric information requires some
adjustments both to how you index and to how RangeQuerys are formed.
For example, if you index "1" through "10", a RangeQuery of [1 TO 5]
will also find "10", because terms are compared as strings, unless
you account for it with a special QueryParser subclass.
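As a rough illustration of the string-ordering issue, here is a
plain-Java sketch that uses no Lucene classes at all. The class name,
the helper methods, and the width of 10 digits are all assumptions
made for the example; the idea is only that zero-padded terms sort the
same way as the numbers they represent, provided the query bounds are
padded the same way as the indexed values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: models term comparison with plain strings, no Lucene involved.
class PaddedNumbers {
    // Hypothetical fixed width; 10 digits covers every non-negative int.
    static final int WIDTH = 10;

    // Left-pad a non-negative number with zeros so string order matches numeric order.
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    // Mimics an inclusive term range: keeps terms that sort between lower and upper as strings.
    static List<String> inRange(List<String> terms, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>();
        List<String> padded = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            raw.add(Integer.toString(i));
            padded.add(pad(i));
        }
        // Unpadded: "10" sorts between "1" and "5", so it is matched as well.
        System.out.println(inRange(raw, "1", "5"));
        // Padded: only the five expected values match.
        System.out.println(inRange(padded, pad(1), pad(5)));
    }
}
```

The same padding would have to be applied both when the field is
indexed and when the range bounds are built, which is where a
QueryParser subclass could come in.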
3. WildcardQuery
Example:
Field:value*
etc.
The problem is that PrefixQuery, WildcardQuery, RangeQuery, and
FuzzyQuery all expand to a series of OR'ed boolean queries, one per
matching term in the index.
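To make the expansion concrete, here is a small plain-Java model of
it. It does not use Lucene; a sorted set of strings stands in for the
terms of one field, and the class, method names, and sample terms are
made up for the example. The limit of 1024 is what I understand
Lucene's default maximum clause count to be, so treat that number as
an assumption.

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustration only: a sorted set of strings stands in for the terms of one field.
class ExpansionSketch {
    // Assumed default clause limit; check your Lucene version before relying on it.
    static final int ASSUMED_MAX_CLAUSES = 1024;

    // Number of OR'ed clauses a prefix expands to: one per distinct term with that prefix.
    // Assumes no term contains the character \uffff right after the prefix.
    static int clausesForPrefix(SortedSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE).size();
    }

    // Number of clauses for an inclusive range: one per distinct term in [lower, upper].
    static int clausesForRange(SortedSet<String> terms, String lower, String upper) {
        // upper + '\0' is the smallest string greater than upper, which makes the bound inclusive.
        return terms.subSet(lower, upper + '\0').size();
    }

    // Hypothetical sample data: "value00000", "value00001", ...
    static SortedSet<String> sampleTerms(int count) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < count; i++) {
            terms.add(String.format(Locale.ROOT, "value%05d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = sampleTerms(5000);
        int broad = clausesForPrefix(terms, "value");
        int narrow = clausesForPrefix(terms, "value0001");
        System.out.println("value*     -> " + broad + " clauses, over limit: "
                + (broad > ASSUMED_MAX_CLAUSES));
        System.out.println("value0001* -> " + narrow + " clauses, over limit: "
                + (narrow > ASSUMED_MAX_CLAUSES));
    }
}
```

The point of the model is that the cost depends on how many distinct
terms fall inside the prefix or range, not on how many documents
match, so fewer distinct terms per field means cheaper queries.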
I have read about BitSetQuery, FilteringQuery,
ConstantScoreQuery... I am confused!
There certainly are lots of options. The Query classes you mention,
though, are not currently exposed via QueryParser, so you would need
to subclass QueryParser to have them created instead, or create your
own parser, or mix and match some query expression parsing and join
it with some API created Querys via BooleanQuery.
I can't use a Filter (DateFilter, QueryFilter, etc.) because the
client wants to search all the documents without filtering on
anything.
This doesn't make sense to me. Implicitly the user is "filtering"
documents by adding constraints to a query expression using
Field:value* or Field:[min TO max].
I can't divide a field into subfields to make the query more specific.
For example, the user wants the date in the format YYYYMMDDHHMMSS, not
6 fields, one with the year, one with the month, one with the day,
one with the hour, etc.
The index structure needs to be a bit more abstracted from the user
in your case, it seems. The user does not need to know explicitly
that the index is split into multiple fields for dates in order to
make searching more efficient. If the user is not doing queries down
to the second level, but rather always at the day level, then you
can build the index to account for that type of usage and improve the
experience.
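As a sketch of that idea, the user can keep typing YYYYMMDDHHMMSS
values while the application derives a coarser term behind the
scenes. The code below is plain Java with no Lucene calls; the class
and method names are hypothetical, and it assumes queries only need
day resolution.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: derives day-level terms from second-level timestamps.
class DateResolution {
    static final DateTimeFormatter SECOND_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Truncates a YYYYMMDDHHMMSS value to its YYYYMMDD day term.
    static String dayTerm(String timestamp) {
        if (timestamp == null || !timestamp.matches("\\d{14}")) {
            throw new IllegalArgumentException("expected 14 digits: " + timestamp);
        }
        return timestamp.substring(0, 8);
    }

    // Hypothetical sample data: one timestamp per hour, starting 2005-07-01 00:00:00.
    static Set<String> hourlyTimestamps(int days) {
        Set<String> timestamps = new TreeSet<>();
        LocalDateTime start = LocalDateTime.of(2005, 7, 1, 0, 0);
        for (int hour = 0; hour < days * 24; hour++) {
            timestamps.add(start.plusHours(hour).format(SECOND_FORMAT));
        }
        return timestamps;
    }

    // Collects the distinct day terms that a set of timestamps collapses to.
    static Set<String> dayTerms(Set<String> timestamps) {
        Set<String> days = new TreeSet<>();
        for (String timestamp : timestamps) {
            days.add(dayTerm(timestamp));
        }
        return days;
    }

    public static void main(String[] args) {
        Set<String> fine = hourlyTimestamps(30);
        Set<String> coarse = dayTerms(fine);
        // Each distinct term in a range becomes one clause, so fewer terms means a cheaper query.
        System.out.println("second-level terms: " + fine.size());
        System.out.println("day-level terms:    " + coarse.size());
    }
}
```

The full timestamp can still be stored for display; only the field
used for range searching needs the coarser form, and the query bounds
the user enters would be truncated the same way before searching.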
I encourage you to reconsider your "can't"'s and investigate
alternative approaches. Such considerations might be - does the user
really need FuzzyQuery? Are WildcardQuerys desired? If so, what
types of wildcard queries are needed? (this can affect how you index
and construct queries - a literal WildcardQuery is not the only way
to achieve the same sort of thing, as has been mentioned, for example
using a PhraseQuery for numeric information) Can the user interface be
crafted to be more structured rather than just a Google-like search
box where the user has to enter field selectors and know QueryParser
voodoo? (perhaps the date field constraint can use a date picker
rather than a textual expression?)
My question is very simple: is it possible to use Lucene as a
full-text search engine with the environment I have explained
before, with the server that I have explained before, and doing the
queries that I have explained before, with efficient performance
and without OutOfMemoryError?
Short answer: yes.
Longer answer: see above for some techniques to consider.
Erik