This is pretty much the same problem many of us have faced with
faceted browsing. I'm using a set of cached BitSets that represent
the documents that have a specific category (or general "facet" in
my case). I do a full-text search for "some query expression", using
QueryFilter to get the BitSet for the query. Then I AND the query's
BitSet with each of the facet BitSets, and the cardinality of each
result gives me the number of documents in that category that match
the query. I load up these BitSets when my search server is
launched. In my case I'm currently dealing with about 30k documents,
with maybe 100 unique facet values, and these load in the blink of an
eye.
I realize the above description was devoid of code specifics, but the
gist is there. Hope it helps.
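
For anyone who wants the counting step spelled out, here is a minimal
sketch using only java.util.BitSet. It is an illustration under
assumptions, not the code from my search server: the class name
FacetCounts and the method countFacets are made up for this example,
and the query BitSet is passed in by the caller (in Lucene it would
come from the QueryFilter for the query).

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, named for illustration only.
final class FacetCounts {

    // For each facet, count the documents that are both in the facet
    // and matched by the query. Bit i set means "document i is a member".
    static Map<String, Integer> countFacets(BitSet queryBits,
                                            Map<String, BitSet> facetBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : facetBits.entrySet()) {
            // BitSet.and() modifies the receiver, so work on a copy
            // to leave the cached facet BitSet untouched.
            BitSet matches = (BitSet) entry.getValue().clone();
            matches.and(queryBits);
            counts.put(entry.getKey(), matches.cardinality());
        }
        return counts;
    }
}
```

The clone matters: and() changes the BitSet it is called on, so ANDing
directly into a cached facet BitSet would corrupt it for the next query.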
Erik
On Dec 15, 2005, at 8:16 PM, Mr Plate wrote:
This puzzle has been bugging me for a while; I'm hoping there's an
elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other
fields, each of these Documents has 0 or more "category" field
values. There are over 5,500 such categories (it's not a small
set). Anywhere from 1 to 500+ Documents could belong to a single
"category". This index does not get updated very often; anywhere
from once a day to once a month. Indexing time is currently 15-30
minutes from start to finish/optimization.
PROBLEM:
I'd like to provide users a way to search these "category" values.
For example, suppose the user searches for "fiction". They might
see results of: { "fiction", "non-fiction" }. However, I'd like to
do this search as quickly and efficiently as reasonable. For
example, if there are 500 Documents of category "fiction", and 400
of "non-fiction", I don't want to Sort and iterate through each Hit
to weed out the duplicate values from my query.
For what it's worth, I imagine only 0-20 categories would match a
given query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for
each of these category types. Each Document in this separate index
would probably have fields of "field_name", and "field_value", and
would not contain any duplicates. For example, you might see a
Document of field_name "category" and field_value "non-fiction". My
query would hit this second index instead, to perform these
metadata searches.
I hope that makes sense; do you know of a more elegant way to
handle this type of problem?
Thanks,
Tyler
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]