Re: How to retrieve distinct field matches?

Erik Hatcher Fri, 16 Dec 2005 00:56:27 -0800

This is pretty much the same problem that many of us have faced when it comes to faceted browsing. I'm using a set of cached BitSets that represent the documents that have a specific category (or general "facet" in my case). I do a full-text search for "some query expression", using QueryFilter to get the BitSet for the query. Then I AND the hits BitSet with each of the facet BitSets, and the cardinality of each intersection gives me the number of documents in each category that match the query. I load up these BitSets when my search server is launched. In my case I'm currently dealing with about 30k documents, with maybe 100 unique facet values, and these load in the blink of an eye.

I realize the above description was void of code specifics, but the gist is there. Hope it helps.
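
For readers who want something concrete, the counting step described above can be sketched with plain java.util.BitSet. This is only an illustration under assumptions, not the code from the search server in question: the class name FacetCounts, the bits() helper, and the sample document ids are all hypothetical, and in a real setup both the query BitSet and the cached facet BitSets would be built from the Lucene index (the query side via QueryFilter, as mentioned) rather than typed in by hand.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; not part of Lucene.
class FacetCounts {

    /** For each facet value, counts the documents that are also in the query's hit set. */
    static Map<String, Integer> count(BitSet queryHits, Map<String, BitSet> facetBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : facetBits.entrySet()) {
            // Clone first: and() modifies its receiver, and the cached set must stay intact.
            BitSet both = (BitSet) entry.getValue().clone();
            both.and(queryHits);
            counts.put(entry.getKey(), both.cardinality());
        }
        return counts;
    }

    /** Builds a BitSet with the given document ids set (stand-in for loading from an index). */
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        // Assumed sample data: one cached BitSet per facet value.
        Map<String, BitSet> facets = new LinkedHashMap<>();
        facets.put("fiction", bits(0, 1, 2, 5));
        facets.put("non-fiction", bits(3, 4, 6));

        // Assumed sample data: documents matching the full-text query.
        BitSet hits = bits(1, 2, 3, 7);

        System.out.println(count(hits, facets)); // prints {fiction=2, non-fiction=1}
    }
}
```

The clone keeps the cached facet sets reusable across queries; the cost per query is one AND and one cardinality() call per facet value.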


        Erik



On Dec 15, 2005, at 8:16 PM, Mr Plate wrote:

This puzzle has been bugging me for a while; I'm hoping there's an elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other fields, each of these Documents has 0 or more "category" field values. There are over 5,500 such categories (it's not a small set). Anywhere from 1 to 500+ Documents could belong to a single "category". This index does not get updated very often; anywhere from once a day to once a month. Indexing time is currently 15-30 minutes from start to finish/optimization.
PROBLEM:
I'd like to provide users a way to search these "category" values. For example, suppose the user searches for "fiction". They might see results of: { "fiction", "non-fiction" }. However, I'd like to do this search as quickly and efficiently as reasonable. For example, if there are 500 Documents of category "fiction", and 400 of "non-fiction", I don't want to Sort and iterate through each Hit to weed out the duplicate values from my query.
For what it's worth, I imagine only 0-20 categories would match a given query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for each of these category types. Each Document in this separate index would probably have fields of "field_name", and "field_value", and would not contain any duplicates. For example, you might see a Document of field_name "category" and field_value "non-fiction". My query would hit this second index instead, to perform these metadata searches.
I hope that makes sense; do you know of a more elegant way to handle this type of problem?
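
The separate-index idea above can be illustrated without Lucene at all. The sketch below is a rough in-memory stand-in, not a Lucene implementation: the class name DistinctCategoryLookup, its methods, and the sample data are hypothetical, and the matching rule (split each value on non-alphanumeric characters, then compare lowercase tokens) is only an assumption about how an analyzer might tokenize "non-fiction".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the proposed second index.
class DistinctCategoryLookup {

    /** Collapses every document's category values into one duplicate-free, sorted set. */
    static SortedSet<String> distinctValues(List<List<String>> perDocumentCategories) {
        SortedSet<String> values = new TreeSet<>();
        for (List<String> categories : perDocumentCategories) {
            values.addAll(categories);
        }
        return values;
    }

    /** Returns the distinct values that contain the search term as a whole token. */
    static List<String> search(SortedSet<String> values, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String value : values) {
            // Assumed tokenization: split on anything that is not a letter or digit.
            String[] tokens = value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            if (Arrays.asList(tokens).contains(wanted)) {
                matches.add(value);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample data: each inner list holds one document's 0 or more categories.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("fiction"),
                Arrays.asList("fiction", "history"),
                Arrays.asList("non-fiction"),
                Collections.<String>emptyList(),
                Arrays.asList("non-fiction", "history"));

        SortedSet<String> values = distinctValues(docs);
        System.out.println(search(values, "fiction")); // prints [fiction, non-fiction]
    }
}
```

Because the lookup runs over distinct values only, its cost depends on the number of categories (about 5,500 here) rather than on how many Documents carry each one.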
Thanks,

Tyler

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

