Hi Vijay,
I have hit the same problem in the past and evaluated a couple of techniques for solving it.
1. Using the QueryFilter
The idea is to
   a) create BitSets for each category once initially
   b) run the search and extract the BitSet for the search results
   c) logically "AND" the result set with each category set
   d) take the cardinality of each resulting set and display it as that category's count
This worked fine in my scenario but did not scale: performance decreased as the number of categories grew, because the "AND"ing runs in a loop over every category.
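
Steps a) to d) can be sketched in plain Java with java.util.BitSet. This is only an illustration, not the exact code I used: the class name BitSetCategoryCounts and the method count are made up for this example, and the bit sets are filled by hand. In a real index they would presumably come from something like QueryFilter.bits(IndexReader), one per category plus one for the user's query.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, named for this example only.
class BitSetCategoryCounts {

    // categoryBits: category name -> bit set of the documents in that category (built once).
    // resultBits:   bit set of the documents matching the user's query.
    static Map<String, Integer> count(Map<String, BitSet> categoryBits, BitSet resultBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : categoryBits.entrySet()) {
            // Clone first so the cached category set is not modified by and().
            BitSet both = (BitSet) entry.getValue().clone();
            both.and(resultBits);
            counts.put(entry.getKey(), both.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, BitSet> categories = new LinkedHashMap<>();
        BitSet books = new BitSet();
        books.set(0); books.set(1); books.set(3);
        BitSet music = new BitSet();
        music.set(2); music.set(3);
        categories.put("books", books);
        categories.put("music", music);

        BitSet results = new BitSet();
        results.set(1); results.set(2); results.set(3);

        System.out.println(count(categories, results));
    }
}
```

The loop does one clone and one and() per category for every search, which is where the cost grows with the number of categories.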

2. Override the collect method of the HitCollector.
This method is called by Lucene for every document in the search results.
The idea is to:
   a) override the method so that it maintains a HashMap from category to hit count (a HashMap works fine for me)
   b) increment the count for each category as it is encountered in the search results
   c) start with an empty HashMap and add new categories to it as they are encountered
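
A rough sketch of the counting logic, in plain Java so that it runs without Lucene on the classpath. The class CategoryCountingCollector and the categoriesByDoc lookup are hypothetical names for this example; the collect(int doc, float score) method mirrors the shape of HitCollector.collect, and in a real collector the categories would have to be read from the index (or from a cache of the category field) rather than from a hand-built map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a HitCollector subclass.
class CategoryCountingCollector {

    // Assumed lookup from document id to its category names.
    private final Map<Integer, List<String>> categoriesByDoc;

    // Starts empty; categories are added the first time they are seen.
    private final Map<String, Integer> counts = new HashMap<>();

    CategoryCountingCollector(Map<Integer, List<String>> categoriesByDoc) {
        this.categoriesByDoc = categoriesByDoc;
    }

    // Called once per matching document; the score is not needed for counting.
    void collect(int doc, float score) {
        List<String> categories = categoriesByDoc.get(doc);
        if (categories == null) {
            return; // document has no known categories
        }
        for (String category : categories) {
            counts.merge(category, 1, Integer::sum);
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> lookup = new HashMap<>();
        lookup.put(0, Arrays.asList("books"));
        lookup.put(1, Arrays.asList("books", "music"));
        lookup.put(2, Arrays.asList("music"));

        CategoryCountingCollector collector = new CategoryCountingCollector(lookup);
        collector.collect(1, 1.0f);
        collector.collect(2, 0.5f);
        System.out.println(collector.counts());
    }
}
```

The work here is proportional to the number of hits rather than the number of categories, which is why it behaved better for me than the first approach.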

I am currently using the second method and it works.

Hope this helps.

Regards,
kapilChhabra


Vijay Santhanam wrote:
Hi Lucene Users!

I've been playing around with dotLucene on a few projects for about 4
months, and I've found Lucene to be exceptionally powerful, speedy and,
thanks to LIA, really easy to use.
But I've hit a problem that I fear will pose a performance problem for our
architecture and Lucene installation.

We have an index of about 100,000 documents with about 30 fields, built from
our database.

Each document in the index contains a TOKENIZED string field of Category
Names, so that each document can belong to many categories.

We have a new requirement to not only allow searches across the whole index,
but to return the number of documents in each of the (150) possible
categories. This is like in an Amazon search
(http://amazon.com/s/ref=nb_ss_gw/105-0072880-3737226?url=search-alias%3Daps&field-keywords=diamond&Go.x=0&Go.y=0&Go=Go),
where a category list is presented on the left with the number of results
in each category.

So far, I can think of two possible ways to implement this:

1.      Create a QueryFilter for the user-entered query, and perform a
category field search for each category.
2.      Create a separate index for each category, and sequentially (or
concurrently) search across all the indexes.

Does anyone know which solution is better?
Both solutions seem taxing to me because they both involve "number of
categories + 1" searches.

Regards,

 -V



