You've done quite a thorough analysis of Lucene. I'll reply below with a few tidbits of Lucene trivia in the hope that they help.
On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
One of our applications is a catalog search application. In this application our documents are catalog items. Each item has a number of fields/attributes associated with it, for example Supplier, Part number, Price, Description. We use a search metaphor where end-users iterate, issuing queries and getting feedback about what's available. So initially we may tell them that 600,000 items are available from 95 suppliers, and who those suppliers are. They may choose to do a free text search for the phrase "blue pen". The result of that query may be to tell them that there are 240 items available from 2 suppliers which match that phrase, and who those suppliers are. They may pick one of the suppliers to see the list of "blue pens" available from that supplier.
To accomplish "search within search", or "search refinement", using a QueryFilter will do very nicely.
In addition to wanting the set of attribute values found in the result documents, we would also want to return, for each attribute value, the number of documents in the result set in which it occurs.
Again, I think a QueryFilter can work well. There are surely several ways to get the number of documents in each bucket: you could run additional queries to get those numbers, or walk the returned documents to collect the unique values. Walking the documents could be expensive performance-wise, though, whereas a handful of sub-queries should be quite fast.
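To make the counting idea a bit more concrete, here is a rough sketch in plain Java with no Lucene calls at all. It assumes you already have one java.util.BitSet for the documents matching the user's query and one BitSet per attribute value (say, per supplier); how you obtain those from your index is up to you. The class and method names (FacetCounts, count) are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts how many result documents
// carry each attribute value by intersecting bit sets.
class FacetCounts {

    // resultDocs: one bit per document id, set if the document matched the query.
    // docsByValue: for each attribute value (e.g. a supplier name), the bit set
    // of documents carrying that value. Both are assumed to be precomputed.
    static Map<String, Integer> count(BitSet resultDocs, Map<String, BitSet> docsByValue) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : docsByValue.entrySet()) {
            BitSet both = (BitSet) resultDocs.clone(); // leave the input untouched
            both.and(entry.getValue());                // documents in both sets
            int n = both.cardinality();
            if (n > 0) {
                counts.put(entry.getKey(), n);         // skip values with no hits
            }
        }
        return counts;
    }
}
```

The cost is one bit-set intersection per attribute value, so it depends on the number of suppliers rather than on the number of hits, which is why it tends to beat walking every returned document.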
Efficient range queries.
application) it's important to have some support for this. The trick here is that the criteria may be very open ended. For example all items with price greater than $10 might involve tens of thousands of prices.
One suggestion I've seen posted is to add an extra field at indexing time that acts as a "group". In this case, it would be a price range group. Say "A" means $0 - $10, "B" means $10 - $100, and "C" means $100+, for example. Then you would only have a few terms in that field and a query would be quite fast. The drawback is that you need to know at index-time what the groups are.
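As a sketch of what computing that group value at index time could look like (plain Java, hypothetical names): I'm assuming prices are held as whole cents to avoid floating-point surprises, and that each boundary belongs to the higher group, so exactly $10.00 lands in "B". Neither detail comes from the suggestion itself.

```java
// Hypothetical helper, not part of Lucene: maps a price to the coarse
// group letter that would be stored in the extra "group" field.
class PriceGroup {

    // Assumption: half-open ranges, A = [$0, $10), B = [$10, $100), C = $100 and up.
    static String groupFor(long priceInCents) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("negative price: " + priceInCents);
        }
        if (priceInCents < 10_00) {
            return "A";
        }
        if (priceInCents < 100_00) {
            return "B";
        }
        return "C";
    }
}
```

A query for "price over $100" then becomes a single-term match on the group field for "C". A threshold that falls inside a group (say $50) still needs finer handling for that one group.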
A custom range Filter is another option. It could be created at runtime, kept around, and recreated only when the index is modified. Look at the built-in DateFilter for an example to work from. This is a more pleasant option than doing a RangeQuery when the number of terms in the range is large.
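The "build once, keep around, rebuild on change" part can be sketched without any Lucene classes. Below, a long[] of per-document prices stands in for the index and a version number stands in for "the index was modified"; both are assumptions of mine, as are the names. A real Filter would fill the bit set from the index rather than from an array.

```java
import java.util.BitSet;

// Hypothetical stand-in for a cached range filter, not Lucene's Filter API.
class CachedPriceRange {
    private final long minCents;
    private final long maxCents;
    private BitSet cached;            // bits from the last computation
    private long cachedVersion = -1;  // index version those bits belong to

    CachedPriceRange(long minCents, long maxCents) {
        this.minCents = minCents;
        this.maxCents = maxCents;
    }

    // pricesByDoc[i] is the price of document i; indexVersion must change
    // whenever the underlying data changes. Callers should treat the returned
    // bit set as read-only because it is shared between calls.
    BitSet bits(long[] pricesByDoc, long indexVersion) {
        if (cached == null || cachedVersion != indexVersion) {
            BitSet fresh = new BitSet(pricesByDoc.length);
            for (int doc = 0; doc < pricesByDoc.length; doc++) {
                long price = pricesByDoc[doc];
                if (price >= minCents && price <= maxCents) {
                    fresh.set(doc); // inclusive on both ends
                }
            }
            cached = fresh;
            cachedVersion = indexVersion;
        }
        return cached;
    }
}
```

The open-ended "price greater than $10" case pays the scan once per index version; after that, every search reuses the same bits no matter how many distinct prices fall in the range.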
Order by attributes.
We need the ability to order the document results set by a pre-defined set of numeric attributes and would like the ability to order on alphabetic attributes as well.
This is an area where Lucene falls short. My best suggestion is to do the sorting yourself, but that requires getting at all the documents in Hits, which would be unreasonable for a large collection. There are tricks that can be played with boosting during indexing, where you tier the boosts of a field in the desired order, but a boost is really only a hint to the scorer. Many other factors go into the score, so the order is not guaranteed.
I'm afraid there is no easy solution here, that I'm aware of.
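For what it's worth, "sorting yourself" could look roughly like the sketch below once you have the matching document ids in hand. It is plain Java with made-up names, and it assumes the numeric attribute (price in cents here) is available in an array indexed by document id. It does nothing about the real cost, which is collecting all the ids in the first place.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: orders matching document ids
// by a numeric attribute looked up per document.
class SortByAttribute {

    // matchingDocs: ids of the documents that matched the query.
    // pricesByDoc[id]: the attribute value for that document (assumed precomputed).
    // Returns a new array sorted by ascending price, ties broken by document id.
    static int[] sortByPrice(int[] matchingDocs, long[] pricesByDoc) {
        Integer[] boxed = new Integer[matchingDocs.length];
        for (int i = 0; i < matchingDocs.length; i++) {
            boxed[i] = matchingDocs[i];
        }
        Arrays.sort(boxed, (a, b) -> {
            int byPrice = Long.compare(pricesByDoc[a], pricesByDoc[b]);
            return byPrice != 0 ? byPrice : Integer.compare(a, b);
        });
        int[] sorted = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            sorted[i] = boxed[i];
        }
        return sorted;
    }
}
```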
I have resources for code development and consider it to be in Ariba's best interest to share any code that we write in this area with the entire community. Our time frame is to develop a prototype in the next couple of months for proof of concept and benchmarking.
Excellent! We hope that we can get Lucene under the covers of your products. Please keep posting your questions to the list and, hopefully, code improvements eventually!
Erik
