Geoffrey,

You've done quite a thorough analysis of Lucene. I'll reply below with a few tidbits of Lucene trivia in hopes that will help....

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
One of our applications is a catalog search
application.    In this application our documents are
catalog items.   Each item has a number of
fields/attributes associated with it.   For example
Supplier, Part number, Price, Description.   We use a
search metaphor where end-users iterate issuing
queries and getting feedback about what's available.
So initially we may tell them that 600,000 items are
available from  95 suppliers, and who those suppliers
are.   They may choose to do a free text search for
the phrase "blue pen".  The result of that query may
be to tell them that there's 240 items available from
2 suppliers which match that phrase, and who those
suppliers are.  They may pick one of the suppliers to
see the list of "blue pens" available from that
supplier.

To accomplish "search within search", or "search refinement", using a QueryFilter will do very nicely.


In addition to wanting the set of attribute values
found in the result documents we would also want to
return counts of the number of documents each
attribute value occurs in in the result document set.

Again, I think a QueryFilter can work well. There are surely several ways to go about getting the number of documents in each bucket - perhaps additional queries should be made to give you those numbers, or perhaps walking the returned documents to get the unique values. Walking the documents could be expensive performance-wise though. Doing some sub-queries would be quite fast though.


Efficient range queries.

application) it's important to have some support for
this.    The trick here is that the criteria may be
very open ended.   For example all items with price
greater than $10 might involve tens of thousands of
prices.

One suggestion I've seen posted is during indexing to use an additional field as a "group". In this case, it would be a price range group. Say "A" means $0 - $10, "B" for $10 - $100, "C" for $100+, for example. Then you would only have a few terms in that field and a query would be quite fast. The drawback is that you need to know at index-time what the groups are.


A custom range Filter is another option - and could be created at runtime and kept around and only recreated when the index is modified. Look at the built-in DateFilter for an example to work with. This is a more pleasant option than doing a RangeQuery when the number of terms in the range is large.


Order by attributes.

We need the ability to order the document results set
by a pre-defined set of numeric attributes and would
like the ability to order on alphabetic attributes as
well.

This is an area where Lucene falls short. My best suggestion is to do the sorting yourself, which would require getting at all the documents in Hits, which for a large collection would be unreasonable. There are tricks that can be played with boosting during indexing where you can tier the boosts of a field in order - but this is really only a hint to the scorer to factor the order into the equation but there are many other factors.


I'm afraid there is no easy solution here, that I'm aware of.

I have resources for code development and consider it
to be in Ariba's best interest to contribute any code
that we write in this area with the entire community.
Our time frame is to develop a proto-type in the next
couple of months for proof of concept and
benchmarking.

Excellent! We hope that we can get Lucene under the covers of your products - please continue to post to us with more questions and hopefully eventually code improvements!


Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to