efficient refinement, order by and range queries

Geoffrey Peddle Mon, 22 Dec 2003 13:17:40 -0800

I'm attempting to make Lucene the document search
solution for the Ariba application suite.  I have
identified functionality gaps in the areas of
refinement, order by and range queries.  Below I
attempt to describe our requirements and begin a
discussion on how to solve them.



Refinement/attribute value collection.
 
One of our applications is a catalog search
application.  In this application our documents are
catalog items.  Each item has a number of
fields/attributes associated with it, for example
Supplier, Part number, Price and Description.  We use a
search metaphor where end-users iterate, issuing
queries and getting feedback about what's available.
So initially we may tell them that 600,000 items are
available from 95 suppliers, and who those suppliers
are.  They may choose to do a free text search for
the phrase "blue pen".  The result of that query may
be to tell them that there are 240 items available from
2 suppliers which match that phrase, and who those
suppliers are.  They may then pick one of the suppliers
to see the list of "blue pens" available from that
supplier.
 
In addition to the set of attribute values found in
the result documents, we also want to return, for each
attribute value, the number of result documents it
occurs in.  To limit memory consumption for attributes
with many values in the result set, we could specify a
limit beyond which new values would not be collected,
for example return at most 200 suppliers which match
the query.
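
A minimal sketch of the capped collection described above,
assuming the attribute value of each hit is already available
as a plain Python value. The function name, the `max_values`
parameter and the `truncated` flag are illustrative and are
not part of Lucene.

```python
from collections import Counter


def collect_refinements(hit_values, max_values=200):
    """Count hits per attribute value, keeping at most max_values distinct values.

    hit_values: iterable with one attribute value per matching document.
    Returns (counts, truncated); truncated is True if any value was dropped.
    """
    counts = Counter()
    truncated = False
    for value in hit_values:
        if value in counts:
            counts[value] += 1          # value already tracked: always counted
        elif len(counts) < max_values:
            counts[value] = 1           # room left for a new distinct value
        else:
            truncated = True            # cap reached: new value is not collected
    return counts, truncated
```

Values admitted before the cap is reached keep exact counts.
The `truncated` flag lets the caller tell the user that the
list of values is incomplete.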
 
Efficient range queries.
 
There are certain attributes of our catalog items which
we want to support range queries on, for example
price.  While it would be unusual for this to be the
primary search criterion (people rarely say "show me
anything that costs more than $10" in our
application), it's important to have some support for
this.  The trick here is that the criteria may be
very open ended.  For example, all items with a price
greater than $10 might involve tens of thousands of
distinct prices.
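
One way to picture a range check that does not depend on the
number of distinct prices is a post-filter over the hits,
assuming each document's price can be looked up by document
id. This is a sketch under that assumption. The list
`prices`, the function name and the inclusive bounds are
illustrative choices and are not Lucene API.

```python
def filter_by_range(hits, prices, low=None, high=None):
    """Keep the hits whose price lies in [low, high].

    hits:   iterable of document ids that matched the main query.
    prices: sequence indexed by document id (assumed per-document payload).
    A bound of None means that side of the range is open ended.
    """
    kept = []
    for doc_id in hits:
        price = prices[doc_id]
        if low is not None and price < low:
            continue
        if high is not None and price > high:
            continue
        kept.append(doc_id)
    return kept
```

The work here grows with the number of hits, so an open-ended
range costs the same as a narrow one.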
 
 
Order by attributes.
 
We need the ability to order the document results set
by a pre-defined set of numeric attributes and would
like the ability to order on alphabetic attributes as
well.
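
As an illustration, ordering by an attribute could reuse a
per-document value lookup and keep only the first page of
results. This is a sketch under that assumption. `column`,
`top_hits_by` and the parameter `n` are hypothetical names.

```python
import heapq


def top_hits_by(hits, column, n, descending=False):
    """Return the n hits that come first when ordered by column[doc_id].

    hits:   iterable of matching document ids.
    column: sequence indexed by document id (numeric or string values).
    """
    def key(doc_id):
        return column[doc_id]

    if descending:
        return heapq.nlargest(n, hits, key=key)
    return heapq.nsmallest(n, hits, key=key)
```

The same function orders on alphabetic attributes if `column`
holds strings, since Python compares strings
lexicographically.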
 
 
What we are currently thinking of as the solution to
these 3 "gaps" is an optional document-based
payload or term vector.  We think we could use this in
a hit collector to post-filter any results outside our
range queries, collect refinement values and order the
results.  It seems we could either build an external
mechanism similar to what's used in the search beans
for this, or extend the core product.  Does this seem
like a reasonable approach?  Any educated guesses on
what the percent increase in search response time will
be if we do a good job?
 
The trickiest part seems to be efficiently collecting
the refinement data for a set of 5-15 attributes.  Any
suggestions on data structures to use to
encode/collect this?   As we develop a design for this
I'll post it for additional feedback.
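
One candidate structure, offered only as a sketch, is ordinal
encoding. Each attribute stores its distinct values once, and
each document stores a small integer pointing at its value.
The sketch assumes one value per attribute per document and
documents added in document-id order. All class and function
names are hypothetical.

```python
class OrdinalColumn:
    """One attribute: distinct values stored once, one ordinal per document."""

    def __init__(self):
        self.values = []         # ordinal -> attribute value
        self._ordinals = {}      # attribute value -> ordinal
        self.doc_ordinals = []   # document id -> ordinal

    def add(self, value):
        """Record the value of the next document (ids assigned in order)."""
        ordinal = self._ordinals.get(value)
        if ordinal is None:
            ordinal = len(self.values)
            self._ordinals[value] = ordinal
            self.values.append(value)
        self.doc_ordinals.append(ordinal)


def count_refinements(hits, columns):
    """Return {attribute name: {value: hit count}} for the given hits."""
    result = {}
    for name, column in columns.items():
        tallies = [0] * len(column.values)   # one counter per distinct value
        for doc_id in hits:
            tallies[column.doc_ordinals[doc_id]] += 1
        result[name] = {
            column.values[o]: n for o, n in enumerate(tallies) if n
        }
    return result
```

Counting is then one array increment per hit per attribute,
and the attribute values themselves are only touched when the
final result is built.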

I have resources for code development and consider it
to be in Ariba's best interest to share any code
that we write in this area with the entire community.
Our time frame is to develop a prototype in the next
couple of months for proof of concept and
benchmarking.

-Geoffrey





