I'm attempting to make Lucene the document search solution for the Ariba application suite. I have identified functionality gaps in the areas of refinement, order by and range queries. Below I attempt to describe our requirements and begin a discussion on how to solve them.
Refinement/attribute value collection. One of our applications is a catalog search application. In this application our documents are catalog items. Each item has a number of fields/attributes associated with it. For example Supplier, Part number, Price, Description. We use a search metaphor where end-users iterate issuing queries and getting feedback about what's available. So initially we may tell them that 600,000 items are available from 95 suppliers, and who those suppliers are. They may choose to do a free text search for the phrase "blue pen". The result of that query may be to tell them that there's 240 items available from 2 suppliers which match that phrase, and who those suppliers are. They may pick one of the suppliers to see the list of "blue pens" available from that supplier. In addition to wanting the set of attribute values found in the result documents we would also want to return counts of the number of documents each attribute value occurs in in the result document set. To limit memory consumption for attributes with many values in the result document set we could specify a limit beyond which additional new values would not be collected, say return at most 200 suppliers which match the query. Efficient range queries. There are certain attributes in our catalog item which we want to support range queries on. For example price. While it would be unusual for this to be the primary search criteria (people rarely say show me anything that costs greater than $10 in our application) it's important to have some support for this. The trick here is that the criteria may be very open ended. For example all items with price greater than $10 might involve tens of thousands of prices. Order by attributes. We need the ability to order the document results set by a pre-defined set of numeric attributes and would like the ability to order on alphabetic attributes as well. What we are currently thinking of as the solution to these first 3 "gaps" is an optional document based payload or term vector. We think we could use this in a hit collector to post-filter any results outside our range queries, collect refinement values and order the results. It seems we could build an external mechanism similar to what's used in the search beans for this or extend the core product. Does this seem like a reasonable approach? Any educated guesses on what the percent increase in search response time will be if we do a good job? The trickiest part seems to be efficiently collecting the refinement data for a set of 5-15 attributes. Any suggestions on data structures to use to encode/collect this? As we develop a design for this I'll post it for additional feedback. I have resources for code development and consider it to be in Ariba's best interest to contribute any code that we write in this area with the entire community. Our time frame is to develop a proto-type in the next couple of months for proof of concept and benchmarking. -Geoffrey ______________________________________________________________________ Post your free ad now! http://personals.yahoo.ca --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
