Very nice implementation and a great write up. How large is the index? And when you keep posting new content to the index, will you optimize the index?
-- Chris Lu ------------ Lucene Search RAD on Any Database http://www.dbsight.net On 8/30/05, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > I'm pleased to announce that for about a month now, CNET's "Product > Listing" pages are powered by Lucene 1.4.3. These pages not only allow > users to browse CNET's catalog of tech products by category, but also to > "Filter" the lists according to category specific Attribute Filters which > are displayed along with counts of how many products they will get if they > apply that Filter. Multiple Filters can be applied (in any order) to > rapidly narrow down the list of products. > > Examples of these pages can be seen here... > > Digital Cameras > http://reviews.cnet.com/4566-6501_7-0.html > > Inkjet Printers > http://reviews.cnet.com/4566-3156_7-0.html > > Epson Inkjet Printers > http://reviews.cnet.com/4566-3156_7-0.html?filter=1000036_5251152_ > > Epson Inkjet Printers that can print on Transparencies > http://reviews.cnet.com/4566-3156_7-0.html?filter=1000036_5251152_500193_5314692_ > > These pages work much the same way as I've described in past threads > regarding "category counts", except that the logic determining which > filter links to display is not as simple as just pulling out the most > frequent terms per field, or based on a fixed list. As you can see from > the example links, each category has it's own unique list of attributes > (ie: Price, Manufacturer, etc...) and for each of those attributes, there > is a list of Queries which map one to one with a possible "Find by" link. > Even if an Attribute is common between two categories, the list of Queries > to filter by may be different -- note the differences in the "Find by > price" lists between the various links above. We have several thousand > unique categories, and some of these have as many as a thousand unique > Filter Queries which are needed to determine the counts to display on any > given page for that category, but using some very aggressive Filter > caching the time for a single request is kept very manageable. > > For those who are interested, I can elaborate a little more on how these > pages work.... > > > > At a high level there are four major pieces... > > 1) A Servlet which abstracts away most of the Lucene index modification > APIs into an HTTP/XML based "web service" by accepting POSTed XML > documents to add/update in the index. It also replies to GET search > requests using query plugins that have access to an IndexReader. > > 2) A ProductData index updater, which is executed as part of our "product > publishing" process. Anytime a product is added (or modified) in our > database the updater creates an XML document describing the product and > POSTs it to the above mentioned Servlet (which indexes it). > > 3) A Metadata index updater, which is executed as part of our "category > metadata publishing" process. Anytime someone decides to change the > metadata that describes a category, this process creates an XML document > containing that metadata, and POSTs it to the above mentioned Servlet > (I'll elaborate more on these category documents in a moment). > > 4) A Query Plugin used by the Servlet specificly to generate the product > result lists and counts needed for these product listing pages. > > > The Category Metadata documents are what really drive the behavior of the > Plugin. They contain the following information... > > * A Query whose results are all products to display in this category > * An ordered list of Attributes that can be filtered on > - A datatype for each Attribute > - An ordered list of "Filters" for each Attribute > + A label to display for each Filter > + A Query to define what products match that Filter > > When a request comes in for a category, the first thing the Plugin does is > an initial query on the category Id to get the category's Metadata > document. From that document, the field containing the Query that defines > that category is extracted, and a search is issued against it (using > whatever Sort options have been specified). This Category Query is also > used to build a QueryFilter so that a BitSet of every matching product can > be obtained. For each Filter in each Attribute found in the Category's > document, the Query is extracted, and again a QueryFilter is built to > obtain a BitSet of all products which match. The intersection of that > BitSet with the BitSet from the initial Category Query is computed to > determine the "count" to display next to the Filter label. Once all of > this is done the list of products, all of the data from the Category > Metadata document, and the counts for each of the Category Filters are > bundled up into an XML response document. The client which initiated the > search can then apply additional Business logic to decide which > attributes/filters to display counts for -- the simple case is to display > the first N attributes, and for each attribute display the Filter links > with the highest counts, but in some cases the links may be displayed in > different orders based on the datatype of the attribute. > > When a user clicks on a Filter link the process is the same as before, but > the initial Category Query is augmented by the Filter that has been > selected -- so the results to display on the first page (and the BitSet of > all matching products) are correct. The new counts (which take into > account the selected Filter) are computed exactly as before -- using > BitSet intersections. > > > What makes all of this feasible to do during a single user request, and > what keeps the load on our servers manageable, is an aggressive caching > strategy. > > The Servlet maintains a single IndexReader for use by the any requests to > the Query Plugin. The Servlet also maintains a fixed size Cache of [ > Filter => BitSet] (This cache currently uses LRU replacement, but ideally > it would be LFU). The Servlet keeps track of when it makes modifications > to the index, and once it's decided that it is time to make those > modifications visible to the plugin it uses a background thread to open a > new IndexReader, and create a new Cache instance which it warms up by > "pre-computing" the BitSets for the top N Filters in the Cache already in > use. It then swaps out the "old" IndexReader/Cache pair with the new > IndexReader/Cache for all subsequent searches. > > Given a large amount of RAM, and infrequent updates to the index, page > hits to most categories rarely involve anything more then Cache lookups. > But even when we make frequent updates, the Cache warming we do with a > newly constructed IndexReader prior to actually *using* the IndexReader > allows us to remain very responsive on our most popular categories. > Through configuration, we can decide: Is it more important to open new > IndexReaders as fast as possible (and display new results immediately) at > the expense of not being able to pre-warm the cache very much? (resulting > in slower page loads) ... Or: Is it more important to keep our page load > times very low, by pre-warming our cache with everything and the kitchen > sink (which means results take a while to update because we are opening > new IndexReaders less frequently). Our current configuration results in > ~95% cache hit rate for our Filters. > > > > Hopefully I've explained the overall design of our system well enough that > people interested in doing "category counts" and "drilling down" can see > that it is possible, even when you are dealing with a very large number of > Filters. I'm sure my familiarity with the system has caused me to write > something that makes perfect sense to me, but is totally unintelligible to > everyone else -- if so, please feel free to ask any questions you have, > I'll try to answer them as best I can. Some questions regarding the > internals of the Servlet and the Caching it does may be beyond my ability > to answer because they were developed by my coworker -- but he is also an > active participant on this list, and (time permitting) I'm sure he'd > happily answer any questions I can not. > > > > -Hoss > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]