On Dec 5, 2009, at 5:22 PM, Grant Ingersoll wrote:

> At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and the 
> speaker said one of the optimizations they do when retrieving a large result 
> set is that instead of managing a Priority Queue, they just allocate a large 
> array to hold all of the results and then sort afterward.   Seemed like a 
> good idea since you could avoid a whole slew of PQ operations at the cost of 
> (possibly) some extra memory, so I thought I would see if anyone has looked 
> at doing something similar here.  (Xapian is C++, I believe, so it likely has 
> different perf. characteristics which may or may not matter).  
> 
> A few things come to mind:
> 1. In many cases, when someone is asking for a lot of results, my guess is 
> they want all the results.  You often see this in people who are doing 
> significant post processing on large result sets (often for machine learning)
> 2. My gut says there is actually some heuristic to be applied here, whereby 
> if the requested number of results is some percentage of the total num. of 
> hits, then do the whole results + sort, otherwise, use PQ.  Thus, this might 
> be useful even when requesting something like, say, 500 or 1000 docs, as 
> opposed to the bigger cases I'm thinking about.  Still, no need to 
> prematurely optimize.
> 3.  I think we could almost implement this entirely w/ a Collector, except 
> for the sort step, which presumably could be a callback (empty implementation 
> in the base class).
> 4. When doing this, you don't bother to truncate the list when n < totalHits 
> (although see below).
> 
> Perhaps, we could also even save having a big array for doc ids and instead 
> just flip bits in a bit set (would still need the array for scores) and then 
> materialize the bit set to an array at the end when the num requested is less 
> than the total number of hits.
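
That bit-set variant might look roughly like this (a standalone sketch using java.util.BitSet, not the actual Lucene Collector API; the class and method names are made up):

```java
import java.util.BitSet;

/** Standalone sketch (not the Lucene Collector API): mark hits by
 *  flipping bits in a BitSet, keep scores in a parallel array indexed
 *  by docId, and materialize the matching ids only at the end. */
class BitSetHitCollector {
    private final BitSet hits;
    private final float[] scores;   // one slot per possible docId

    BitSetHitCollector(int maxDoc) {
        hits = new BitSet(maxDoc);
        scores = new float[maxDoc];
    }

    void collect(int docId, float score) {
        hits.set(docId);
        scores[docId] = score;
    }

    /** Materialize the bit set into an array of doc ids, ascending. */
    int[] docIds() {
        int[] out = new int[hits.cardinality()];
        int i = 0;
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            out[i++] = doc;
        }
        return out;
    }

    float scoreOf(int docId) { return scores[docId]; }
}
```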
> 
> Anyone have thoughts on this?  Seems fairly trivial to crank out a Collector 
> to do it, minus the post processing step, which would be relatively trivial 
> to add.
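
Points 1-3 could be sketched roughly as follows (standalone Java, not the real Lucene Collector API; the class name and the 10% threshold are my own guesses):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Sketch of collect-everything-then-sort with the threshold heuristic
 *  from point 2 (the 0.10 cutoff is a guess, not a measured value):
 *  append every hit to growing parallel arrays, then either sort the
 *  whole set or fall back to a bounded priority queue, depending on
 *  how much of the result set was requested. */
class CollectAllThenSort {
    private int[] docs = new int[1024];
    private float[] scores = new float[1024];
    private int count = 0;

    void collect(int doc, float score) {
        if (count == docs.length) {               // grow both parallel arrays
            docs = Arrays.copyOf(docs, count * 2);
            scores = Arrays.copyOf(scores, count * 2);
        }
        docs[count] = doc;
        scores[count] = score;
        count++;
    }

    /** Doc ids of the top n hits, best score first. */
    int[] topN(int n) {
        n = Math.min(n, count);
        if (n >= count * 0.10) {                  // heuristic from point 2
            Integer[] order = new Integer[count];
            for (int i = 0; i < count; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
            int[] out = new int[n];
            for (int i = 0; i < n; i++) out[i] = docs[order[i]];
            return out;
        }
        // Small n relative to the hit count: a size-n min-heap is cheaper.
        PriorityQueue<Integer> pq =
            new PriorityQueue<>((a, b) -> Float.compare(scores[a], scores[b]));
        for (int i = 0; i < count; i++) {
            pq.offer(i);
            if (pq.size() > n) pq.poll();         // evict the current worst
        }
        int[] out = new int[n];
        for (int i = n - 1; i >= 0; i--) out[i] = docs[pq.poll()];
        return out;
    }
}
```

Which branch wins presumably depends on the ratio of n to totalHits, which is exactly the heuristic in point 2.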

I'm not sure what constitutes "lots of results".

For my application, most searches are for all matching documents. We typically 
don't exclude stop words when building the index, and "OR" is the default 
connector for searches. Until users figure out the impact of this, search 
results typically run to about 64K hits.

To clarify our use case, we index Bibles with each verse as a document. To 
simplify the structure a bit for those who are unfamiliar: a Bible is an 
ordered collection of books, broken up into chapters, and the sentences (aka 
verses) within each chapter are numbered. Stored in each document is a value 
that represents its position in the original whole document. Depending on the 
number of hits and their adjacency, we choose a storage method: sometimes it 
is a bitmap, other times an array with run-length encoding.
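
A much-simplified sketch of that choice (the names and the byte estimates are hypothetical, not our actual code): hits that form long adjacent runs compress well with run-length encoding, while dense hits over a small range favor a bitmap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of picking a storage method from the sorted hit
 *  positions: a bitmap when hits are dense, a run-length encoded list
 *  when they are sparse or clustered into a few adjacent runs. */
class HitStorageChooser {
    /** Encode sorted positions as flat (start, length) run pairs. */
    static int[] encodeRuns(int[] sortedPositions) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < sortedPositions.length) {
            int start = sortedPositions[i];
            int len = 1;
            while (i + len < sortedPositions.length
                    && sortedPositions[i + len] == start + len) {
                len++;                            // extend the adjacent run
            }
            runs.add(new int[] { start, len });
            i += len;
        }
        int[] flat = new int[runs.size() * 2];
        for (int r = 0; r < runs.size(); r++) {
            flat[2 * r] = runs.get(r)[0];
            flat[2 * r + 1] = runs.get(r)[1];
        }
        return flat;
    }

    /** True if a bitmap over [0, universe) is no larger than the RLE form. */
    static boolean preferBitmap(int[] sortedPositions, int universe) {
        long bitmapBytes = (universe + 7) / 8;
        long rleBytes = (long) encodeRuns(sortedPositions).length * 4; // 4-byte ints
        return bitmapBytes <= rleBytes;
    }
}
```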

When all matches are requested, the results are sorted by sentence ordering. 
This is relatively trivial: the container is responsible for the ordering, so 
the collector does not require documents to arrive in order.
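
That post-collection ordering amounts to something like this (the position lookup is hypothetical; in practice it comes from the stored position value mentioned above):

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: the collector may hand back docs in any order, and the
 *  container sorts afterward by each doc's stored position in the
 *  original whole document (positionByDoc stands in for that lookup). */
class VerseOrderSorter {
    /** Return docIds ordered by their stored verse position. */
    static Integer[] sortByPosition(Integer[] docIds, int[] positionByDoc) {
        Integer[] sorted = docIds.clone();
        Arrays.sort(sorted, Comparator.comparingInt(d -> positionByDoc[d]));
        return sorted;
    }
}
```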

In this case, scoring is entirely unimportant.

I used to work for a legal search company that did something similar for legal 
documents, numbering sections, paragraphs, and sentences, with pages overlaid. 
(Interestingly, and unrelated to this question, each type of legal document 
had a different set of stop words. For example, in divorce law, "divorce" is a 
noise word.) In that world, scoring was important, since the best results were 
being sought.

It is pretty trivial, as you say, to crank out a Collector to do it. For the 
beginner or casual user, it is good to have at least a simple example/demo. So 
yes, it would be helpful.

I just went through this when migrating to 2.9.

--DM




---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
