Re: SortCollector

Nathan Kurz Sat, 02 May 2009 11:51:53 -0700

On Fri, May 1, 2009 at 5:02 PM, Marvin Humphrey <[email protected]> wrote:
> On Fri, May 01, 2009 at 01:18:22PM -0600, Nathan Kurz wrote:
>> but I'm not quite getting the bigger picture.  Could you contextualize a
>> little?
>
> When iterating over hits, SortCollector maintains the priority queue which
> keeps track of the highest sorting/scoring documents.


I did get that, but I'm still confused about who owns the SortCache.
Is it considered to be just another index format?  Is the
SortCollector used in addition to the HitCollector or as a substitute?  And
does the integration of results from multiple servers (or segments)
happen above or below this?

For example, assume I've got a corpus of movie reviews and I want to
search for "fulltext:insightful rating>3 date>01/01/2008" and I want to
return results ordered by "date".   I have real-time updates to the
index and thus have multiple segments.   What happens?

> For floats, we should compare values directly, as Mike pointed out in his
> reply.  I asserted that we should use ords for integer types in my initial
> post, but that's not the right way to do things.

This seems right, and I wonder further if you could simplify the logic
in your SortCollector by casting all the Ords to Ints and then
using a single comparison function.  Store them as packed as you wish,
but upgrade them to full width before doing the comparison.

Ideally, this would be transparent to the SortCollector and completely
internal to the SortCache.   It might even be efficient to cast
everything to a Float or Double and get rid of the entire switch()
statement.

I'm not sure of the exact way to implement this, but I think this idea
has potential.  Store the ordinals as compactly as you can, but then
cast them in any order-preserving manner so that sorting is equivalent
to scoring.
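
To make that concrete, here is a rough sketch in Python (chosen only for
brevity).  The names pack_ords, unpack_ord and compare_docs, and the bit
layout, are made up for illustration and are not taken from the existing
code.  It relies on the fact that a double represents every integer up to
2**53 exactly, so widening an ord to a float keeps its order in that range.

```python
def pack_ords(ords, bits):
    """Pack ordinals into bytes, `bits` bits each (hypothetical layout)."""
    buf = bytearray((len(ords) * bits + 7) // 8)
    for doc_id, ord_ in enumerate(ords):
        for i in range(bits):
            if (ord_ >> i) & 1:
                pos = doc_id * bits + i
                buf[pos // 8] |= 1 << (pos % 8)
    return bytes(buf)


def unpack_ord(packed, bits, doc_id):
    """Read one ordinal back out, widened to a full-width int."""
    start = doc_id * bits
    value = 0
    for i in range(bits):
        pos = start + i
        value |= ((packed[pos // 8] >> (pos % 8)) & 1) << i
    return value


def compare_docs(packed, bits, doc_a, doc_b):
    """Single comparison path: widen both ords to float, then compare.

    Returns -1, 0 or 1, whatever width the ords were stored at.
    """
    a = float(unpack_ord(packed, bits, doc_a))
    b = float(unpack_ord(packed, bits, doc_b))
    return (a > b) - (a < b)


# Four docs whose ords fit in 2 bits each, so the cache is a single byte.
cache = pack_ords([3, 0, 2, 1], bits=2)
```

The collector only ever calls compare_docs(); the storage width stays a
private detail of the cache, so no switch() on the ord type is needed.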

> In theory, that problem is solved.  The ords arrays are only used within
> segments.  When reconciling hits from different segments -- or different
> machines (!) -- we use real values.

Ouch, that would explain much of my confusion.  So the SortCache is an
optimization internal to a Scorer used to winnow the results down to a
manageable subset, and we still might need to use more expensive means
to reconcile multiple result sets?
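
If I'm reading that correctly, the flow would be something like the sketch
below.  This is only my guess at the shape of it, in Python for brevity;
top_k_in_segment and merge_segments are invented names, and I'm assuming
each segment keeps a sorted list of unique values that the ords index into.

```python
import heapq
from itertools import islice


def top_k_in_segment(seg_id, ords, values, hits, k):
    """Pick the k best hits in one segment by comparing ordinals only.

    ords[doc] indexes into values, the segment's sorted unique values.
    Ordinals mean nothing outside the segment, so real values are returned.
    """
    best = heapq.nsmallest(k, hits, key=lambda doc: (ords[doc], doc))
    return [(values[ords[doc]], seg_id, doc) for doc in best]


def merge_segments(per_segment, k):
    """Reconcile the per-segment lists by comparing real values."""
    return list(islice(heapq.merge(*per_segment), k))


# Two hypothetical segments sorted on a "date" field (ascending).
seg0 = top_k_in_segment(
    0, ords=[2, 0, 1],
    values=["2008-01-05", "2008-03-01", "2008-07-12"],
    hits=[0, 1, 2], k=2)
seg1 = top_k_in_segment(
    1, ords=[1, 0, 0],
    values=["2008-02-14", "2008-09-30"],
    hits=[0, 2], k=2)
top3 = merge_segments([seg0, seg1], k=3)
```

The cheap ord comparisons happen once per hit inside a segment, while the
more expensive value comparisons are limited to at most k entries per
segment.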

I'm still worried about how to go about combining these result sets,
though, particularly when doing things like asking for results deep
within the ordered list (i.e., results 1000-1100).  But maybe this can
just be solved with additional constraints in the query?
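
The cost I have in mind looks like this.  It is a sketch under my own
assumptions, in Python for brevity, and page() is an invented name: without
extra constraints, every segment or server has to hand back its own top
(offset + count) entries, because in the worst case all of the requested
results live in a single one of them.

```python
import heapq
from itertools import islice


def page(per_segment_sorted, offset, count):
    """Return entries [offset, offset + count) of the merged ordering.

    Each input list must already be sorted by the real sort value.
    """
    need = offset + count
    # Nothing past position `need` in any one source can land in the page.
    trimmed = [seg[:need] for seg in per_segment_sorted]
    return list(islice(heapq.merge(*trimmed), offset, need))


# Results 1000-1100 across two interleaved sources of sort values.
evens = list(range(0, 3000, 2))
odds = list(range(1, 3000, 2))
deep = page([evens, odds], offset=1000, count=100)
```

So a request for 100 results at depth 1000 moves up to 1100 entries per
source, which is why constraining the query seems attractive.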

Nathan Kurz
[email protected]
