On Fri, May 1, 2009 at 5:02 PM, Marvin Humphrey <[email protected]> wrote:
> On Fri, May 01, 2009 at 01:18:22PM -0600, Nathan Kurz wrote:
>> but I'm not quite getting the bigger picture. Could you contextualize a
>> little?
>
> When iterating over hits, SortCollector maintains the priority queue which
> keeps track of the highest sorting/scoring documents.

I did get that, but I'm still confused about who owns the SortCache. Is it
considered to be just another index format? Is the SortCollector used in
addition to the HitCollector, or as a substitute? And does the integration
of results from multiple servers (or segments) happen above or below this?

For example, assume I've got a corpus of movie reviews and I want to search
for "fulltext:insightful rating>3 date>01/01/2008", and I want to return
results ordered by "date". I have real-time updates to the index and thus
have multiple segments. What happens?

> For floats, we should compare values directly, as Mike pointed out in his
> reply. I asserted that we should use ords for integer types in my initial
> post, but that's not the right way to do things.

This seems right, and I wonder further if you could simplify the logic in
your SortCollector by casting all the Ords to Ints and then using a single
comparison function. Store them as packed as you wish, but upgrade them to
full width before doing the comparison. Ideally, this would be transparent
to the SortCollector and completely internal to the SortCache.

It might even be efficient to cast everything to a Float or Double and get
rid of the entire switch() statement. I'm not sure of the exact way to
implement this, but I think this idea has potential. Store the ordinals as
compactly as you can, but then cast them in any order-preserving manner so
that sorting is equivalent to scoring.

> In theory, that problem is solved. The ords arrays are only used within
> segments. When reconciling hits from different segments -- or different
> machines (!) -- we use real values.

Ouch, that would explain much of my confusion. So the SortCache is an
optimization internal to a Scorer used to winnow the results down to a
manageable subset, and we still might need to use more expensive means to
reconcile multiple result sets?

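To make the question concrete, here is a minimal Python sketch of the
two-level scheme as described above: ords are compared only inside a
segment, and real values are compared when merging across segments. It is
an illustration under assumptions, not the actual implementation. The
segment layout (a dict with "values", "ords" and "matches") and the names
segment_top_hits and merged_page are hypothetical, and sorting is ascending
by an ISO date string for simplicity.

```python
import heapq
from itertools import islice


def segment_top_hits(seg_id, segment, k):
    """Pick the k best matches in one segment by comparing ords only.

    Hypothetical segment layout (an assumption for this sketch):
      segment["values"]  - sorted unique sort values for this segment
      segment["ords"]    - ords[doc_id] is an index into "values"
      segment["matches"] - doc ids that matched the query
    """
    ords = segment["ords"]
    values = segment["values"]
    # Within a segment, comparing small integer ords is enough.
    best = heapq.nsmallest(k, segment["matches"], key=lambda doc: ords[doc])
    # Ords from different segments are not comparable, so translate each
    # surviving hit to its real value before it leaves the segment.
    return sorted((values[ords[doc]], seg_id, doc) for doc in best)


def merged_page(segments, offset, count):
    """Return hits offset..offset+count-1 of the merged, sorted result."""
    # Any segment could hold every hit that precedes the requested page,
    # so each segment has to supply offset + count candidates.
    k = offset + count
    per_segment = [segment_top_hits(i, seg, k) for i, seg in enumerate(segments)]
    # Merge the per-segment lists by real value, then slice out the page.
    merged = heapq.merge(*per_segment)
    return list(islice(merged, offset, offset + count))
```

Under these assumptions, a request for results 1000-1100 makes every
segment keep its best 1100 candidates, so the cost of a deep page grows
with the offset rather than with the page size.
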
I'm still worried about how to go about combining these result sets, though,
particularly when asking for results deep within the ordered list (i.e.,
results 1000-1100). But maybe this can just be solved with additional
constraints in the query?

Nathan Kurz
[email protected]
