"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> 
> On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:
> 
> > So you basically do not "de-dup" by field+term on your first pass
> > through the tokens in the doc (which is "roughly" what that hash
> > does).  Instead, you append all tokens to an array, then sort first
> > by field+text and second by position?  This is done for each
> > document, right?
> 
> Almost.  The sorting is done per-field.
>
> Token doesn't have a "field", so comparison is cheaper than you're  
> thinking.

Got it.  I've done exactly that (process one field's tokens at a time)
with LUCENE-843 as well.
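
In heavily simplified form (made-up names, not the actual code), that
per-field pass looks something like this: buffer one field's tokens,
then sort by term text and then by position, so duplicate terms land
next to each other and can be collapsed into postings in one pass:

  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  // Simplified sketch with invented names, not the real LUCENE-843 code.
  class FieldTokenSorter {

    // One buffered token occurrence within a single field.
    static class TokenEntry {
      final String text;
      final int position;
      TokenEntry(String text, int position) {
        this.text = text;
        this.position = position;
      }
    }

    // Sort one field's tokens: primary key is term text, secondary key
    // is position.  No field in the comparison, since we sort per field.
    void sortOneField(List<TokenEntry> tokens) {
      Collections.sort(tokens, new Comparator<TokenEntry>() {
        public int compare(TokenEntry a, TokenEntry b) {
          int cmp = a.text.compareTo(b.text);
          if (cmp != 0) return cmp;
          return a.position - b.position;
        }
      });
    }
  }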

> > Did you ever compare this approach against a hash (or other de-dup
> > data structure, letter trie or something) approach?
> 
> KS used to use hashing, though it wasn't directly analogous to how  
> Lucene does things.  I've only tried these two techniques.  This was  
> faster by about 30%, but the difference is not all in the de-duping.

OK.  30% is very nice :)

> KS is concerned with preparing serialized postings to feed into an  
> external sorter.  In the hashing stratagem, every position added to a  
> term_text => serialized_posting pair in the hash requires a string  
> concatenation onto the end of serialized_posting, and thus a call to  
> realloc().
>
> Besides switching out hashing overhead for qsort overhead, the  
> sorting technique also allows KS to know up front how many positions  
> are associated with each posting, so the memory for the serialized  
> string only has to be allocated once.

Yeah, those realloc()s are costly (Lucene trunk has them too).  In
LUCENE-843 I found a way to share large int[] arrays so that a given
posting uses slices into the shared arrays instead of doing reallocs.
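
In very simplified form (invented names, not the real code), the idea
is that each posting owns a chain of slices carved out of one big
int[], and when a slice fills up it links to a new slice from the same
pool instead of realloc'ing a per-posting buffer:

  import java.util.Arrays;

  // Simplified sketch with invented names, not the real LUCENE-843 code.
  // Assumes appended values are >= 0, so -1 can mark the end of a slice.
  class SharedIntPool {
    private int[] pool = new int[1024];
    private int allocUpto;                   // next free offset in pool

    // Carve out a slice; its last cell is reserved as a link to the
    // next slice in the chain (-1 = no next slice yet).
    int newSlice(int size) {
      if (allocUpto + size > pool.length) {
        pool = Arrays.copyOf(pool, pool.length * 2);  // grow the shared
      }                                               // pool, not postings
      int start = allocUpto;
      allocUpto += size;
      pool[start + size - 1] = -1;
      return start;
    }

    // A posting appends 'value' at offset 'upto'.  If 'upto' is the
    // reserved link cell, chain a fresh slice first.  Returns the
    // offset for the posting's next append.
    int append(int upto, int value) {
      if (pool[upto] == -1) {
        int next = newSlice(32);             // real code grows slice sizes
        pool[upto] = next;                   // link old slice to new one
        upto = next;
      }
      pool[upto] = value;
      return upto + 1;
    }
  }

A posting then just tracks the offset of its next write; positions get
appended into the shared pool with no per-posting reallocation.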

I think I'm doing something similar to feeding an external sorter: I'm
just using the same approach as Lucene's segment merging of the
postings, optimized somewhat to handle a very large number of segments
at once (for the merging of the "level 0" single-document segments).
I use this same merger to merge level N RAM segments into level N+1
RAM segments, to merge RAM segments into a single flushed segment, to
merge flushed segments into a single flushed segment, and then finally
to merge flushed and RAM segments into the "real" Lucene segment at
the end.
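
The cascade has roughly this shape (sketch only; RAMSegment and
mergeSegments() are stand-ins for the real structures and merger):

  import java.util.ArrayList;
  import java.util.List;

  // Simplified sketch with invented names, not the real LUCENE-843 code.
  class MergeCascade {

    static class RAMSegment { /* buffered postings for one or more docs */ }

    // Stand-in for the single merger used at every level.
    RAMSegment mergeSegments(List<RAMSegment> segs) {
      return new RAMSegment();
    }

    // Level 0 holds single-document segments; whenever a level collects
    // 'fanIn' (>= 2) segments, merge them into one segment at the next
    // level, using the same merger everywhere.
    void maybeCascade(List<List<RAMSegment>> levels, int fanIn) {
      for (int level = 0; level < levels.size(); level++) {
        List<RAMSegment> segs = levels.get(level);
        if (segs.size() >= fanIn) {
          RAMSegment merged = mergeSegments(segs);
          segs.clear();
          if (level + 1 == levels.size()) {
            levels.add(new ArrayList<RAMSegment>());
          }
          levels.get(level + 1).add(merged);
        }
      }
    }
  }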

I think it differs from an external sorter in that I explicitly manage
when to flush a "run" to disk (the autoCommit=false case) or flush it
as a Lucene segment (the autoCommit=true case), rather than letting
the sorter API decide.
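
That decision boils down to something like this (sketch only, with
made-up method names):

  // Simplified sketch with invented names, not the real LUCENE-843 code.
  class FlushPolicySketch {
    boolean autoCommit;

    // Called by the writer when the RAM buffer is full.
    void onRAMFull() {
      if (autoCommit) {
        flushAsLuceneSegment();    // autoCommit=true: a real, visible segment
      } else {
        flushAsIntermediateRun();  // autoCommit=false: a private "run" on
      }                            // disk, merged into a Lucene segment later
    }

    void flushAsLuceneSegment()   { /* placeholder */ }
    void flushAsIntermediateRun() { /* placeholder */ }
  }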

Mike
