"Marvin Humphrey" <[EMAIL PROTECTED]> wrote: > > On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote: > > > So you basically do not "de-dup" by field+term on your first pass > > through the tokens in the doc (which is "roughly" what that hash > > does). Instead, append all tokens in an array, then sort first by > > field+text and second by position? This is done for each document > > right? > > Almost. The sorting is done per-field. > > Token doesn't have a "field", so comparison is cheaper than you're > thinking.
Got it.  I've done exactly that (process one field's tokens at a time)
with LUCENE-843 as well.

> > Did you ever compare this approach against hash (or other de-dup data
> > structure, letter trie or something) approach?
>
> KS used to use hashing, though it wasn't directly analogous to how
> Lucene does things.  I've only tried these two techniques.  This was
> faster by about 30%, but the difference is not all in the de-duping.

OK.  30% is very nice :)

> KS is concerned with preparing serialized postings to feed into an
> external sorter.  In the hashing stratagem, every position added to a
> term_text => serialized_posting pair in the hash requires a string
> concatenation onto the end of serialized_posting, and thus a call to
> realloc().
>
> Besides switching out hashing overhead for qsort overhead, the
> sorting technique also allows KS to know up front how many positions
> are associated with each posting, so the memory for the serialized
> string only has to be allocated once.

Yeah, those realloc()'s are costly (Lucene trunk has them too).  In
LUCENE-843 I found a way to share large int[] arrays so that a given
posting uses slices into the shared arrays instead of doing reallocs
(see the sketch after this message).

I think I'm doing something similar to feeding an external sorter: I'm
just using the same approach as Lucene's segment merging of the
postings, optimized somewhat to handle a very large number of segments
at once (for the merging of the "level 0" single-document segments).  I
use this same merger to merge level N RAM segments into level N+1 RAM
segments, to merge RAM segments into a single flushed segment, to merge
flushed segments into a single flushed segment, and then finally to
merge flushed and RAM segments into the "real" Lucene segment at the
end.

I think it differs from an external sorter in that I manage explicitly
when to flush a "run" to disk (the autoCommit=false case) or to flush it
to a Lucene segment (the autoCommit=true case), rather than letting the
sorter API decide.

Mike
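A rough illustration of the shared-array idea mentioned above, as a simplified Java sketch (the IntBlockPool name, slice sizes, and sentinel scheme are assumptions, not the actual LUCENE-843 code): positions for a posting are appended into small slices carved out of one big shared int[], and when a slice fills up its last cell is overwritten with a forward pointer to a new, larger slice, so no per-posting realloc is ever needed.

    final class IntBlockPool {
      // Slice sizes grow by level, so long postings end up in big slices.
      private static final int[] SIZE_AT_LEVEL = {4, 8, 16, 32, 64, 128};

      // One large shared array.  (A real implementation would more likely
      // keep a list of fixed-size blocks instead of growing one array.)
      int[] buffer = new int[1 << 16];
      private int allocated = 0;

      /** Carve a slice of the given level out of the shared buffer; the last
       *  cell holds a negative sentinel encoding the level.  Values written
       *  into slices are assumed to be non-negative (they are positions). */
      int newSlice(int level) {
        int size = SIZE_AT_LEVEL[level];
        if (allocated + size > buffer.length) {
          buffer = java.util.Arrays.copyOf(buffer, buffer.length * 2);
        }
        int start = allocated;
        allocated += size;
        buffer[start + size - 1] = -(level + 1);   // end-of-slice marker
        return start;
      }

      /** Append one value at writeOffset; if the slice is full, chain a new,
       *  larger slice and write there.  Returns the next write offset. */
      int append(int writeOffset, int value) {
        if (buffer[writeOffset] < 0) {             // hit the end-of-slice marker
          int nextLevel = Math.min(-buffer[writeOffset], SIZE_AT_LEVEL.length - 1);
          int newStart = newSlice(nextLevel);
          buffer[writeOffset] = newStart;          // forward pointer to new slice
          writeOffset = newStart;
        }
        buffer[writeOffset] = value;
        return writeOffset + 1;
      }
    }

Each in-memory posting then only has to remember the offset of its first slice plus its current write offset, and a reader can recover the positions by walking the forward pointers with the same level-size table to locate slice boundaries.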