"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> 
> On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:
> 
> > The one thing that still baffles me is: I can't get a persistent
> > Posting hash to be any faster.
> 
> Don't use a hash, then.  :)
> 
> KS doesn't.
> 
>    * Give Token a "position" member.
> >    * After you've accumulated all the Tokens, calculate
>      position for each token from the position increment.
>    * Arrange the postings in an array sorted by position.
>    * Count the number of postings in a row with identical
>      text to get freq.
> 
> Relevant code from KinoSearch::Analysis::TokenBatch below.

OOOOOH!  I like that approach!

So you basically do not "de-dup" by field+term on your first pass
through the tokens in the doc (which is "roughly" what that hash
does).  Instead, append all tokens to an array, then sort first by
field+text and second by position?  This is done for each document,
right?
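
If I'm reading that right, a quick sketch of the idea in Java (my own
made-up names, just to check my understanding -- not the actual
TokenBatch code) would be something like:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedPostingsSketch {

    // One entry per token occurrence in the document.
    static final class TokenEntry {
        final String field;
        final String text;
        final int position;
        TokenEntry(String field, String text, int position) {
            this.field = field;
            this.text = text;
            this.position = position;
        }
    }

    public static void main(String[] args) {
        List<TokenEntry> entries = new ArrayList<>();

        // 1. Accumulate every token, computing the absolute position from
        //    the position increments as we go -- no de-duping yet.
        String[] tokens  = { "fast", "fox", "jumps", "fox" };
        int[]    posIncr = { 1, 1, 1, 1 };
        int position = -1;
        for (int i = 0; i < tokens.length; i++) {
            position += posIncr[i];
            entries.add(new TokenEntry("body", tokens[i], position));
        }

        // 2. One sort per document: first by field+text, then by position.
        entries.sort(Comparator
            .comparing((TokenEntry e) -> e.field)
            .thenComparing(e -> e.text)
            .thenComparingInt(e -> e.position));

        // 3. Walk the sorted array; each run of identical field+text is
        //    one posting, and the run length is its freq.
        int i = 0;
        while (i < entries.size()) {
            TokenEntry first = entries.get(i);
            int runStart = i;
            while (i < entries.size()
                   && entries.get(i).field.equals(first.field)
                   && entries.get(i).text.equals(first.text)) {
                i++;
            }
            System.out.println(first.field + ":" + first.text
                + " freq=" + (i - runStart) + " firstPos=" + first.position);
        }
    }
}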

This seems like it could be a major win!

Did you ever compare this approach against a hash (or some other
de-dup data structure, a letter trie or something)?

I guess it depends on how many total terms you have in the doc vs. how
many unique terms.  Qsort is N log N, and the comparison of two terms
is somewhat costly.  With de-duping, you pay a hash lookup/insert cost
(plus the cost of managing little int buffers to hold
positions/offsets/etc.) per token, but then only qsort the number of
unique terms.
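
For concreteness, the hash-based de-dup bookkeeping I'm picturing is
roughly this (hypothetical names, just a sketch of the cost model, not
the real indexing code): one hash lookup per token, a small position
buffer per unique term, then a sort over unique terms only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HashedPostingsSketch {

    public static void main(String[] args) {
        String[] tokens = { "fast", "fox", "jumps", "fox" };

        // One hash lookup/insert per token; positions pile up in a small
        // buffer (here just a List<Integer>) per unique term.
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String term : tokens) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(position++);
        }

        // The final sort only touches the unique terms -- U log U instead
        // of N log N -- at the price of the per-token hashing above.
        Map<String, List<Integer>> sorted = new TreeMap<>(postings);
        sorted.forEach((term, positions) ->
            System.out.println(term + " freq=" + positions.size()
                + " positions=" + positions));
    }
}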

Whereas with your approach you don't pay any de-duping cost, but your
qsort is over the total # of terms, not the # of unique terms.  I bet
your approach is quite a bit faster for small to medium-sized docs (by
far the "norm"), but not faster for very large docs?  Or maybe it's
even faster for very large docs because qsort is so darned fast :)

Mike
