On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:

> So you basically do not "de-dup" by field+term on your first pass
> through the tokens in the doc (which is "roughly" what that hash
> does).  Instead, you append all tokens to an array, then sort first
> by field+text and second by position?  This is done for each
> document, right?

Almost.  The sorting is done per-field.

Token doesn't have a "field", so comparison is cheaper than you're thinking.

int
Token_compare(const void *va, const void *vb)
{
    Token *const a = *((Token**)va);
    Token *const b = *((Token**)vb);

    /* Compare the shared prefix of the two term texts. */
    size_t min_len = a->len < b->len ? a->len : b->len;

    int comparison = memcmp(a->text, b->text, min_len);

    if (comparison == 0) {
        if (a->len != b->len) {
            /* One text is a prefix of the other; the shorter sorts first. */
            comparison = a->len < b->len ? -1 : 1;
        }
        else if (a->pos != b->pos) {
            /* Identical term text: break the tie by token position. */
            comparison = a->pos < b->pos ? -1 : 1;
        }
        /* Identical text and position: leave comparison at 0 so the
         * comparator stays consistent for qsort(). */
    }

    return comparison;
}
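
In case it helps to see the comparator in context, here's a minimal,
purely illustrative sketch -- the Token struct below is a stand-in with
just the three fields the comparator touches, and the field types are
my guesses rather than KS's actual definitions.  It assumes
Token_compare above is compiled alongside it.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in Token with only the fields Token_compare touches; the
 * field types are assumptions, not the real KS declarations. */
typedef struct Token {
    const char *text;
    size_t      len;
    int         pos;
} Token;

int Token_compare(const void *va, const void *vb);  /* definition above */

int
main(void)
{
    /* One field's tokens, in the order the analyzer produced them. */
    Token tokens[] = {
        { "quick", 5, 1 }, { "brown", 5, 2 }, { "quick", 5, 3 },
    };
    size_t num_tokens = sizeof(tokens) / sizeof(tokens[0]);

    /* The comparator dereferences Token**, so sort an array of pointers. */
    Token *sorted[3];
    for (size_t i = 0; i < num_tokens; i++) { sorted[i] = &tokens[i]; }
    qsort(sorted, num_tokens, sizeof(Token*), Token_compare);

    /* Prints: brown @ 2, quick @ 1, quick @ 3 -- duplicate term texts
     * end up adjacent, ordered by position. */
    for (size_t i = 0; i < num_tokens; i++) {
        printf("%.*s @ %d\n", (int)sorted[i]->len, sorted[i]->text,
               sorted[i]->pos);
    }
    return 0;
}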

> Did you ever compare this approach against a hash (or other de-dup
> data structure, a letter trie or something)?

KS used to use hashing, though it wasn't directly analogous to how Lucene does things. Those are the only two techniques I've tried. Sorting was faster by about 30%, but the difference is not all in the de-duping.

KS is concerned with preparing serialized postings to feed into an external sorter. In the hashing stratagem, every position added to a term_text => serialized_posting pair in the hash requires a string concatenation onto the end of serialized_posting, and thus a call to realloc().
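
To illustrate that cost pattern (this is a sketch, not the actual KS
hash code, and Posting_append_pos is a made-up name): every position
appended means another realloc() of the entry's serialized string.

#include <stdlib.h>
#include <string.h>

/* Illustrative only: one hash entry's serialized posting grows by one
 * realloc() per position appended.  Error handling omitted. */
typedef struct Posting {
    char   *bytes;   /* serialized posting so far */
    size_t  size;
} Posting;

static void
Posting_append_pos(Posting *posting, int pos)
{
    /* A position is encoded as a fixed-width int here for simplicity;
     * the real serialization format differs. */
    posting->bytes = realloc(posting->bytes, posting->size + sizeof(pos));
    memcpy(posting->bytes + posting->size, &pos, sizeof(pos));
    posting->size += sizeof(pos);
}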

Besides trading hashing overhead for qsort overhead, the sorting technique also allows KS to know up front how many positions are associated with each posting, so the memory for the serialized string only has to be allocated once.
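
A rough sketch of that idea, reusing the stand-in Token from earlier
(serialize_postings and same_text are invented names, and the actual
encoding is elided): once the field's tokens are sorted, each run of
identical term texts is one posting, and the run length supplies the
position count before any buffer is allocated.

#include <stdlib.h>
#include <string.h>

/* Token as assumed in the earlier sketch: text, len, pos. */
typedef struct Token { const char *text; size_t len; int pos; } Token;

static int
same_text(const Token *a, const Token *b)
{
    return a->len == b->len && memcmp(a->text, b->text, a->len) == 0;
}

static void
serialize_postings(Token **sorted, size_t num_tokens)
{
    for (size_t i = 0; i < num_tokens; ) {
        /* Find the end of the run of tokens sharing this term text. */
        size_t j = i + 1;
        while (j < num_tokens && same_text(sorted[i], sorted[j])) { j++; }
        size_t num_positions = j - i;

        /* Allocate the serialized string once, sized for the term text
         * plus every position, instead of one realloc() per position.
         * (The real encoding is more involved than this.) */
        size_t size = sorted[i]->len + num_positions * sizeof(int);
        char *serialized = malloc(size);
        /* ... encode the term text and the positions of sorted[i..j-1],
         * then hand the string off to the external sorter ... */
        free(serialized);

        i = j;
    }
}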

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


