On Apr 5, 2007, at 8:54 AM, Michael McCandless wrote:

> So you basically do not "de-dup" by field+term on your first pass
> through the tokens in the doc (which is "roughly" what that hash
> does).  Instead, you append all tokens to an array, then sort first
> by field+text and second by position?  This is done for each
> document, right?

Almost.  The sorting is done per-field.

Token doesn't have a "field", so comparison is cheaper than you're thinking.

int
Token_compare(const void *va, const void *vb)
{
    Token *const a = *((Token**)va);
    Token *const b = *((Token**)vb);

    /* Compare the shared prefix of the two term texts. */
    size_t min_len = a->len < b->len ? a->len : b->len;

    int comparison = memcmp(a->text, b->text, min_len);

    if (comparison == 0) {
        if (a->len != b->len) {
            /* One text is a prefix of the other; the shorter sorts first. */
            comparison = a->len < b->len ? -1 : 1;
        }
        else if (a->pos != b->pos) {
            /* Identical term text: break the tie by token position. */
            comparison = a->pos < b->pos ? -1 : 1;
        }
        /* Identical text and position: leave comparison at 0 so the
         * comparator stays consistent for qsort(). */
    }

    return comparison;
}
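
In case it helps to see the comparator in context, here's a minimal,
purely illustrative sketch -- the Token struct below is a stand-in with
just the three fields the comparator touches, and the field types are
my guesses rather than KS's actual definitions.  It assumes
Token_compare above is compiled alongside it.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in Token with only the fields Token_compare touches; the
 * field types are assumptions, not the real KS declarations. */
typedef struct Token {
    const char *text;
    size_t      len;
    int         pos;
} Token;

int Token_compare(const void *va, const void *vb);  /* definition above */

int
main(void)
{
    /* One field's tokens, in the order the analyzer produced them. */
    Token tokens[] = {
        { "quick", 5, 1 }, { "brown", 5, 2 }, { "quick", 5, 3 },
    };
    size_t num_tokens = sizeof(tokens) / sizeof(tokens[0]);

    /* The comparator dereferences Token**, so sort an array of pointers. */
    Token *sorted[3];
    for (size_t i = 0; i < num_tokens; i++) { sorted[i] = &tokens[i]; }
    qsort(sorted, num_tokens, sizeof(Token*), Token_compare);

    /* Prints: brown @ 2, quick @ 1, quick @ 3 -- duplicate term texts
     * end up adjacent, ordered by position. */
    for (size_t i = 0; i < num_tokens; i++) {
        printf("%.*s @ %d\n", (int)sorted[i]->len, sorted[i]->text,
               sorted[i]->pos);
    }
    return 0;
}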

> Did you ever compare this approach against a hash (or other de-dup
> data structure, a letter trie or something)?

KS used to use hashing, though it wasn't directly analogous to how Lucene does things. Those are the only two techniques I've tried. Sorting was faster by about 30%, but the difference is not all in the de-duping.

KS is concerned with preparing serialized postings to feed into an external sorter. In the hashing stratagem, every position added to a term_text => serialized_posting pair in the hash requires a string concatenation onto the end of serialized_posting, and thus a call to realloc().
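
To illustrate that cost pattern (this is a sketch, not the actual KS
hash code, and Posting_append_pos is a made-up name): every position
appended means another realloc() of the entry's serialized string.

#include <stdlib.h>
#include <string.h>

/* Illustrative only: one hash entry's serialized posting grows by one
 * realloc() per position appended.  Error handling omitted. */
typedef struct Posting {
    char   *bytes;   /* serialized posting so far */
    size_t  size;
} Posting;

static void
Posting_append_pos(Posting *posting, int pos)
{
    /* A position is encoded as a fixed-width int here for simplicity;
     * the real serialization format differs. */
    posting->bytes = realloc(posting->bytes, posting->size + sizeof(pos));
    memcpy(posting->bytes + posting->size, &pos, sizeof(pos));
    posting->size += sizeof(pos);
}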

Besides trading hashing overhead for qsort overhead, the sorting technique also allows KS to know up front how many positions are associated with each posting, so the memory for the serialized string only has to be allocated once.
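
A rough sketch of that idea, reusing the stand-in Token from earlier
(serialize_postings and same_text are invented names, and the actual
encoding is elided): once the field's tokens are sorted, each run of
identical term texts is one posting, and the run length supplies the
position count before any buffer is allocated.

#include <stdlib.h>
#include <string.h>

/* Token as assumed in the earlier sketch: text, len, pos. */
typedef struct Token { const char *text; size_t len; int pos; } Token;

static int
same_text(const Token *a, const Token *b)
{
    return a->len == b->len && memcmp(a->text, b->text, a->len) == 0;
}

static void
serialize_postings(Token **sorted, size_t num_tokens)
{
    for (size_t i = 0; i < num_tokens; ) {
        /* Find the end of the run of tokens sharing this term text. */
        size_t j = i + 1;
        while (j < num_tokens && same_text(sorted[i], sorted[j])) { j++; }
        size_t num_positions = j - i;

        /* Allocate the serialized string once, sized for the term text
         * plus every position, instead of one realloc() per position.
         * (The real encoding is more involved than this.) */
        size_t size = sorted[i]->len + num_positions * sizeof(int);
        char *serialized = malloc(size);
        /* ... encode the term text and the positions of sorted[i..j-1],
         * then hand the string off to the external sorter ... */
        free(serialized);

        i = j;
    }
}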

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


