Yonik Seeley wrote:
Totally untested, but here is a hack at what the scorer might look
like when the number of terms is large.

Looks plausible to me.

You could instead use a byte[maxDoc] and encode/decode floats as you store and read them, to use a lot less RAM.
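
Something along these lines (a minimal, untested sketch; the linear
quantization and the maxScore bound are my assumptions here, and Lucene's
Similarity.encodeNorm()/decodeNorm() would be another way to pack a float
into a byte):

  class ByteScoreArray {
    private final byte[] scores;   // one encoded score byte per document
    private final float maxScore;  // assumed known upper bound on scores

    ByteScoreArray(int maxDoc, float maxScore) {
      this.scores = new byte[maxDoc];
      this.maxScore = maxScore;
    }

    void set(int doc, float score) {
      // quantize [0, maxScore] linearly into 0..255
      int q = Math.round((score / maxScore) * 255f);
      scores[doc] = (byte) Math.max(0, Math.min(255, q));
    }

    float get(int doc) {
      return ((scores[doc] & 0xFF) / 255f) * maxScore;
    }
  }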

  // could also use a bitset to keep track of docs in the set...

I think that is probably a very important optimization.
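
For instance (again just an untested sketch, not the actual patch): keep a
java.util.BitSet alongside the score array, set a bit for every doc that
matches at least one term, and let the scorer walk the bitset rather than
every doc id:

  import java.util.BitSet;

  class DocScoreSet {
    final BitSet matched;  // one bit per doc: is this doc in the result set?
    final byte[] scores;   // encoded score per doc (see the byte[] sketch above)

    DocScoreSet(int maxDoc) {
      matched = new BitSet(maxDoc);
      scores = new byte[maxDoc];
    }

    void add(int doc, byte encodedScore) {
      matched.set(doc);
      scores[doc] = encodedScore;
    }

    // next matching doc at or after the given doc, or -1 if there is none
    int nextMatch(int doc) {
      return matched.nextSetBit(doc);
    }
  }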

If you implemented both of these suggestions, this would use 9 bits/doc (a byte for the encoded score plus one bit in the bitset), instead of 33 bits/doc (a float plus one bit). With a 100M doc index, that would be the difference between roughly 112MB/query and 412MB/query. The classic term-expanding approach uses perhaps 2k/term. So, with a 100M document index, the byte-array approach uses less memory for queries which expand to more than about 56k terms. The float-array method uses less memory for queries with more than about 206k terms.
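
For what it's worth, a quick sanity check of that arithmetic (assuming 2,000
bytes per expanded term and decimal megabytes):

  public class MemoryBreakEven {
    public static void main(String[] args) {
      long maxDoc = 100000000L;      // 100M docs
      long bytesPerTerm = 2000L;     // rough cost of classic term expansion

      long byteApproach = maxDoc * 9 / 8;    // byte + bitset: ~112.5 MB
      long floatApproach = maxDoc * 33 / 8;  // float + bitset: ~412.5 MB

      System.out.println("byte+bitset:  " + byteApproach / 1000000 + " MB, break-even at "
          + byteApproach / bytesPerTerm + " terms");
      System.out.println("float+bitset: " + floatApproach / 1000000 + " MB, break-even at "
          + floatApproach / bytesPerTerm + " terms");
    }
  }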

Doug
