On May 18, 2006, at 10:43 AM, Robert Engels wrote:

Has anyone thought of (or implemented) caching of term information?

Currently, Lucene stores an index of every nth term. It then uses this
index to position the TermEnum, and scans the terms from there.
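The two-level lookup Robert describes could be sketched roughly like this (a simplified model, not Lucene's actual classes; the names and the tiny index interval are invented for illustration -- Lucene's real default interval is 128):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the two-level term lookup: an in-memory index
// holds every nth term; lookup binary-searches the sparse index, then
// scans forward linearly (as if reading the .tis file) to the target.
public class TermLookupSketch {
    static final int INDEX_INTERVAL = 4;           // illustrative; Lucene uses 128
    final List<String> terms;                      // sorted term dictionary ("on disk")
    final List<String> index = new ArrayList<>();  // every nth term ("in RAM")
    final List<Integer> offsets = new ArrayList<>();

    TermLookupSketch(List<String> sortedTerms) {
        this.terms = sortedTerms;
        for (int i = 0; i < terms.size(); i += INDEX_INTERVAL) {
            index.add(terms.get(i));
            offsets.add(i);
        }
    }

    // Return the position of `target`, or -1 if absent.
    int seek(String target) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {                         // binary search the sparse index
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).compareTo(target) <= 0) {
                start = offsets.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan within one index block.
        for (int i = start; i < terms.size() && i < start + INDEX_INTERVAL; i++) {
            int cmp = terms.get(i).compareTo(target);
            if (cmp == 0) return i;
            if (cmp > 0) break;
        }
        return -1;
    }
}
```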

Might it be better to read a "page" of term infos (based on the index), and
then keep these pages in a SoftCache in the SegmentTermEnum?
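A SoftCache along those lines might look like the following in Java -- a hypothetical sketch built on java.lang.ref.SoftReference, not an actual Lucene class:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical page cache for blocks of term infos: pages are held
// through SoftReferences, so the GC may reclaim them under memory
// pressure, and the caller falls back to re-reading from disk.
public class SoftTermPageCache<P> {
    private final Map<Integer, SoftReference<P>> pages = new HashMap<>();

    public void put(int pageNumber, P page) {
        pages.put(pageNumber, new SoftReference<>(page));
    }

    // Returns the cached page, or null if never cached or reclaimed.
    public P get(int pageNumber) {
        SoftReference<P> ref = pages.get(pageNumber);
        return ref == null ? null : ref.get();
    }
}
```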

I'd thought about just making it possible to load up the whole Term Dictionary. Dangerous for large indexes, but interesting. The Google 98 paper indicates that they got their whole dictionary into RAM.

The thing about caching pages of the dictionary is that I don't think that heavily searched terms will be concentrated in one page, so pages would probably get swapped in and out of the cache a lot. I'm not familiar with SoftCache, though.

KinoSearch currently caches SegmentTermEnum entries as bytestrings, or more accurately as "ByteBuf" C structs modeled on Java's ByteBuffer, which are basically an array of char, a length, and a capacity. Each bytestring consists of the field number as a big-endian 16-bit int, followed by the term text. Since field numbers in KinoSearch are forced to correspond to lexically sorted field names, the bytestrings sort correctly.
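The layout above can be sketched with Java's own ByteBuffer (field numbers and terms here are made up; KinoSearch's real ByteBuf is a C struct). Because the field number is stored big-endian and field numbers follow lexically sorted field names, an unsigned bytewise comparison of the packed keys matches (field, text) sort order:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Pack a term key as a big-endian 16-bit field number followed by the
// term text, mirroring the bytestring layout described for KinoSearch.
public class TermKey {
    public static byte[] pack(int fieldNum, String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + textBytes.length);
        buf.putShort((short) fieldNum);   // ByteBuffer is big-endian by default
        buf.put(textBytes);
        return buf.array();
    }

    // Unsigned lexicographic comparison of two packed keys.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```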

The ByteBufs don't take up a lot of space, and they could be even smaller if they used VInts for the field number. If we load everything up, then locating a term in the .tis file can be achieved with a binary search. Pay RAM to buy speed.
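With every key held in a sorted in-memory array, the lookup is a plain binary search -- again a sketch under assumed names; for simplicity the keys here are bare UTF-8 term text, where the real keys would carry the field number prefix too:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: with all packed term keys in a sorted array in RAM, locating
// a term is a single binary search instead of an index probe plus a
// linear scan of the .tis file.
public class InMemoryTermDict {
    private final byte[][] keys;   // sorted by unsigned bytewise order

    public InMemoryTermDict(byte[][] sortedKeys) {
        this.keys = sortedKeys;
    }

    // Returns the slot of `term`, or a negative value if absent
    // (same convention as java.util.Arrays.binarySearch).
    public int find(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        return Arrays.binarySearch(keys, target, Arrays::compareUnsigned);
    }
}
```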

It might also make sense to just load up the raw .tis file into RAM. That would require even less memory, and would eliminate the disk seeks, but would still have to be traversed linearly and decompressed.
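That variant might look like this -- a sketch, not KinoSearch or Lucene code, with the class name invented: slurp the file once, then do all positioning against the in-memory copy. Seeks become array offsets, but the data still has to be walked sequentially and its compression undone, just as on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: read the raw term-dictionary file fully into memory, then
// hand out sequential decoders over the in-memory bytes. No further
// disk I/O, but traversal is still linear from each start offset.
public class InMemoryTisReader {
    private final byte[] raw;

    public InMemoryTisReader(Path tisFile) throws IOException {
        this.raw = Files.readAllBytes(tisFile);   // one read, then RAM only
    }

    // Open a sequential decoder starting at `offset` within the file.
    public DataInputStream decoderAt(int offset) {
        return new DataInputStream(
            new ByteArrayInputStream(raw, offset, raw.length - offset));
    }

    public int length() {
        return raw.length;
    }
}
```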

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

