On May 18, 2006, at 10:43 AM, Robert Engels wrote:

Has anyone thought of (or implemented) caching of term information?

Currently, Lucene stores an index of every nth term. It then uses this
index to position the TermEnum, and scans the terms from there.
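The two-level lookup Robert describes could be sketched roughly like this (a simplified model, not Lucene's actual classes; the names and the tiny index interval are invented for illustration -- Lucene's real default interval is 128):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the two-level term lookup: an in-memory index
// holds every nth term; lookup binary-searches the sparse index, then
// scans forward linearly (as if reading the .tis file) to the target.
public class TermLookupSketch {
    static final int INDEX_INTERVAL = 4;           // illustrative; Lucene uses 128
    final List<String> terms;                      // sorted term dictionary ("on disk")
    final List<String> index = new ArrayList<>();  // every nth term ("in RAM")
    final List<Integer> offsets = new ArrayList<>();

    TermLookupSketch(List<String> sortedTerms) {
        this.terms = sortedTerms;
        for (int i = 0; i < terms.size(); i += INDEX_INTERVAL) {
            index.add(terms.get(i));
            offsets.add(i);
        }
    }

    // Return the position of `target`, or -1 if absent.
    int seek(String target) {
        int lo = 0, hi = index.size() - 1, start = 0;
        while (lo <= hi) {                         // binary search the sparse index
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).compareTo(target) <= 0) {
                start = offsets.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan within one index block.
        for (int i = start; i < terms.size() && i < start + INDEX_INTERVAL; i++) {
            int cmp = terms.get(i).compareTo(target);
            if (cmp == 0) return i;
            if (cmp > 0) break;
        }
        return -1;
    }
}
```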

Might it be better to read a "page" of term infos (based on the index), and
then keep these pages in a SoftCache in the SegmentTermEnum?
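A SoftCache along those lines might look like the following in Java -- a hypothetical sketch built on java.lang.ref.SoftReference, not an actual Lucene class:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical page cache for blocks of term infos: pages are held
// through SoftReferences, so the GC may reclaim them under memory
// pressure, and the caller falls back to re-reading from disk.
public class SoftTermPageCache<P> {
    private final Map<Integer, SoftReference<P>> pages = new HashMap<>();

    public void put(int pageNumber, P page) {
        pages.put(pageNumber, new SoftReference<>(page));
    }

    // Returns the cached page, or null if never cached or reclaimed.
    public P get(int pageNumber) {
        SoftReference<P> ref = pages.get(pageNumber);
        return ref == null ? null : ref.get();
    }
}
```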

I'd thought about just making it possible to load up the whole Term Dictionary. Dangerous for large indexes, but interesting. The Google 98 paper indicates that they got their whole dictionary into RAM.

The thing about caching pages of the dictionary is that I don't think that heavily searched terms will be concentrated in one page, so pages would probably get swapped in and out of the cache a lot. I'm not familiar with SoftCache, though.

KinoSearch currently caches SegmentTermEnum entries as bytestrings, or more accurately as "ByteBuf" C structs modeled on Java's ByteBuffer, which are basically an array of char, a length, and a capacity. Each bytestring consists of the field number as a big-endian 16-bit int, followed by the term text. Since field numbers in KinoSearch are forced to correspond to lexically sorted field names, the bytestrings sort correctly.
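The layout above can be sketched with Java's own ByteBuffer (field numbers and terms here are made up; KinoSearch's real ByteBuf is a C struct). Because the field number is stored big-endian and field numbers follow lexically sorted field names, an unsigned bytewise comparison of the packed keys matches (field, text) sort order:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Pack a term key as a big-endian 16-bit field number followed by the
// term text, mirroring the bytestring layout described for KinoSearch.
public class TermKey {
    public static byte[] pack(int fieldNum, String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + textBytes.length);
        buf.putShort((short) fieldNum);   // ByteBuffer is big-endian by default
        buf.put(textBytes);
        return buf.array();
    }

    // Unsigned lexicographic comparison of two packed keys.
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```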

The ByteBufs don't take up a lot of space, and they could be even smaller if they used VInts for the field number. If we load everything up, then locating a term in the .tis file can be achieved with a binary search. Pay RAM to buy speed.
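With every key held in a sorted in-memory array, the lookup is a plain binary search -- again a sketch under assumed names; for simplicity the keys here are bare UTF-8 term text, where the real keys would carry the field number prefix too:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: with all packed term keys in a sorted array in RAM, locating
// a term is a single binary search instead of an index probe plus a
// linear scan of the .tis file.
public class InMemoryTermDict {
    private final byte[][] keys;   // sorted by unsigned bytewise order

    public InMemoryTermDict(byte[][] sortedKeys) {
        this.keys = sortedKeys;
    }

    // Returns the slot of `term`, or a negative value if absent
    // (same convention as java.util.Arrays.binarySearch).
    public int find(String term) {
        byte[] target = term.getBytes(StandardCharsets.UTF_8);
        return Arrays.binarySearch(keys, target, Arrays::compareUnsigned);
    }
}
```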

It might also make sense to just load up the raw .tis file into RAM. That would require even less memory, and would eliminate the disk seeks, but would still have to be traversed linearly and decompressed.
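That variant might look like this -- a sketch, not KinoSearch or Lucene code, with the class name invented: slurp the file once, then do all positioning against the in-memory copy. Seeks become array offsets, but the data still has to be walked sequentially and its compression undone, just as on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: read the raw term-dictionary file fully into memory, then
// hand out sequential decoders over the in-memory bytes. No further
// disk I/O, but traversal is still linear from each start offset.
public class InMemoryTisReader {
    private final byte[] raw;

    public InMemoryTisReader(Path tisFile) throws IOException {
        this.raw = Files.readAllBytes(tisFile);   // one read, then RAM only
    }

    // Open a sequential decoder starting at `offset` within the file.
    public DataInputStream decoderAt(int offset) {
        return new DataInputStream(
            new ByteArrayInputStream(raw, offset, raw.length - offset));
    }

    public int length() {
        return raw.length;
    }
}
```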

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

