Maybe don't cache the term pages, then; just cache the frequently requested terms themselves.
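Such a term cache could be sketched with LinkedHashMap's access-order mode; the class and type names here are illustrative, not Lucene API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for per-term lookups, sketched with
// LinkedHashMap in access-order mode. The value type would be
// whatever the term dictionary stores per term (e.g. a TermInfo).
public class TermInfoCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public TermInfoCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, i.e. LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once we exceed capacity.
        return size() > maxEntries;
    }
}
```

The lookup path would check this cache first and fall back to the binary search and page scan only on a miss, inserting the result afterward. (A SoftReference-valued map would give the "soft" variant, trading determinism for GC-driven sizing.)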
The current scheme is: binary search the index for the term.

- If you get a direct match, you are done.
- If not, you have the page offset; check the next page's start term. If it is greater than the term, the term does not exist.
- If not, scan the page for the term.

The problem with the above is that frequently searched terms do not get a benefit. It seems that if, before doing the above, you checked an LRU/soft cache for the term, you could improve performance (no page scanning, which entails byte/char reading and conversions). It would not help the performance of term enumeration, but turning common prefix queries into a constant-scoring filter seems better anyway.

-----Original Message-----
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 18, 2006 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: caching term information?

On May 18, 2006, at 10:43 AM, Robert Engels wrote:

> Has anyone thought of (or implemented) caching of term information?
>
> Currently, Lucene stores an index of every nth term. It then uses this
> information to position the TermEnum, and then scans the terms.
>
> Might it be better to read a "page" of term infos (based on the
> index), and then keep these pages in a SoftCache in the
> SegmentTermEnum?

I'd thought about just making it possible to load up the whole Term Dictionary. Dangerous for large indexes, but interesting. The Google '98 paper indicates that they got their whole dictionary into RAM.

The thing about caching pages of the dictionary is that I don't think heavily searched terms will be concentrated in one page, so it would probably get swapped a lot. I'm not familiar with SoftCache, though.

KinoSearch currently caches SegmentTermEnum entries as bytestrings, or more accurately "ByteBuf" C structs modeled on Java's ByteBuffer, which are basically an array of char, a length, and a capacity. Each bytestring consists of the field number as a big-endian 16-bit int, followed by the term text.
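The bytestring layout described above (a big-endian 16-bit field number followed by the term text) can be sketched with Java's ByteBuffer; the class and method names are illustrative only. Because the field-number prefix is big endian, a plain unsigned byte-wise comparison sorts keys first by field, then by term text:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of a ByteBuf-style term key: big-endian 16-bit field number
// followed by the UTF-8 term text. Field numbers are assumed to fit
// in an unsigned 16-bit range.
public final class TermKey {
    static byte[] encode(int fieldNum, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length);
        buf.putShort((short) fieldNum); // ByteBuffer is big endian by default
        buf.put(utf8);
        return buf.array();
    }

    // Unsigned lexicographic comparison of two encoded keys. With the
    // big-endian field prefix, this orders by (field, term text).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return a.length - b.length;
    }
}
```

An in-memory array of such keys, kept sorted, can then be binary searched directly with `compare`, which is the "load everything up and binary search" idea discussed next.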
Since field numbers in KinoSearch are forced to correspond to lexically sorted field names, these sort correctly. The ByteBufs don't take up a lot of space, and they could be even smaller if they used VInts for the field number.

If we load everything up, then locating a term in the .tis file can be achieved with a binary search. Pay RAM to buy speed.

It might also make sense to just load the raw .tis file into RAM. That would require even less memory and would eliminate the disk seeks, but it would still have to be traversed linearly and decompressed.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]