Maybe don't cache the term pages, then; just cache the frequently requested
terms themselves.

The current scheme is a binary search of the index for the term (sketched
below):
    - if you get a direct match, you are done.
    - if not, you have the page offset; check the next page's start term, and
if it is greater, the term does not exist.
    - if not, scan the page for the term.
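Roughly, in code (a sketch with made-up names, not the actual
TermInfosReader/SegmentTermEnum implementation):

    import java.util.Arrays;

    /** Sketch of the current lookup: binary search the in-memory index of
     *  every Nth term, then scan the matching page if there is no direct hit. */
    class TermLookupSketch {
        private final String[] indexTerms;  // every Nth term, sorted
        private final long[] pageOffsets;   // .tis file offset of each page

        TermLookupSketch(String[] indexTerms, long[] pageOffsets) {
            this.indexTerms = indexTerms;
            this.pageOffsets = pageOffsets;
        }

        /** Returns the offset of the page to scan, or -1 for a provable miss. */
        long locate(String term) {
            int pos = Arrays.binarySearch(indexTerms, term);
            if (pos >= 0) {
                return pageOffsets[pos];    // direct match on an index entry
            }
            int page = -pos - 2;            // greatest index entry < term
            if (page < 0) {
                return -1;                  // sorts before the first index entry
            }
            // Caller scans this page term by term (byte/char reads and
            // conversions) until it finds the term or passes it.
            return pageOffsets[page];
        }
    }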

The problem with the above is that frequently searched terms do not get any
benefit.

It seems that if, before doing the above, you checked an LRU/soft cache for
the term, you could improve performance (no page scanning, which entails
byte/char reading & conversions).
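Something along these lines, say (a sketch only; LinkedHashMap is used as the
LRU here, though a SoftReference-valued map would behave much the same):

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** LRU cache of term text -> cached term info, consulted before the
     *  usual index binary search and page scan.  V stands in for whatever
     *  gets cached per term (docFreq, freq/prox file pointers, etc.). */
    class TermInfoCache<V> {
        private final Map<String, V> lru;

        TermInfoCache(final int capacity) {
            // access-ordered map that evicts the least recently used entry
            lru = new LinkedHashMap<String, V>(capacity, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                    return size() > capacity;
                }
            };
        }

        synchronized V get(String term)            { return lru.get(term); }
        synchronized void put(String term, V info) { lru.put(term, info); }
    }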

It would not help performance for term enumeration, but turning common
prefix queries into a constant-scoring filter seems better anyway.
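For what it's worth, a constant-scoring prefix filter can be sketched with
the existing Filter/TermEnum/TermDocs APIs (illustrative only, not the
shipping PrefixQuery code; wrap it in a ConstantScoreQuery to give every
matching doc the same score):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Filter;

    /** Collects every doc matching field:prefix* into a BitSet. */
    class PrefixBitSetFilter extends Filter {
        private final Term prefix;

        PrefixBitSetFilter(Term prefix) { this.prefix = prefix; }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet result = new BitSet(reader.maxDoc());
            TermEnum terms = reader.terms(prefix);  // first term >= prefix
            TermDocs docs = reader.termDocs();
            try {
                do {
                    Term t = terms.term();
                    if (t == null
                        || !t.field().equals(prefix.field())
                        || !t.text().startsWith(prefix.text())) {
                        break;                      // past the prefix range
                    }
                    docs.seek(t);                   // reuse TermDocs per term
                    while (docs.next()) {
                        result.set(docs.doc());
                    }
                } while (terms.next());
            } finally {
                terms.close();
                docs.close();
            }
            return result;
        }
    }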


-----Original Message-----
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 18, 2006 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: caching term information?



On May 18, 2006, at 10:43 AM, Robert Engels wrote:

> Has anyone thought of (or implemented) caching of term information?
>
> Currently, Lucene stores an index of every Nth term, then uses this
> information to position the TermEnum and scans the terms.
>
> Might it be better to read a "page" of term infos (based on the
> index), and
> then keep these pages in a SoftCache in the SegmentTermEnum ?

I'd thought about just making it possible to load up the whole Term
Dictionary.  Dangerous for large indexes, but interesting.  The
Google 98 paper indicates that they got their whole dictionary into RAM.

The thing about caching pages of the dictionary is that I don't think
that heavily searched terms will be concentrated in one page, so it
would probably get swapped a lot.  I'm not familiar with SoftCache,
though.

KinoSearch currently caches SegmentTermEnum entries as bytestrings,
or more accurately "ByteBuf" C structs modeled on Java's ByteBuffer,
which are basically an array of char, a length, and a capacity.  Each
bytestring consists of the field number as a big-endian 16-bit int,
followed by the term text.  Since field numbers in KinoSearch are
forced to correspond to lexically sorted field names, those sort
correctly.
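In Java terms, the encoding amounts to something like this (a sketch; the
real ByteBufs are C structs, and these names are made up):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    /** Encodes (fieldNumber, termText) so that an unsigned byte-wise
     *  comparison of the encoded forms matches (field, text) sort order. */
    class TermKey {
        static byte[] encode(int fieldNumber, String text) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write((fieldNumber >>> 8) & 0xFF);  // big-endian 16-bit field number
            out.write(fieldNumber & 0xFF);
            out.write(text.getBytes("UTF-8"));      // term text
            return out.toByteArray();
        }

        /** Unsigned lexicographic comparison of two encoded keys. */
        static int compare(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) return diff;
            }
            return a.length - b.length;
        }
    }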

The ByteBufs don't take up a lot of space, and they could be even
smaller if they used VInts for field number.  If we load everything
up, then locating a term in the .tis file can be achieved with a
binary search.  Pay RAM to buy speed.
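With all of the encoded entries held in memory, the lookup reduces to a
plain binary search (again a sketch, reusing TermKey.compare from above):

    class TermDictSearch {
        /** Returns the slot of the term in the fully loaded, sorted key
         *  array, or (-(insertion point) - 1) if it is absent. */
        static int findTerm(byte[][] sortedKeys, byte[] target) {
            int low = 0, high = sortedKeys.length - 1;
            while (low <= high) {
                int mid = (low + high) >>> 1;
                int cmp = TermKey.compare(sortedKeys[mid], target);
                if (cmp < 0) {
                    low = mid + 1;
                } else if (cmp > 0) {
                    high = mid - 1;
                } else {
                    return mid;             // exact hit: no disk seek needed
                }
            }
            return -(low + 1);
        }
    }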

It might also make sense to just load up the raw .tis file into RAM.
That would require even less memory, and would eliminate the disk
seeks, but would still have to be traversed linearly and decompressed.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

