Roughly, the current approach for the default terms dict codec in
LUCENE-1458 is:
* Create a separate class per field (the String field in each Term
is redundant). This is a big change from how Lucene works today...
* That class has String[] indexText and long[] indexPointer, each
with length equal to the number of indexed terms; no TermInfo or
Term instances are used (see the sketch after this list).
* Modify the tis format to also store its data by field.
* Modify the tis format so that at a seek point (ie an indexed
term), absolute values are written for the freq/prox pointers, but
continue to delta-code in between indexed terms. E.g., this is how
video codecs work (every so often they write a "key frame" which
you can seek to & immediately decode w/ no prior context); see the
scan sketch below.
* tii then just stores text/long (delta coded) for all indexed
terms, and is slurped into the arrays on init.
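
Concretely, a per-field index class could look something like this
(just a sketch -- these names are made up, not the actual LUCENE-1458
code):

  // Sketch only: holds the terms index for one field as parallel
  // arrays, instead of one Term + TermInfo object per indexed term.
  class FieldTermsIndex {
    final String[] indexText;   // text of each indexed term
    final long[] indexPointer;  // tis file pointer for each indexed term

    FieldTermsIndex(String[] indexText, long[] indexPointer) {
      this.indexText = indexText;
      this.indexPointer = indexPointer;
    }

    // Binary search for the last indexed term <= text; its pointer
    // is where we start scanning the tis.  Returns -1 if text sorts
    // before every indexed term.
    long seekPointer(String text) {
      int lo = 0, hi = indexText.length - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        int cmp = indexText[mid].compareTo(text);
        if (cmp < 0) lo = mid + 1;
        else if (cmp > 0) hi = mid - 1;
        else return indexPointer[mid];
      }
      return hi < 0 ? -1 : indexPointer[hi];
    }
  }
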
This is a sizable RAM savings over what's done now, because you save 2
objects, 3 pointers, 2 longs, and 2 ints (I think) per indexed term.
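
The seek-then-scan through the tis would then go roughly like this
(again just a sketch; IndexInput is Lucene's real file abstraction
from org.apache.lucene.store, but readTermText is a made-up helper
standing in for however the term text itself is decoded):

  // Sketch: start at a "key frame" (an indexed term), where the
  // freq/prox pointers were written as absolute values, then apply
  // deltas until we reach the target term.
  void scanTo(IndexInput tis, long seekPointer, String target) throws IOException {
    tis.seek(seekPointer);
    long freqPointer = tis.readVLong();  // absolute at the key frame
    long proxPointer = tis.readVLong();
    String text = readTermText(tis);     // made-up helper
    while (text.compareTo(target) < 0) {
      freqPointer += tis.readVLong();    // delta-coded in between key frames
      proxPointer += tis.readVLong();
      text = readTermText(tis);
    }
    // now positioned on the first term >= target
  }
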
Mike
On Wed, Jun 10, 2009 at 2:02 PM, Jason Rutherglen <[email protected]> wrote:
>> LUCENE-1458 (flexible indexing) has these improvements,
>
> Mike, can you explain how it's different? I looked through the code once,
> but it's mixed in with a lot of other changes.
>
> On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless <
> [email protected]> wrote:
>
>> This (very large number of unique terms) is a problem for Lucene currently.
>>
>> There are some simple improvements we could make to the terms dict
>> format to not require so much RAM per term in the terms index...
>> LUCENE-1458 (flexible indexing) has these improvements, but
>> unfortunately they're tied in w/ lots of other changes. Maybe we
>> should break out a separate issue for this... it'd be a great
>> contained improvement, if anyone out there has "the itch" :)
>>
>> One simple workaround is to call IndexReader.setTermInfosIndexDivisor
>> immediately after opening the reader; this loads only every Nth term
>> into the terms index, using far less RAM, but at the expense of
>> somewhat slower searching.
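>>
>> For example (a sketch; "dir" here is whatever Directory you opened,
>> and the divisor must be set before the terms index is first loaded):
>>
>>   IndexReader reader = IndexReader.open(dir);
>>   // keep only every 4th indexed term in RAM (the default divisor is 1)
>>   reader.setTermInfosIndexDivisor(4);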
>>
>> Also: you should peek at your index, eg using Luke, to understand why
>> you have so many terms. It could be legitimate (indexing a massive
>> catalog with eg part numbers), or it could be that your document
>> filtering / analyzer is accidentally producing garbage terms.
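>>
>> If you'd rather check programmatically than with Luke, something like
>> this (a sketch; needs org.apache.lucene.index.* and java.util.*)
>> prints how many terms each field has:
>>
>>   IndexReader r = IndexReader.open(dir);
>>   TermEnum te = r.terms();
>>   Map<String,Integer> counts = new HashMap<String,Integer>();
>>   while (te.next()) {
>>     String field = te.term().field();
>>     Integer c = counts.get(field);
>>     counts.put(field, c == null ? 1 : c + 1);
>>   }
>>   te.close();
>>   System.out.println(counts);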
>>
>> Mike
>>
>> On Wed, Jun 10, 2009 at 8:23 AM, Benedikt Boss <[email protected]> wrote:
>> > Hey hey,
>> >
>> > I have a question regarding Lucene's memory usage
>> > when launching a query. When I execute my query,
>> > Lucene eats up over 1 GB of heap memory even
>> > when my result set is only a single hit. I
>> > found out that this is due to the "ensureIndexIsRead()"
>> > method call in the "TermInfosReader" class, which
>> > iterates over all terms found in the index and saves
>> > them (including all value strings) in a Term array.
>> > Is it possible to avoid reading all of that
>> > into memory?
>> >
>> > I'm running the query as in the following pseudo-code:
>> > ------------------------------------------------------------------------
>> >
>> > // needs: org.apache.lucene.analysis.WhitespaceAnalyzer,
>> > //        org.apache.lucene.queryParser.QueryParser,
>> > //        org.apache.lucene.search.*, org.apache.lucene.store.*
>> > TopScoreDocCollector collector = new TopScoreDocCollector(100000);
>> >
>> > QueryParser parser = new QueryParser(field, new WhitespaceAnalyzer());
>> > Directory fsDir = FSDirectory.getDirectory(indexDir); // FSDirectory's constructor is not public
>> > IndexSearcher is = new IndexSearcher(fsDir);
>> >
>> > Query query = parser.parse(q); // may throw ParseException
>> >
>> > is.search(query, collector);
>> > ScoreDoc[] hits = collector.topDocs().scoreDocs; // topDocs() returns a TopDocs, not ScoreDoc[]
>> >
>> > // ... iterate over hits and print results
>> >
>> >
>> > Thanks in advance
>> > Benedikt
>> >