Roughly, the current approach for the default terms dict codec in
LUCENE-1458 is:
* Create a separate class per field (the String field in each Term
is redundant). This is a big change from how Lucene works today...
* That class has String[] indexText and long[] indexPointer, each
with length equal to the number of indexed terms; no TermInfo or
Term instances are used (see the sketch after this list).
* Modify the tis format to also store its data by field.
* Modify the tis format so that at a seek point (ie an indexed
term), absolute values are written for the freq/prox pointers, but
continue to delta-code in between indexed terms. E.g., this is how
video codecs work (every so often they write a "key frame" which
you can seek to & immediately decode w/ no prior context); see the
scan sketch below.
* tii then just stores text/long (delta coded) for all indexed
terms, and is slurped into the arrays on init.
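
Concretely, a per-field index class could look something like this
(just a sketch -- these names are made up, not the actual LUCENE-1458
code):

  // Sketch only: holds the terms index for one field as parallel
  // arrays, instead of one Term + TermInfo object per indexed term.
  class FieldTermsIndex {
    final String[] indexText;   // text of each indexed term
    final long[] indexPointer;  // tis file pointer for each indexed term

    FieldTermsIndex(String[] indexText, long[] indexPointer) {
      this.indexText = indexText;
      this.indexPointer = indexPointer;
    }

    // Binary search for the last indexed term <= text; its pointer
    // is where we start scanning the tis.  Returns -1 if text sorts
    // before every indexed term.
    long seekPointer(String text) {
      int lo = 0, hi = indexText.length - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        int cmp = indexText[mid].compareTo(text);
        if (cmp < 0) lo = mid + 1;
        else if (cmp > 0) hi = mid - 1;
        else return indexPointer[mid];
      }
      return hi < 0 ? -1 : indexPointer[hi];
    }
  }
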
This is a sizable RAM savings over what's done now, because you save 2
objects, 3 pointers, 2 longs, and 2 ints (I think) per indexed term.
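
The seek-then-scan through the tis would then go roughly like this
(again just a sketch; IndexInput is Lucene's real file abstraction
from org.apache.lucene.store, but readTermText is a made-up helper
standing in for however the term text itself is decoded):

  // Sketch: start at a "key frame" (an indexed term), where the
  // freq/prox pointers were written as absolute values, then apply
  // deltas until we reach the target term.
  void scanTo(IndexInput tis, long seekPointer, String target) throws IOException {
    tis.seek(seekPointer);
    long freqPointer = tis.readVLong();  // absolute at the key frame
    long proxPointer = tis.readVLong();
    String text = readTermText(tis);     // made-up helper
    while (text.compareTo(target) < 0) {
      freqPointer += tis.readVLong();    // delta-coded in between key frames
      proxPointer += tis.readVLong();
      text = readTermText(tis);
    }
    // now positioned on the first term >= target
  }
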
Mike
On Wed, Jun 10, 2009 at 2:02 PM, Jason Rutherglen <[email protected]> wrote:
>> LUCENE-1458 (flexible indexing) has these improvements,
>
> Mike, can you explain how it's different? I looked through the code once,
> but it's mixed in with a lot of other changes.
>
> On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless <
> [email protected]> wrote:
>
>> This (very large number of unique terms) is a problem for Lucene currently.
>>
>> There are some simple improvements we could make to the terms dict
>> format to not require so much RAM per term in the terms index...
>> LUCENE-1458 (flexible indexing) has these improvements, but
>> unfortunately they're tied in w/ lots of other changes. Maybe we
>> should break out a separate issue for this... it'd be a great
>> contained improvement, if anyone out there has "the itch" :)
>>
>> One simple workaround is to call IndexReader.setTermInfosIndexDivisor
>> immediately after opening the reader; this loads only every Nth term
>> into the terms index, using far less RAM, but at the expense of
>> somewhat slower searching.
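>>
>> For example (a sketch; "dir" here is whatever Directory you opened,
>> and the divisor must be set before the terms index is first loaded):
>>
>>   IndexReader reader = IndexReader.open(dir);
>>   // keep only every 4th indexed term in RAM (the default divisor is 1)
>>   reader.setTermInfosIndexDivisor(4);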
>>
>> Also: you should peek at your index, eg using Luke, to understand why
>> you have so many terms. It could be legitimate (indexing a massive
>> catalog with eg part numbers), or it could be that your document
>> filtering / analyzer is accidentally producing garbage terms.
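>>
>> If you'd rather check programmatically than with Luke, something like
>> this (a sketch; needs org.apache.lucene.index.* and java.util.*)
>> prints how many terms each field has:
>>
>>   IndexReader r = IndexReader.open(dir);
>>   TermEnum te = r.terms();
>>   Map<String,Integer> counts = new HashMap<String,Integer>();
>>   while (te.next()) {
>>     String field = te.term().field();
>>     Integer c = counts.get(field);
>>     counts.put(field, c == null ? 1 : c + 1);
>>   }
>>   te.close();
>>   System.out.println(counts);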
>>
>> Mike
>>
>> On Wed, Jun 10, 2009 at 8:23 AM, Benedikt Boss <[email protected]> wrote:
>> > Hey hey,
>> >
>> > I have a question regarding Lucene's memory usage
>> > when launching a query. When I execute my query,
>> > Lucene eats up over 1 GB of heap memory even
>> > when my result set is only a single hit. I
>> > found out that this is due to the "ensureIndexIsRead()"
>> > method call in the "TermInfosReader" class, which
>> > iterates over all terms found in the index and saves
>> > them (including all value strings) in a Term array.
>> > Is it possible to avoid reading all of that
>> > into memory?
>> >
>> > I'm running the query as in the following pseudo-code:
>> > ------------------------------------------------------------------------
>> >
>> > // needs: org.apache.lucene.analysis.WhitespaceAnalyzer,
>> > //        org.apache.lucene.queryParser.QueryParser,
>> > //        org.apache.lucene.search.*, org.apache.lucene.store.*
>> > TopScoreDocCollector collector = new TopScoreDocCollector(100000);
>> >
>> > QueryParser parser = new QueryParser(field, new WhitespaceAnalyzer());
>> > Directory fsDir = FSDirectory.getDirectory(indexDir); // FSDirectory's constructor is not public
>> > IndexSearcher is = new IndexSearcher(fsDir);
>> >
>> > Query query = parser.parse(q); // may throw ParseException
>> >
>> > is.search(query, collector);
>> > ScoreDoc[] hits = collector.topDocs().scoreDocs; // topDocs() returns a TopDocs, not ScoreDoc[]
>> >
>> > // ... iterate over hits and print results
>> >
>> >
>> > Thanks in advance
>> > Benedikt
>> >