How long does Lucene take to build the ords for the top-level reader? You should be able to just time FieldCache.getStringIndex(topLevelReader).
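
[A minimal sketch of the timing Mike suggests, assuming the Lucene 2.4-era API; the class name and the sort field "title" (taken from Toke's test below) are illustrative only:]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class TimeStringIndex {
      public static void main(String[] args) throws Exception {
        // Open the top-level (multi-segment) reader for the index directory given as args[0].
        IndexReader reader = IndexReader.open(args[0]);
        try {
          long start = System.currentTimeMillis();
          // Builds the order/lookup arrays for the whole reader in one go.
          FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(reader, "title");
          long elapsed = System.currentTimeMillis() - start;
          System.out.println("getStringIndex: " + elapsed + " ms, lookup size "
              + si.lookup.length + ", " + reader.maxDoc() + " docs");
        } finally {
          reader.close();
        }
      }
    }
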
I think your 8.5 seconds for the first Lucene search was with the StringIndex computed per segment?

Mike

On Fri, Dec 11, 2009 at 8:30 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> I've spent the last day working on a multipass order builder, where the order is defined by a Collator and stored in an int-array. Compromising a bit on the "minimal memory at all cost"-approach resulted in a fair boost in speed, but it's still very slow for the first sorted search, compared to Lucene's built-in Collator-based sorter.
>
> One test corpus was 2.9M documents (16GB index), sorting on title (2.1M unique terms). The machine was a fairly new Xeon-something using SSDs. Lucene 2.4 was used.
>
> Memory usage for the multipass order builder on first sort was approximately
>   2.9M (#documents) * 4 bytes (int) +
>   2.1M (#unique terms) * 4 bytes (int) +
>   50MB (size of the sliding window) +
>   2.9M (#documents) * 4 bytes (int)
>   = 12MB + 9MB + 50MB + 12MB ~= 100MB (rounded up).
>
> The 50MB buffer required 3 passes through the terms, which took 47 seconds. Lowering the buffer to 10MB resulted in 2 minutes build time. Upping it to 500MB (all terms in memory) meant 31 seconds.
>
> After that, only the order-array was remembered, which took
>   2.9M (#documents) * 4 bytes (int) = 12MB.
>
> Performing subsequent sorted searches gave the following results:
>   "hest" (top 10 out of 1295 hits): 6-13ms
>   "bog" (top 10 out of 2446835 hits): 600-620ms
>
> For comparison, Lucene's built-in sorter took only 8.5 seconds for the first sorted search, with subsequent sorted search times of
>   "hest" (top 10 out of 1295 hits): 9-18ms
>   "bog" (top 10 out of 2446835 hits): 2000-2900ms
>
> Disclaimer: This was just a quick test. I have not checked properly how the numbers change when the corpus is scaled up, how it works on spinning drives, and so on. All this is very much at the experimental stage.
>
> I'll continue a bit down this road, as some of our setups use nightly batch runs to update the index and thus fit well into the tradeoffs outlined above. Regrettably we're also looking at 5-minute updates for a larger index (10M-document range), where the multipass approach is too expensive.
>
> - Toke
>
> On Thu, 2009-12-10 at 11:45 +0100, Michael McCandless wrote:
>> The big challenge with global sort ords is handling reopen, ie, for apps that need to frequently reopen their index.
>>
>> If you have a static index, then this approach is a good one -- you make a one-time investment, after indexing, to compute your global ords and store them on disk.
>>
>> Then at search time you load the ords in, and you can make a FieldComparator that taps into them (it just has to re-base the per-segment docIDs). No String values are needed...
>>
>> Also, once we've done https://issues.apache.org/jira/browse/LUCENE-1990 (anyone out there wanna take a stab? Nice self-contained utility API needed for Lucene....), that ord array can be very compact for cases where the number of unique Strings is smallish. No reason to spend a full 4 bytes per document...
>>
>> Mike
>>
>> On Wed, Dec 9, 2009 at 6:46 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>> > Thanks for the heads-up, TCK. The Dietz & Sleator article I found at http://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf looks very interesting.
>> >
>> > String sorting in Lucene is indeed fairly expensive and we've experimented with two solutions to this, none of which are golden bullets.
>> >
>> > 1) Store the order, not the content.
>> >
>> > Like the Lucene default sorter, we fetch all the terms for the sort field on the first sort. The docIDs are mapped to the terms and a complete sort is performed (by using a custom Collator: we want to avoid using CollationKeys as they take too much memory. But that's another story). When the sort is finished, we have an integer array where each entry is the relative position for a given document (referenced by docID). We then de-reference the loaded terms so that the memory is freed.
>> >
>> > Determining the order of two documents after the initial build is extremely simple: order[docIDa] - order[docIDb] (or maybe it was the other way around, I forget).
>> >
>> > Cons: Large temporary memory overhead upon the initial sort, though not as much as using CollationKeys. Slower initial sort than Lucene's built-in.
>> > Pros: Very fast subsequent sorts, very low memory overhead when running.
>> > Joker: This approach could be used to post-process a generated index, storing the orders as a file along with the index, then pushing the complete package to searchers. This would lower the memory requirements for the searchers significantly.
>> >
>> > 2) As above, but use a multipass order builder.
>> >
>> > For the first sort, the order must be calculated: a sliding-window sorter is created with a buffer of a fixed size. The window slides through all terms for the sort field and extracts the top X. The order-array is updated for the documents that have one of these terms. The sliding is repeated multiple times, where terms ordered before the last term of the previous iteration are ignored.
>> >
>> > Cons: _Very_ slow (too slow in the current implementation) order build.
>> > Pros: Same as above.
>> > Joker: The buffer size determines memory use vs. order build time.
>> >
>> > The multipass approach looks promising, but requires more work to get to a usable state. Right now it takes minutes to build the order-array for half a million documents, with a buffer size requiring 5 iterations. If I ever get it to work, I'll be sure to share it.
>> >
>> > Regards,
>> > Toke Eskildsen
>> >
>> > ________________________________________
>> > From: TCK [moonwatcher32...@gmail.com]
>> > Sent: 09 December 2009 22:58
>> > To: java-user@lucene.apache.org
>> > Subject: Re: heap memory issues when sorting by a string field
>> >
>> > Thanks Mike for opening this jira ticket and for your patch. Explicitly removing the entry from the WHM definitely does reduce the number of GC cycles taken to free the huge StringIndex objects that get created when doing a sort by a string field.
>> >
>> > But I'm still trying to figure out why it is necessary for Lucene to load into memory the sorted-by field of every single document in the index (which is what makes the StringIndex objects so big) on the first query. Why isn't it sufficient to load the field values for only the documents that match the given search criteria? Perhaps by using an order-maintenance data structure (Dietz & Sleator, or Bender et al.) coupled with a balanced search tree, instead of the simple lookup and order arrays currently used in StringIndex, we can dynamically grow it as needed by successive queries rather than loading everything on the first query.
>> >
>> > Apologies if this is a naive question... I'm a newbie to the Lucene code and can't wait for the new Lucene book to come out :-)
>> >
>> > -TCK
>> >
>> > On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> >
>> >> I've opened LUCENE-2135.
>> >>
>> >> Mike
>> >>
>> >> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> >> > This is a rather disturbing implementation detail of WeakHashMap, that it needs the one extra step (invoking one of its methods) for its weak keys to be reclaimable.
>> >> >
>> >> > Maybe on IndexReader.close(), Lucene should go and evict all entries in the FieldCache associated with that reader. Ie, step through the sub-readers, and if they are truly closed as well (not shared with other readers), evict. I'll open an issue.
>> >> >
>> >> > Even in TCK's code fragment, it's not until the final line is done executing that the cache key even loses all hard references, because it's that line that assigns to sSearcher, replacing the strong reference to the old searcher. Inserting sSearcher = null prior to that would drop the hard reference sooner, but because of this impl detail of WeakHashMap, something would still have to touch the map (eg, a warmup query that hits the field cache) before it's reclaimable.
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-l...@worldware.com> wrote:
>> >> >> Hi -
>> >> >>
>> >> >> If I understand correctly, WeakHashMap does not free the memory for the value (cached data) when the key is nulled, or even when the key is garbage collected.
>> >> >>
>> >> >> It requires one more step: a method on WeakHashMap must be called to allow it to release its hard reference to the cached data. It appears that most methods in WeakHashMap end up calling expungeStaleEntries, which will clear the hard reference. But you have to call some method on the map before the memory is eligible for garbage collection.
>> >> >>
>> >> >> So it requires four stages to free the cached data: null the key; a GC to release the weak reference to the key; a call to some method on the map; then the next GC cycle should free the value.
>> >> >>
>> >> >> So it seems possible that you could end up with double memory usage for a time. If you don't have a GC between the time that you close the old reader and the time you start to load the field cache entry for the next reader, then the key may still be hanging around uncollected.
>> >> >>
>> >> >> At that point, it may run a GC when you allocate the new cache, but that's only the first GC. It can't free the cached data until after the next call to expungeStaleEntries, so for a while you have both caches around.
>> >> >>
>> >> >> This extra usage could cause things to move into tenured space. Could this be causing your problem?
>> >> >>
>> >> >> A workaround would be to cause some method to be called on the WeakHashMap. You don't want to call get(), since that will try to populate the cache. Maybe try putting a small value into the cache, doing a GC, and seeing if your memory drops then.
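
[A small standalone sketch of the WeakHashMap behaviour Tom describes: after the key is nulled and collected, the large value stays strongly referenced by the map's internal entry until some other method call on the map expunges stale entries. System.gc() is only a hint and the freeMemory() numbers are only indicative; the class name is made up.]

    import java.util.WeakHashMap;

    public class WeakHashMapExpungeDemo {
      public static void main(String[] args) throws Exception {
        WeakHashMap<Object, byte[]> cache = new WeakHashMap<Object, byte[]>();

        Object key = new Object();
        cache.put(key, new byte[50 * 1024 * 1024]); // stand-in for a big StringIndex

        key = null;   // stage 1: drop the hard reference to the key
        System.gc();  // stage 2: the weak reference to the key is cleared
        Thread.sleep(100);
        // The 50MB value is still strongly referenced via the map's stale entry.
        System.out.println("after GC:           free=" + Runtime.getRuntime().freeMemory());

        cache.size(); // stage 3: any method on the map expunges stale entries
        System.gc();  // stage 4: now the value itself can be collected
        Thread.sleep(100);
        System.out.println("after expunge + GC: free=" + Runtime.getRuntime().freeMemory());
      }
    }
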
>> >> >>
>> >> >> Tom
>> >> >>
>> >> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >> >>
>> >> >>> Thanks for the response. But I'm definitely calling close() on the old reader and opening a new one (not using reopen). Also, to simplify the analysis, I did my test with a single-threaded requester to eliminate any concurrency issues.
>> >> >>>
>> >> >>> I'm doing:
>> >> >>>   sSearcher.getIndexReader().close();
>> >> >>>   sSearcher.close(); // this actually seems to be a no-op
>> >> >>>   IndexReader newIndexReader = IndexReader.open(newDirectory);
>> >> >>>   sSearcher = new IndexSearcher(newIndexReader);
>> >> >>>
>> >> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are there any plans to replace the HashMaps used for the innerCaches with an actual size-bounded cache with some eviction policy (perhaps EhCache or something)?
>> >> >>>
>> >> >>> Thanks again,
>> >> >>> TCK
>> >> >>>
>> >> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >>>
>> >> >>> > What this sounds like is that you're not really closing your readers even though you think you are. Sorting indeed uses up significant memory when it populates internal caches and keeps it around for later use (which is one of the reasons that warming queries matter). But if you really do close the reader, I'm pretty sure the memory should be GC-able.
>> >> >>> >
>> >> >>> > One thing that trips people up is IndexReader.reopen(). If it returns a reader different than the original, you *must* close the old one. If you don't, the old reader is still hanging around and memory won't be returned... An example from the Javadocs:
>> >> >>> >
>> >> >>> >   IndexReader reader = ...
>> >> >>> >   ...
>> >> >>> >   IndexReader newReader = reader.reopen();
>> >> >>> >   if (newReader != reader) {
>> >> >>> >     ... // reader was reopened
>> >> >>> >     reader.close();
>> >> >>> >   }
>> >> >>> >   reader = newReader;
>> >> >>> >   ...
>> >> >>> >
>> >> >>> > If this is irrelevant, could you post your close/open code?
>> >> >>> >
>> >> >>> > HTH
>> >> >>> >
>> >> >>> > Erick
>> >> >>> >
>> >> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >> >>> >
>> >> >>> > > Hi,
>> >> >>> > > I'm having heap memory issues when I do Lucene queries involving sorting by a string field. Such queries seem to load a lot of data into the heap. Moreover, Lucene seems to hold on to references to this data even after the index reader has been closed and a full GC has been run. Among the consequences: in my generational heap configuration, a lot of memory gets promoted to tenured space each time I close the old index reader and open and query with a new one, and the tenured space eventually gets fragmented, causing a lot of promotion failures and resulting in JVM hangs while the JVM does stop-the-world GCs.
>> >> >>> > >
>> >> >>> > > Does anyone know any workarounds to avoid these memory issues when doing such Lucene queries?
>> >> >>> > >
>> >> >>> > > My profiling showed that even after a full GC, Lucene is holding on to a lot of references to field value data, notably via the FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that the WeakHashMap readerCaches are using unbounded HashMaps as the innerCaches, and I used reflection to replace these innerCaches with dummy empty HashMaps, but I'm still seeing the same behavior. I wondered if anyone has gone through these same issues before and could offer any advice.
>> >> >>> > >
>> >> >>> > > Thanks a lot,
>> >> >>> > > TCK
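
[For reference, a rough, untested sketch of the kind of comparator Mike describes above: sorting against a precomputed top-level order array (globalOrder[docID] = sort position, as in Toke's approach 1), re-basing per-segment docIDs with docBase. It assumes the Lucene 2.9 FieldComparator API; the class and field names are made up.]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldComparator;

    // Sorts documents by a precomputed top-level order array:
    // globalOrder[docID] is the sort position of that document.
    public class GlobalOrderComparator extends FieldComparator {

      private final int[] globalOrder; // one entry per top-level docID
      private final int[] slotOrders;  // order values for the queue slots
      private int docBase;             // docID offset of the current segment
      private int bottom;              // order value of the weakest queued hit

      public GlobalOrderComparator(int[] globalOrder, int numHits) {
        this.globalOrder = globalOrder;
        this.slotOrders = new int[numHits];
      }

      public int compare(int slot1, int slot2) {
        return slotOrders[slot1] - slotOrders[slot2];
      }

      public int compareBottom(int doc) {
        return bottom - globalOrder[docBase + doc];
      }

      public void copy(int slot, int doc) {
        slotOrders[slot] = globalOrder[docBase + doc];
      }

      public void setBottom(int slot) {
        bottom = slotOrders[slot];
      }

      public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase; // re-base per-segment docIDs to top-level docIDs
      }

      public Comparable value(int slot) {
        return Integer.valueOf(slotOrders[slot]);
      }
    }

The subtraction in compare() is the same trick as Toke's order[docIDa] - order[docIDb]; it cannot overflow because order positions are non-negative ints bounded by maxDoc.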