How long does Lucene take to build the ords for the top-level reader? You should be able to just time FieldCache.getStringIndex(topLevelReader).
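
[A minimal sketch of the timing Mike suggests, assuming the Lucene 2.4-era API; the class name and the sort field "title" (taken from Toke's test below) are illustrative only:]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class TimeStringIndex {
      public static void main(String[] args) throws Exception {
        // Open the top-level (multi-segment) reader for the index directory given as args[0].
        IndexReader reader = IndexReader.open(args[0]);
        try {
          long start = System.currentTimeMillis();
          // Builds the order/lookup arrays for the whole reader in one go.
          FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(reader, "title");
          long elapsed = System.currentTimeMillis() - start;
          System.out.println("getStringIndex: " + elapsed + " ms, lookup size "
              + si.lookup.length + ", " + reader.maxDoc() + " docs");
        } finally {
          reader.close();
        }
      }
    }
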
I think your 8.5 seconds for the first Lucene search was with the StringIndex computed per segment?

Mike

On Fri, Dec 11, 2009 at 8:30 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> I've spent the last day working on a multipass order builder, where the order is defined by a Collator and stored in an int-array. Compromising a bit on the "minimal memory at all cost"-approach resulted in a fair boost in speed, but it's still very slow for the first sorted search, compared to Lucene's built-in Collator-based sorter.
>
> One test corpus was 2.9M documents (16GB index), sorting on title (2.1M unique terms). The machine was a fairly new Xeon-something using SSDs. Lucene 2.4 was used.
>
> Memory usage for the multipass order builder on first sort was approximately
>   2.9M (#documents) * 4 bytes (int) +
>   2.1M (#unique terms) * 4 bytes (int) +
>   50MB (size of the sliding window) +
>   2.9M (#documents) * 4 bytes (int)
>   = 12MB + 9MB + 50MB + 12MB ~= 100MB (rounded up).
>
> The 50MB buffer required 3 passes through the terms, which took 47 seconds. Lowering the buffer to 10MB resulted in 2 minutes build time. Upping it to 500MB (all terms in memory) meant 31 seconds.
>
> After that, only the order-array was remembered, which took
>   2.9M (#documents) * 4 bytes (int) = 12MB.
>
> Performing subsequent sorted searches gave the following results:
>   "hest" (top 10 out of 1295 hits): 6-13ms
>   "bog" (top 10 out of 2446835 hits): 600-620ms
>
> For comparison, Lucene's built-in sorter took only 8.5 seconds for the first sorted search, with subsequent sorted search times of
>   "hest" (top 10 out of 1295 hits): 9-18ms
>   "bog" (top 10 out of 2446835 hits): 2000-2900ms
>
> Disclaimer: This was just a quick test. I have not checked properly how the numbers change when the corpus is scaled up, how it works on spinning drives, and so on. All this is very much at the experimental stage.
>
> I'll continue a bit down this road, as some of our setups use nightly batch runs to update the index and thus fit well into the tradeoffs outlined above. Regrettably we're also looking at 5-minute updates for a larger index (10M-document range), where the multipass approach is too expensive.
>
> - Toke
>
> On Thu, 2009-12-10 at 11:45 +0100, Michael McCandless wrote:
>> The big challenge with global sort ords is handling reopen, ie, for apps that need to frequently reopen their index.
>>
>> If you have a static index, then this approach is a good one -- you make a one-time investment, after indexing, to compute your global ords and store them on disk.
>>
>> Then at search time you load the ords in, and you can make a FieldComparator that taps into them (it just has to re-base the per-segment docIDs). No String values are needed...
>>
>> Also, once we've done https://issues.apache.org/jira/browse/LUCENE-1990 (anyone out there wanna take a stab? Nice self-contained utility API needed for Lucene....), that ord array can be very compact for cases where the number of unique Strings is smallish. No reason to spend a full 4 bytes per document...
>>
>> Mike
>>
>> On Wed, Dec 9, 2009 at 6:46 PM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
>> > Thanks for the heads-up, TCK. The Dietz & Sleator article I found at http://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf looks very interesting.
>> >
>> > String sorting in Lucene is indeed fairly expensive and we've experimented with two solutions to this, none of which are golden bullets.
>> >
>> > 1) Store the order, not the content.
>> >
>> > Like the Lucene default sorter, we fetch all the terms for the sort field on the first sort. The docIDs are mapped to the terms and a complete sort is performed (by using a custom Collator: we want to avoid using CollationKeys as they take too much memory. But that's another story). When the sort is finished, we have an integer array where each entry is the relative position for a given document (referenced by docID). We then de-reference the loaded terms so that the memory is freed.
>> >
>> > Determining the order of two documents after the initial build is extremely simple: order[docIDa] - order[docIDb] (or maybe it was the other way around, I forget).
>> >
>> > Cons: Large temporary memory overhead upon the initial sort, though not as much as using CollationKeys. Slower initial sort than Lucene's built-in.
>> > Pros: Very fast subsequent sorts, very low memory overhead when running.
>> > Joker: This approach could be used to post-process a generated index, storing the orders as a file along with the index, then pushing the complete package to searchers. This would lower the memory requirements for the searchers significantly.
>> >
>> > 2) As above, but use a multipass order builder.
>> >
>> > For the first sort, the order must be calculated: a sliding-window sorter is created with a buffer of a fixed size. The window slides through all terms for the sort field and extracts the top X. The order-array is updated for the documents that have one of these terms. The sliding is repeated multiple times, where terms ordered before the last term of the previous iteration are ignored.
>> >
>> > Cons: _Very_ slow (too slow in the current implementation) order build.
>> > Pros: Same as above.
>> > Joker: The buffer size determines memory use vs. order build time.
>> >
>> > The multipass approach looks promising, but requires more work to get to a usable state. Right now it takes minutes to build the order-array for half a million documents, with a buffer size requiring 5 iterations. If I ever get it to work, I'll be sure to share it.
>> >
>> > Regards,
>> > Toke Eskildsen
>> >
>> > ________________________________________
>> > From: TCK [moonwatcher32...@gmail.com]
>> > Sent: 09 December 2009 22:58
>> > To: java-user@lucene.apache.org
>> > Subject: Re: heap memory issues when sorting by a string field
>> >
>> > Thanks Mike for opening this jira ticket and for your patch. Explicitly removing the entry from the WHM definitely does reduce the number of GC cycles taken to free the huge StringIndex objects that get created when doing a sort by a string field.
>> >
>> > But I'm still trying to figure out why it is necessary for Lucene to load into memory the sorted-by field of every single document in the index (which is what makes the StringIndex objects so big) on the first query. Why isn't it sufficient to load the field values for only the documents that match the given search criteria? Perhaps by using an order-maintenance data structure (Dietz & Sleator, or Bender et al.) coupled with a balanced search tree, instead of the simple lookup and order arrays currently used in StringIndex, we can dynamically grow it as needed by successive queries rather than loading everything on the first query.
>> >
>> > Apologies if this is a naive question... I'm a newbie to the Lucene code and can't wait for the new Lucene book to come out :-)
>> >
>> > -TCK
>> >
>> > On Tue, Dec 8, 2009 at 5:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> >
>> >> I've opened LUCENE-2135.
>> >>
>> >> Mike
>> >>
>> >> On Tue, Dec 8, 2009 at 5:36 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>> >> > This is a rather disturbing implementation detail of WeakHashMap, that it needs the one extra step (invoking one of its methods) for its weak keys to be reclaimable.
>> >> >
>> >> > Maybe on IndexReader.close(), Lucene should go and evict all entries in the FieldCache associated with that reader. Ie, step through the sub-readers, and if they are truly closed as well (not shared with other readers), evict. I'll open an issue.
>> >> >
>> >> > Even in TCK's code fragment, it's not until the final line is done executing that the cache key even loses all hard references, because it's that line that assigns to sSearcher, replacing the strong reference to the old searcher. Inserting sSearcher = null prior to that would drop the hard reference sooner, but because of this impl detail of WeakHashMap, something would still have to touch the map (eg, a warmup query that hits the field cache) before it's reclaimable.
>> >> >
>> >> > Mike
>> >> >
>> >> > On Mon, Dec 7, 2009 at 7:38 PM, Tom Hill <solr-l...@worldware.com> wrote:
>> >> >> Hi -
>> >> >>
>> >> >> If I understand correctly, WeakHashMap does not free the memory for the value (cached data) when the key is nulled, or even when the key is garbage collected.
>> >> >>
>> >> >> It requires one more step: a method on WeakHashMap must be called to allow it to release its hard reference to the cached data. It appears that most methods in WeakHashMap end up calling expungeStaleEntries, which will clear the hard reference. But you have to call some method on the map before the memory is eligible for garbage collection.
>> >> >>
>> >> >> So it requires four stages to free the cached data: null the key; a GC to release the weak reference to the key; a call to some method on the map; then the next GC cycle should free the value.
>> >> >>
>> >> >> So it seems possible that you could end up with double memory usage for a time. If you don't have a GC between the time that you close the old reader and the time you start to load the field cache entry for the next reader, then the key may still be hanging around uncollected.
>> >> >>
>> >> >> At that point, it may run a GC when you allocate the new cache, but that's only the first GC. It can't free the cached data until after the next call to expungeStaleEntries, so for a while you have both caches around.
>> >> >>
>> >> >> This extra usage could cause things to move into tenured space. Could this be causing your problem?
>> >> >>
>> >> >> A workaround would be to cause some method to be called on the WeakHashMap. You don't want to call get(), since that will try to populate the cache. Maybe try putting a small value into the cache, doing a GC, and seeing if your memory drops then.
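
[A small standalone sketch of the WeakHashMap behaviour Tom describes: after the key is nulled and collected, the large value stays strongly referenced by the map's internal entry until some other method call on the map expunges stale entries. System.gc() is only a hint and the freeMemory() numbers are only indicative; the class name is made up.]

    import java.util.WeakHashMap;

    public class WeakHashMapExpungeDemo {
      public static void main(String[] args) throws Exception {
        WeakHashMap<Object, byte[]> cache = new WeakHashMap<Object, byte[]>();

        Object key = new Object();
        cache.put(key, new byte[50 * 1024 * 1024]); // stand-in for a big StringIndex

        key = null;   // stage 1: drop the hard reference to the key
        System.gc();  // stage 2: the weak reference to the key is cleared
        Thread.sleep(100);
        // The 50MB value is still strongly referenced via the map's stale entry.
        System.out.println("after GC:           free=" + Runtime.getRuntime().freeMemory());

        cache.size(); // stage 3: any method on the map expunges stale entries
        System.gc();  // stage 4: now the value itself can be collected
        Thread.sleep(100);
        System.out.println("after expunge + GC: free=" + Runtime.getRuntime().freeMemory());
      }
    }
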
>> >> >>
>> >> >> Tom
>> >> >>
>> >> >> On Mon, Dec 7, 2009 at 1:48 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >> >>
>> >> >>> Thanks for the response. But I'm definitely calling close() on the old reader and opening a new one (not using reopen). Also, to simplify the analysis, I did my test with a single-threaded requester to eliminate any concurrency issues.
>> >> >>>
>> >> >>> I'm doing:
>> >> >>>   sSearcher.getIndexReader().close();
>> >> >>>   sSearcher.close(); // this actually seems to be a no-op
>> >> >>>   IndexReader newIndexReader = IndexReader.open(newDirectory);
>> >> >>>   sSearcher = new IndexSearcher(newIndexReader);
>> >> >>>
>> >> >>> Btw, isn't it bad practice anyway to have an unbounded cache? Are there any plans to replace the HashMaps used for the innerCaches with an actual size-bounded cache with some eviction policy (perhaps EhCache or something)?
>> >> >>>
>> >> >>> Thanks again,
>> >> >>> TCK
>> >> >>>
>> >> >>> On Mon, Dec 7, 2009 at 4:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> >>>
>> >> >>> > What this sounds like is that you're not really closing your readers even though you think you are. Sorting indeed uses up significant memory when it populates internal caches and keeps it around for later use (which is one of the reasons that warming queries matter). But if you really do close the reader, I'm pretty sure the memory should be GC-able.
>> >> >>> >
>> >> >>> > One thing that trips people up is IndexReader.reopen(). If it returns a reader different than the original, you *must* close the old one. If you don't, the old reader is still hanging around and memory won't be returned... An example from the Javadocs:
>> >> >>> >
>> >> >>> >   IndexReader reader = ...
>> >> >>> >   ...
>> >> >>> >   IndexReader newReader = reader.reopen();
>> >> >>> >   if (newReader != reader) {
>> >> >>> >     ... // reader was reopened
>> >> >>> >     reader.close();
>> >> >>> >   }
>> >> >>> >   reader = newReader;
>> >> >>> >   ...
>> >> >>> >
>> >> >>> > If this is irrelevant, could you post your close/open code?
>> >> >>> >
>> >> >>> > HTH
>> >> >>> >
>> >> >>> > Erick
>> >> >>> >
>> >> >>> > On Mon, Dec 7, 2009 at 4:27 PM, TCK <moonwatcher32...@gmail.com> wrote:
>> >> >>> >
>> >> >>> > > Hi,
>> >> >>> > > I'm having heap memory issues when I do Lucene queries involving sorting by a string field. Such queries seem to load a lot of data into the heap. Moreover, Lucene seems to hold on to references to this data even after the index reader has been closed and a full GC has been run. Among the consequences: in my generational heap configuration, a lot of memory gets promoted to tenured space each time I close the old index reader and open and query with a new one, and the tenured space eventually gets fragmented, causing a lot of promotion failures and resulting in JVM hangs while the JVM does stop-the-world GCs.
>> >> >>> > >
>> >> >>> > > Does anyone know any workarounds to avoid these memory issues when doing such Lucene queries?
>> >> >>> > >
>> >> >>> > > My profiling showed that even after a full GC, Lucene is holding on to a lot of references to field value data, notably via the FieldCacheImpl/ExtendedFieldCacheImpl. I noticed that the WeakHashMap readerCaches are using unbounded HashMaps as the innerCaches, and I used reflection to replace these innerCaches with dummy empty HashMaps, but I'm still seeing the same behavior. I wondered if anyone has gone through these same issues before and could offer any advice.
>> >> >>> > >
>> >> >>> > > Thanks a lot,
>> >> >>> > > TCK
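
[For reference, a rough, untested sketch of the kind of comparator Mike describes above: sorting against a precomputed top-level order array (globalOrder[docID] = sort position, as in Toke's approach 1), re-basing per-segment docIDs with docBase. It assumes the Lucene 2.9 FieldComparator API; the class and field names are made up.]

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldComparator;

    // Sorts documents by a precomputed top-level order array:
    // globalOrder[docID] is the sort position of that document.
    public class GlobalOrderComparator extends FieldComparator {

      private final int[] globalOrder; // one entry per top-level docID
      private final int[] slotOrders;  // order values for the queue slots
      private int docBase;             // docID offset of the current segment
      private int bottom;              // order value of the weakest queued hit

      public GlobalOrderComparator(int[] globalOrder, int numHits) {
        this.globalOrder = globalOrder;
        this.slotOrders = new int[numHits];
      }

      public int compare(int slot1, int slot2) {
        return slotOrders[slot1] - slotOrders[slot2];
      }

      public int compareBottom(int doc) {
        return bottom - globalOrder[docBase + doc];
      }

      public void copy(int slot, int doc) {
        slotOrders[slot] = globalOrder[docBase + doc];
      }

      public void setBottom(int slot) {
        bottom = slotOrders[slot];
      }

      public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase; // re-base per-segment docIDs to top-level docIDs
      }

      public Comparable value(int slot) {
        return Integer.valueOf(slotOrders[slot]);
      }
    }

The subtraction in compare() is the same trick as Toke's order[docIDa] - order[docIDb]; it cannot overflow because order positions are non-negative ints bounded by maxDoc.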