Nice!

On Tue, Jun 13, 2017 at 11:12 PM Tom Hirschfeld <tomhirschf...@gmail.com> wrote:

> Hey All,
>
> I was able to solve my problem a few weeks ago and wanted to update you
> all. The root issue was the caching mechanism in the
> makeDistanceValueSource() method in the Lucene spatial module: documents
> were being pulled into the cache and never expired. To address this
> issue, we upgraded our application to Lucene 6.5.1 and used
> LatLonDocValuesField for indexing/searching. Heap use is back down to
> ~500 MB for the whole app under load, and the node can support about
> 5k qps at a p95 of 9 ms, which is a great improvement over the RPT
> (RecursivePrefixTree) strategy we had been using. Once again, thanks for
> your help.
>
> Best,
> Tom Hirschfeld
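A minimal sketch of the LatLonDocValuesField approach Tom describes, against the Lucene 6.5.x APIs (on 6.x these classes ship in the sandbox module). The field name "location", the 5 km radius, and the helper names are illustrative assumptions, not details from the thread:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.LatLonDocValuesField;
    import org.apache.lucene.document.LatLonPoint;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;

    // Indexing: pair a BKD-indexed point (fast radius filtering) with a
    // doc-values point (distance sorting). Doc values live on disk / in
    // the OS page cache instead of being cached on the Java heap.
    static Document geoDoc(double lat, double lon) {
      Document doc = new Document();
      doc.add(new LatLonPoint("location", lat, lon));
      doc.add(new LatLonDocValuesField("location", lat, lon));
      return doc;
    }

    // Reverse geocoding: filter to a radius around the query point, then
    // sort nearest-first via the doc values.
    static TopDocs nearest(IndexSearcher searcher, double lat, double lon)
        throws IOException {
      Query q = LatLonPoint.newDistanceQuery("location", lat, lon, 5_000); // 5 km
      Sort byDistance = new Sort(
          LatLonDocValuesField.newDistanceSort("location", lat, lon));
      return searcher.search(q, 10, byDistance);
    }

Sorting through doc values is what keeps the heap small here: nothing is cached per document on the heap, unlike the ValueSource cache Tom hit.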
> On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > Hi,
> >
> > Are you sure that the term index is the problem? Even with huge
> > indexes you never need 65 gigs of heap! That's impossible. Are you
> > sure that your problem is not something else?
> >
> > - Too large a heap? Heaps greater than 31 gigs are bad by default
> > (you lose compressed object pointers). Lucene needs only a little
> > heap, even when you have large indexes with many terms! You can
> > easily run a query on a 100-gig index with less than 4 gigs of heap.
> > The memory used by Lucene is filesystem cache through MMapDirectory,
> > so you need lots of that free, not heap space. Too-large heaps are
> > counterproductive.
> >
> > - Could it be that you sort on one of those fields without DocValues
> > enabled? That loads everything onto the heap and you are in trouble.
> >
> > FYI, since Lucene 5 you can get the heap usage of many Lucene
> > components using the Accountable interface. E.g., just call
> > ramBytesUsed() on your IndexReader. You can also dive into all
> > components, starting from the IndexReader at the top level, to see
> > which one is using the heap: just get the whole tree as a
> > hierarchical printout via the Accountable interface (see the sketch
> > at the end of this thread).
> >
> > We need more information to help you.
> > Uwe
> >
> > On 18 May 2017 at 12:56:14 CEST, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>
> >> That sounds like a fun amount of terms!
> >>
> >> Note that Lucene does not load all terms into memory; only the
> >> "prefix trie", stored as an FST
> >> (http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html),
> >> mapping term prefixes to on-disk blocks of terms. FSTs are very
> >> compact data structures, effectively implementing
> >> SortedMap<String,T>, so it's surprising you need 65 GB of heap for
> >> the FSTs.
> >>
> >> Anyway, with BlockTreeTermsWriter/Reader, the equivalent of the old
> >> termInfosIndexDivisor is to change the allowed on-disk block size
> >> (defaults to 25-48 terms per block) to something larger. To do this,
> >> make your own subclass of FilterCodec, passing the current default
> >> codec to wrap, and override the postingsFormat method to return a
> >> "new Lucene50PostingsFormat(...)" passing a larger min and max block
> >> size (a sketch follows below). This applies at indexing time, so you
> >> need to reindex to see your FSTs get smaller.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
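A minimal sketch of the FilterCodec subclass Mike describes, using the Lucene 6.x codec API. The class name "LargeBlockCodec" and the 100/200 block sizes are illustrative assumptions, not tuned values:

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

    // Wraps the default codec but swaps in a postings format with larger
    // on-disk term blocks, so the in-heap FST index has fewer entries.
    public class LargeBlockCodec extends FilterCodec {
      private final PostingsFormat postings =
          new Lucene50PostingsFormat(100, 200); // min/max terms per block
                                                // (defaults are 25/48)

      public LargeBlockCodec() {
        super("LargeBlockCodec", Codec.getDefault());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }

Set it at index time with IndexWriterConfig.setCodec(new LargeBlockCodec()), and register the class via SPI (a META-INF/services/org.apache.lucene.codecs.Codec file) so the codec can be resolved by name when the index is opened for reading.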
> >> On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld
> >> <tomhirschf...@gmail.com> wrote:
> >>
> >>> Hey!
> >>>
> >>> I am working on a Lucene-based service for reverse geocoding. We
> >>> have a large index with lots of unique terms (550 million), and it
> >>> appears that we're running into memory issues on our leaf servers
> >>> because the term dictionary for the entire index is being loaded
> >>> into heap space. If we allocate > 65 GB of heap space, our queries
> >>> return relatively quickly (tens to hundreds of ms), but if we drop
> >>> below ~65 GB of heap on the leaf nodes, query time degrades
> >>> dramatically, quickly hitting 20+ seconds (our test harness cuts
> >>> off at 20 s).
> >>>
> >>> I did some research and found that in past versions of Lucene one
> >>> could split the loading of the terms dictionary using the
> >>> 'termInfosIndexDivisor' option in the DirectoryReader class. That
> >>> option was deprecated in Lucene 5.0.0
> >>> <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html>
> >>> in favor of using codecs to achieve similar functionality. Looking
> >>> at the available experimental codecs, I see BlockTreeTermsWriter
> >>> <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState, org.apache.lucene.codecs.PostingsWriterBase, int, int)>,
> >>> which seems like it could be used for a similar purpose: breaking
> >>> down the term dictionary so that we don't load the whole thing into
> >>> heap space.
> >>>
> >>> Has anyone run into this problem before and found an effective
> >>> solution? Does changing the codec seem appropriate for this issue?
> >>> If so, how do I go about loading an alternative codec and
> >>> configuring it to my needs? I'm having trouble finding docs/examples
> >>> of how this is used in the real world, so even if you point me to a
> >>> repo or docs somewhere I'd appreciate it. Thanks!
> >>>
> >>> Best,
> >>> Tom Hirschfeld
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, 28357 Bremen
> > https://www.thetaphi.de

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
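Finally, the Accountable walk Uwe suggests, as a minimal sketch. DirectoryReader itself does not implement Accountable, but its per-segment leaves (SegmentReader/CodecReader) do, which is why the code iterates the leaves; the instanceof guard skips any wrapped readers that don't:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.util.Accountable;
    import org.apache.lucene.util.Accountables;

    // Prints each segment reader's heap usage, plus a hierarchical
    // breakdown of its components (terms index, norms, doc values, ...).
    static void dumpHeapUsage(DirectoryReader reader) {
      for (LeafReaderContext ctx : reader.leaves()) {
        if (ctx.reader() instanceof Accountable) {
          Accountable segment = (Accountable) ctx.reader();
          System.out.println(ctx.reader() + ": "
              + segment.ramBytesUsed() + " bytes on heap");
          System.out.println(Accountables.toString(segment));
        }
      }
    }

Running this before and after a codec or schema change makes it easy to confirm where the heap is actually going, per Uwe's point that the term index is rarely the real culprit.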