Nice!

On Tue, Jun 13, 2017 at 11:12 PM Tom Hirschfeld <tomhirschf...@gmail.com> wrote:

> Hey All,
>
> I was able to solve my problem a few weeks ago and wanted to update you
> all. The root issue was the caching mechanism in the
> makeDistanceValueSource() method in the Lucene spatial module: documents
> were being pulled into the cache and never expired. To address this
> issue, we upgraded our application to Lucene 6.5.1 and used
> LatLonDocValuesField for indexing/searching. Heap use is back down to
> ~500 MB for the whole app under load, and the node can support about
> 5k qps at a p95 of 9 ms, which is a great improvement over the RPT
> (RecursivePrefixTree) strategy we had been using. Once again, thanks for
> your help.
>
> Best,
> Tom Hirschfeld
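A minimal sketch of the LatLonDocValuesField approach Tom describes, against the Lucene 6.5.x APIs (on 6.x these classes ship in the sandbox module). The field name "location", the 5 km radius, and the helper names are illustrative assumptions, not details from the thread:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.LatLonDocValuesField;
    import org.apache.lucene.document.LatLonPoint;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.TopDocs;

    // Indexing: pair a BKD-indexed point (fast radius filtering) with a
    // doc-values point (distance sorting). Doc values live on disk / in
    // the OS page cache instead of being cached on the Java heap.
    static Document geoDoc(double lat, double lon) {
      Document doc = new Document();
      doc.add(new LatLonPoint("location", lat, lon));
      doc.add(new LatLonDocValuesField("location", lat, lon));
      return doc;
    }

    // Reverse geocoding: filter to a radius around the query point, then
    // sort nearest-first via the doc values.
    static TopDocs nearest(IndexSearcher searcher, double lat, double lon)
        throws IOException {
      Query q = LatLonPoint.newDistanceQuery("location", lat, lon, 5_000); // 5 km
      Sort byDistance = new Sort(
          LatLonDocValuesField.newDistanceSort("location", lat, lon));
      return searcher.search(q, 10, byDistance);
    }

Sorting through doc values is what keeps the heap small here: nothing is cached per document on the heap, unlike the ValueSource cache Tom hit.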
> On Thu, May 18, 2017 at 4:22 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>
> > Hi,
> >
> > Are you sure that the term index is the problem? Even with huge
> > indexes you never need 65 gigs of heap! That's impossible. Are you
> > sure that your problem is not something else?
> >
> > - Too large a heap? Heaps greater than 31 gigs are bad by default
> > (you lose compressed object pointers). Lucene needs only a little
> > heap, even when you have large indexes with many terms! You can
> > easily run a query on a 100-gig index with less than 4 gigs of heap.
> > The memory used by Lucene is filesystem cache through MMapDirectory,
> > so you need lots of that free, not heap space. Too-large heaps are
> > counterproductive.
> >
> > - Could it be that you sort on one of those fields without DocValues
> > enabled? That loads everything onto the heap and you are in trouble.
> >
> > FYI, since Lucene 5 you can get the heap usage of many Lucene
> > components using the Accountable interface. E.g., just call
> > ramBytesUsed() on your IndexReader. You can also dive into all
> > components, starting from the IndexReader at the top level, to see
> > which one is using the heap: just get the whole tree as a
> > hierarchical printout via the Accountable interface (see the sketch
> > at the end of this thread).
> >
> > We need more information to help you.
> > Uwe
> >
> > On 18 May 2017 at 12:56:14 CEST, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>
> >> That sounds like a fun amount of terms!
> >>
> >> Note that Lucene does not load all terms into memory; only the
> >> "prefix trie", stored as an FST
> >> (http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html),
> >> mapping term prefixes to on-disk blocks of terms. FSTs are very
> >> compact data structures, effectively implementing
> >> SortedMap<String,T>, so it's surprising you need 65 GB of heap for
> >> the FSTs.
> >>
> >> Anyway, with BlockTreeTermsWriter/Reader, the equivalent of the old
> >> termInfosIndexDivisor is to change the allowed on-disk block size
> >> (defaults to 25-48 terms per block) to something larger. To do this,
> >> make your own subclass of FilterCodec, passing the current default
> >> codec to wrap, and override the postingsFormat method to return a
> >> "new Lucene50PostingsFormat(...)" passing a larger min and max block
> >> size (a sketch follows below). This applies at indexing time, so you
> >> need to reindex to see your FSTs get smaller.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
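A minimal sketch of the FilterCodec subclass Mike describes, using the Lucene 6.x codec API. The class name "LargeBlockCodec" and the 100/200 block sizes are illustrative assumptions, not tuned values:

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat;

    // Wraps the default codec but swaps in a postings format with larger
    // on-disk term blocks, so the in-heap FST index has fewer entries.
    public class LargeBlockCodec extends FilterCodec {
      private final PostingsFormat postings =
          new Lucene50PostingsFormat(100, 200); // min/max terms per block
                                                // (defaults are 25/48)

      public LargeBlockCodec() {
        super("LargeBlockCodec", Codec.getDefault());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }

Set it at index time with IndexWriterConfig.setCodec(new LargeBlockCodec()), and register the class via SPI (a META-INF/services/org.apache.lucene.codecs.Codec file) so the codec can be resolved by name when the index is opened for reading.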
> >> On Wed, May 17, 2017 at 5:26 PM, Tom Hirschfeld
> >> <tomhirschf...@gmail.com> wrote:
> >>
> >>> Hey!
> >>>
> >>> I am working on a Lucene-based service for reverse geocoding. We
> >>> have a large index with lots of unique terms (550 million), and it
> >>> appears that we're running into memory issues on our leaf servers
> >>> because the term dictionary for the entire index is being loaded
> >>> into heap space. If we allocate > 65 GB of heap space, our queries
> >>> return relatively quickly (tens to hundreds of ms), but if we drop
> >>> below ~65 GB of heap on the leaf nodes, query time degrades
> >>> dramatically, quickly hitting 20+ seconds (our test harness cuts
> >>> off at 20 s).
> >>>
> >>> I did some research and found that in past versions of Lucene one
> >>> could split the loading of the terms dictionary using the
> >>> 'termInfosIndexDivisor' option in the DirectoryReader class. That
> >>> option was deprecated in Lucene 5.0.0
> >>> <https://abi-laboratory.pro/java/tracker/changelog/lucene/5.0.0/log.html>
> >>> in favor of using codecs to achieve similar functionality. Looking
> >>> at the available experimental codecs, I see BlockTreeTermsWriter
> >>> <https://lucene.apache.org/core/5_3_1/core/org/apache/lucene/codecs/blocktree/BlockTreeTermsWriter.html#BlockTreeTermsWriter(org.apache.lucene.index.SegmentWriteState, org.apache.lucene.codecs.PostingsWriterBase, int, int)>,
> >>> which seems like it could be used for a similar purpose: breaking
> >>> down the term dictionary so that we don't load the whole thing into
> >>> heap space.
> >>>
> >>> Has anyone run into this problem before and found an effective
> >>> solution? Does changing the codec seem appropriate for this issue?
> >>> If so, how do I go about loading an alternative codec and
> >>> configuring it to my needs? I'm having trouble finding docs/examples
> >>> of how this is used in the real world, so even if you point me to a
> >>> repo or docs somewhere I'd appreciate it. Thanks!
> >>>
> >>> Best,
> >>> Tom Hirschfeld
> >
> > --
> > Uwe Schindler
> > Achterdiek 19, 28357 Bremen
> > https://www.thetaphi.de

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
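Finally, the Accountable walk Uwe suggests, as a minimal sketch. DirectoryReader itself does not implement Accountable, but its per-segment leaves (SegmentReader/CodecReader) do, which is why the code iterates the leaves; the instanceof guard skips any wrapped readers that don't:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.util.Accountable;
    import org.apache.lucene.util.Accountables;

    // Prints each segment reader's heap usage, plus a hierarchical
    // breakdown of its components (terms index, norms, doc values, ...).
    static void dumpHeapUsage(DirectoryReader reader) {
      for (LeafReaderContext ctx : reader.leaves()) {
        if (ctx.reader() instanceof Accountable) {
          Accountable segment = (Accountable) ctx.reader();
          System.out.println(ctx.reader() + ": "
              + segment.ramBytesUsed() + " bytes on heap");
          System.out.println(Accountables.toString(segment));
        }
      }
    }

Running this before and after a codec or schema change makes it easy to confirm where the heap is actually going, per Uwe's point that the term index is rarely the real culprit.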