Here's the link: https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing
I'm indexing, let's say, 11 unique fields per document. Also, the NRT reader is opened continually, and "regular" searches use that one. But a special kind of feature allows searching a particular point in time (those point-in-time views get cleaned out based on some other logic), which requires opening a non-NRT reader just to service such search requests - in my understanding, no segment readers for this reader can be shared with the NRT reader's pool... or am I off here?

This seems evident from another heap dump fragment that shows a full new set of segment readers attached to that "temporary" reader: https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
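For concreteness, the point-in-time path looks roughly like the sketch below (simplified and untested, against 4.6.x; the index path and the choice of commit are made up for illustration). The openIfChanged call at the end is not something we do today - my understanding is that it can share SegmentReaders between two commit-point readers, though not with the IndexWriter's NRT pool, so please correct me if that's wrong:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PointInTimeSearchSketch {

  public static void main(String[] args) throws IOException {
    Directory dir = FSDirectory.open(new File("/path/to/index")); // made-up path

    // Only useful if the IndexDeletionPolicy keeps old commits around;
    // the default KeepOnlyLastCommitDeletionPolicy exposes just the latest one.
    List<IndexCommit> commits = DirectoryReader.listCommits(dir);
    IndexCommit pointInTime = commits.get(0); // pick whichever commit the feature needs

    // A plain (non-NRT) reader pinned to that commit. Its SegmentReaders are
    // separate from the ones pooled by IndexWriter for the NRT reader, which
    // is what the second heap dump appears to show.
    DirectoryReader commitReader = DirectoryReader.open(pointInTime);
    IndexSearcher searcher = new IndexSearcher(commitReader);
    // ... run the point-in-time queries with 'searcher' ...

    // Reopening the same non-NRT reader at a different commit should share
    // SegmentReaders for segments the two commits have in common. Note this
    // is not allowed on a reader obtained from IndexWriter (the NRT reader).
    IndexCommit another = commits.get(commits.size() - 1);
    DirectoryReader reopened = DirectoryReader.openIfChanged(commitReader, another);
    if (reopened != null) {
      commitReader.close();
      commitReader = reopened;
    }

    commitReader.close();
    dir.close();
  }
}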
On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Hmm, the screen shot didn't make it ... can you post a link?
>
> If you are using an NRT reader, then when a new one is opened, it won't
> open new SegmentReaders for all segments, just for newly flushed/merged
> segments since the last reader was opened. So for your N commit points
> that you have readers open for, they will be sharing SegmentReaders for
> segments they have in common.
>
> How many unique fields are you adding?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
> > Mike,
> >
> > Here's the screenshot; not sure if it will go through as an attachment
> > though - if not, I'll post it as a link. Please ignore the altered
> > package names, since Lucene is shaded in as part of our build process.
> >
> > Some more context about the use case. Yes, the terms are pretty much
> > unique; the schema for the data set is actually borrowed from here:
> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> > set, with a couple of other fields added by us. The values for the
> > fields are generated almost randomly, though some string fields are
> > picked at random from a fixed dictionary.
> >
> > Also, this type of heap footprint might be tolerable if it stayed
> > relatively constant throughout the system's life cycle (of course, given
> > that the index set stays more or less static). However, what happens
> > here is that one IndexReader reference is maintained by ReaderManager as
> > an NRT reader. But we would also like to support the ability to execute
> > searches against specific index commit points, ideally in parallel. As
> > you might imagine, as soon as a new DirectoryReader is opened at a given
> > commit, a whole new set of SegmentReader instances is created and
> > populated, effectively doubling the already large heap usage... if there
> > was a way to somehow reuse readers for unchanged segments already pooled
> > by IndexWriter, that would help tremendously here. But I don't think
> > there's a way to link up the two sets, at least not in the Lucene
> > version we are using (4.6.1) - is this correct?
> >
> >
> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>
> >> This is surprising: unless you have an excessive number of unique
> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> >>
> >> But you only have 12 unique fields?
> >>
> >> Can you post screen shots of the heap usage?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <vfunst...@gmail.com>
> >> wrote:
> >> > This is a follow-up to the earlier thread I started to understand
> >> > memory usage patterns of SegmentReader instances, but I decided to
> >> > create a separate post since this issue is much more serious than the
> >> > heap overhead created by use of stored field compression.
> >> >
> >> > Here is the use case, once again. The index totals around 300M
> >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float
> >> > fields, which are both indexed and stored. It is split into 4 shards,
> >> > which are basically separate indices... if that matters. After the
> >> > index is populated (but not optimized, since we don't do that), the
> >> > overall heap usage taken up by Lucene is over 1 GB, much of which is
> >> > taken up by instances of BlockTreeTermsReader. For instance, for the
> >> > largest segment in one such index, the retained heap size of the
> >> > internal tree map is around 50 MB. This is evident from heap dump
> >> > analysis, which I have screenshots of that I can post here, if that
> >> > helps. As there are many segments of various sizes in the index, as
> >> > expected, the total heap usage for one shard stands at around 280 MB.
> >> >
> >> > Could someone shed some light on whether this is expected, and if
> >> > so - how could I possibly trim down memory usage here? Is there a way
> >> > to switch to a different terms index implementation, one that doesn't
> >> > preload all the terms into RAM, or only does this partially, i.e. as
> >> > a cache? I'm not sure if I'm framing my questions correctly, as I'm
> >> > obviously not an expert on Lucene's internals, but this is going to
> >> > become a critical issue for large scale use cases of our system.
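P.S. On the original question about a terms index that doesn't hold everything in RAM: as far as I can tell, BlockTree already keeps only a prefix index (an FST per field, per segment) on the heap rather than every term, but with ~300M mostly-unique values that index can still be large. One knob that appears to exist in 4.6.x is the terms dictionary block size - larger blocks should mean a smaller in-heap index, at some cost in term lookup speed. A rough, untested sketch (the 64/128 sizes and the path are arbitrary guesses, not recommendations):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BigBlockTermsSketch {

  public static void main(String[] args) throws IOException {
    // Defaults are 25/48; bigger blocks -> fewer index entries -> smaller
    // per-segment terms index (FST) on the heap, slower individual term seeks.
    final PostingsFormat bigBlocks = new Lucene41PostingsFormat(64, 128);

    Codec codec = new Lucene46Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        // Could be limited to just the high-cardinality fields.
        return bigBlocks;
      }
    };

    Directory dir = FSDirectory.open(new File("/path/to/index")); // made-up path
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    iwc.setCodec(codec); // only newly flushed/merged segments pick this up
    IndexWriter writer = new IndexWriter(dir, iwc);
    // ... add documents as usual ...
    writer.close();
    dir.close();
  }
}

Note that this would only affect segments written after the codec is set, so existing segments would have to be rewritten (e.g. through merging) before the heap footprint changes.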