Here's the link:
https://drive.google.com/file/d/0B5eRTXMELFjjbUhSUW9pd2lVN00/edit?usp=sharing

I'm indexing, let's say, 11 unique fields per document. Also, the NRT reader
is reopened continually, and "regular" searches use that one. But a special
feature allows searching at a particular point in time (those commit points
get cleaned out based on separate logic), which requires opening a non-NRT
reader just to service such search requests. As I understand it, none of that
reader's segment readers can be shared with the NRT reader's pool... or am I
off here? This seems evident from another heap dump fragment, which shows a
full new set of segment readers attached to that "temporary" reader:

https://drive.google.com/file/d/0B5eRTXMELFjjSENXZV9kejR3bDA/edit?usp=sharing
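
In code, the overall setup looks roughly like this (a simplified sketch:
variable and helper names are illustrative, and the commit retention/cleanup
logic is elided):

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
    ReaderManager nrtManager = new ReaderManager(writer, true);

    // "Regular" search path: always the freshest NRT reader.
    DirectoryReader nrtReader = nrtManager.acquire();
    try {
      // ... run the query against new IndexSearcher(nrtReader) ...
    } finally {
      nrtManager.release(nrtReader);
    }

    // Point-in-time search path: a non-NRT reader opened at a retained commit.
    // This reader appears to build its own SegmentReaders for every segment,
    // sharing nothing with the pool behind the NRT reader above.
    IndexCommit commit = pickCommit(DirectoryReader.listCommits(dir)); // our own selection logic
    DirectoryReader pointInTimeReader = DirectoryReader.open(commit);
    try {
      // ... service the point-in-time search ...
    } finally {
      pointInTimeReader.close();
    }

The second path is the one that shows up in the heap dump with the extra set
of segment readers.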


On Thu, Aug 28, 2014 at 10:13 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm, the screenshot didn't make it... can you post a link?
>
> If you are using an NRT reader, then when a new one is opened, it won't
> open new SegmentReaders for all segments, just for the newly
> flushed/merged segments since the last reader was opened.  So for the N
> commit points you have readers open for, they will share SegmentReaders
> for the segments they have in common.
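
[Inline note: concretely, I take the above to mean the reopen path below,
where unchanged segments keep their SegmentReaders - just a sketch, not our
exact code:]

    DirectoryReader r1 = DirectoryReader.open(writer, true);   // initial NRT reader
    // ... more documents are indexed; some segments get flushed/merged ...
    DirectoryReader r2 = DirectoryReader.openIfChanged(r1, writer, true);
    if (r2 != null) {
      // r2 reuses r1's SegmentReaders for segments that did not change;
      // only the newly flushed/merged segments get brand-new SegmentReaders.
      r1.close();
      r1 = r2;
    }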
>
> How many unique fields are you adding?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Funstein <vfunst...@gmail.com>
> wrote:
> > Mike,
> >
> > Here's the screenshot; not sure if it will go through as an attachment
> > though - if not, I'll post it as a link. Please ignore the altered package
> > names, since Lucene is shaded in as part of our build process.
> >
> > Some more context about the use case. Yes, the terms are pretty much
> > unique; the schema for the data set is actually borrowed from here:
> > https://amplab.cs.berkeley.edu/benchmark/#workload - it's the UserVisits
> > set, with a couple of other fields added by us. The values for the fields
> > are generated almost randomly, though some string fields are picked at
> > random from a fixed dictionary.
> >
> > Also, this type of heap footprint might be tolerable if it stayed
> > relatively constant throughout the system's life cycle (given, of course,
> > that the index set stays more or less static). However, what happens here
> > is that one IndexReader reference is maintained by ReaderManager as an
> > NRT reader. But we would also like to support the ability to execute
> > searches against specific index commit points, ideally in parallel. As
> > you might imagine, as soon as a new DirectoryReader is opened at a given
> > commit, a whole new set of SegmentReader instances is created and
> > populated, effectively doubling the already large heap usage... if there
> > were a way to somehow reuse the readers for unchanged segments already
> > pooled by IndexWriter, that would help tremendously here. But I don't
> > think there's a way to link up the two sets, at least not in the Lucene
> > version we are using (4.6.1) - is this correct?
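
[Inline note: to make the question concrete, this is the closest thing I've
found in the 4.x API, and it's exactly the part I'm unsure about - sketch
only, variable names illustrative:]

    // Today: a fully separate reader per commit point, with its own SegmentReaders.
    DirectoryReader commitReader = DirectoryReader.open(someCommit);

    // What I'd hope for: derive the commit-point view from the current NRT reader,
    // so that unchanged segments reuse SegmentReaders that are already open.
    DirectoryReader current = readerManager.acquire();
    try {
      DirectoryReader maybeShared = DirectoryReader.openIfChanged(current, someCommit);
      // Does maybeShared (when non-null) actually share SegmentReaders with
      // 'current' for the segments the commit has in common? (Close it when done.)
    } finally {
      readerManager.release(current);
    }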
> >
> >
> > On Wed, Aug 27, 2014 at 12:56 AM, Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>
> >> This is surprising: unless you have an excessive number of unique
> >> fields, BlockTreeTermsReader shouldn't be such a big RAM consumer.
> >>
> >> But you only have 12 unique fields?
> >>
> >> Can you post screen shots of the heap usage?
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein <vfunst...@gmail.com>
> >> wrote:
> >> > This is a follow-up to the earlier thread I started to understand the
> >> > memory usage patterns of SegmentReader instances, but I decided to
> >> > create a separate post since this issue is much more serious than the
> >> > heap overhead created by the use of stored field compression.
> >> >
> >> > Here is the use case, once again. The index totals around 300M
> >> > documents, with 7 string, 2 long, 1 integer, 1 date and 1 float fields,
> >> > all of which are both indexed and stored. It is split into 4 shards,
> >> > which are basically separate indices... if that matters. After the
> >> > index is populated (but not optimized, since we don't do that), the
> >> > overall heap usage attributable to Lucene is over 1 GB, much of which
> >> > is taken up by instances of BlockTreeTermsReader. For instance, for the
> >> > largest segment in one such index, the retained heap size of the
> >> > internal tree map is around 50 MB. This is evident from heap dump
> >> > analysis; I have screenshots that I can post here, if that helps. As
> >> > there are many segments of various sizes in the index, the total heap
> >> > usage for one shard stands, as expected, at around 280 MB.
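
[Inline note: the per-document field layout is roughly as follows - a sketch
with illustrative field names (the real ones come from the UserVisits schema
plus our own additions):]

    Document doc = new Document();
    // 7 string fields, indexed and stored:
    doc.add(new StringField("sourceIP", sourceIp, Field.Store.YES));
    doc.add(new StringField("destURL", destUrl, Field.Store.YES));
    doc.add(new StringField("userAgent", userAgent, Field.Store.YES));
    doc.add(new StringField("countryCode", countryCode, Field.Store.YES));
    doc.add(new StringField("languageCode", languageCode, Field.Store.YES));
    doc.add(new StringField("searchWord", searchWord, Field.Store.YES));
    doc.add(new StringField("customTag", customTag, Field.Store.YES));
    // 2 longs, 1 int, 1 float, plus the date kept as a long timestamp:
    doc.add(new LongField("customId", customId, Field.Store.YES));
    doc.add(new LongField("sessionId", sessionId, Field.Store.YES));
    doc.add(new IntField("duration", duration, Field.Store.YES));
    doc.add(new FloatField("adRevenue", adRevenue, Field.Store.YES));
    doc.add(new LongField("visitDate", visitDateMillis, Field.Store.YES));
    writer.addDocument(doc);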
> >> >
> >> > Could someone shed some light on whether this is expected, and if so -
> >> > how could I possibly trim down memory usage here? Is there a way to
> >> > switch to a different terms index implementation, one that doesn't
> >> > preload all the terms into RAM, or only does this partially, i.e. as a
> >> > cache? I'm not sure if I'm framing my questions correctly, as I'm
> >> > obviously not an expert on Lucene's internals, but this is going to
> >> > become a critical issue for large-scale use cases of our system.
> >>
