Thank you for isolating & raising the issue!! Mike
On Sat, Apr 10, 2010 at 1:28 AM, Ruben Laguna <ruben.lag...@gmail.com> wrote:
> I just tried the changes that you committed; it works beautifully. The
> readers are GCed. Thanks for both LUCENE-2387 and LUCENE-2384. Those make
> a big difference in my app!
>
> On Fri, Apr 9, 2010 at 12:32 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> I agree IW should not hold refs to the Field instances from the last
>> doc indexed... I put a patch on LUCENE-2387 to null the reference as
>> we go. Can you confirm this lets GC reclaim?
>>
>> Mike
>>
>> On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna <ruben.lag...@gmail.com>
>> wrote:
>>> But the Readers I'm talking about are not held by the Tokenizer (at
>>> least not *only* by it); they are held by the
>>> DocFieldProcessorPerThread...
>>>
>>> IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
>>> DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
>>> Field (fieldsData)
>>>
>>> And it's not only one Reader; there are several (one per thread, I
>>> suppose; in my heap dump there are 25 Readers that should have been
>>> GCed otherwise).
>>>
>>> Best regards/Ruben
>>>
>>> On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>
>>>> There is one possibility that could be fixed:
>>>>
>>>> As Tokenizers are reused, the analyzer holds a reference to the last
>>>> used Reader. The easy fix would be to unset the Reader in
>>>> Tokenizer.close(). If this is the case for you, that may be easy to
>>>> do. So Tokenizer.close() looks like this:
>>>>
>>>> /** By default, closes the input Reader. */
>>>> @Override
>>>> public void close() throws IOException {
>>>>   input.close();
>>>>   input = null; // <-- new!
>>>> }
>>>>
>>>> Any comments from other committers?
>>>>
>>>> -----
>>>> Uwe Schindler
>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>> http://www.thetaphi.de
>>>> eMail: u...@thetaphi.de
>>>>
>>>>> -----Original Message-----
>>>>> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
>>>>> Sent: Thursday, April 08, 2010 2:50 PM
>>>>> To: java-u...@lucene.apache.org
>>>>> Subject: Re: IndexWriter memory leak?
>>>>>
>>>>> I will double check the heapdump.hprof in the afternoon. But I think
>>>>> that *some* readers are indeed held by
>>>>> docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
>>>>> as shown in [1] (this heap dump contains only live objects). The heap
>>>>> dump was taken after IndexWriter.commit()/IndexWriter.optimize(), and
>>>>> all the Documents were already indexed and GCed (I will double check).
>>>>>
>>>>> So that would mean that the Reader is retained in memory by the
>>>>> following chain of references:
>>>>>
>>>>> DocumentsWriter -> DocumentsWriterThreadState ->
>>>>> DocFieldProcessorPerThread -> DocFieldProcessorPerField ->
>>>>> Fieldable -> Field (fieldsData)
>>>>>
>>>>> I'll double check with Eclipse MAT, as I said, that this chain is
>>>>> actually made of hard references only (no SoftReferences,
>>>>> WeakReferences, etc.). I will also double check that there is no
>>>>> "live" Document that is referencing the Reader via the Field.
>>>>>
>>>>> [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
>>>>>
>>>>> On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>>
>>>>>> Readers are not held. If you indexed the document and GCed the
>>>>>> document instance, the readers are gone.
>>>>>>
>>>>>> -----
>>>>>> Uwe Schindler
>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>> http://www.thetaphi.de
>>>>>> eMail: u...@thetaphi.de
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
>>>>>>> Sent: Thursday, April 08, 2010 1:28 PM
>>>>>>> To: java-u...@lucene.apache.org
>>>>>>> Subject: Re: IndexWriter memory leak?
>>>>>>>
>>>>>>> Now that the zzBuffer issue is solved...
>>>>>>>
>>>>>>> What about the references to the Readers held by docWriter? Tika's
>>>>>>> ParsingReaders are quite heavyweight, so retaining those in memory
>>>>>>> unnecessarily is also a "hidden" memory leak. Should I open a bug
>>>>>>> report on that one?
>>>>>>>
>>>>>>> /Rubén
>>>>>>>
>>>>>>> On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Guess we were replying at the same time :).
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <u...@thetaphi.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I already answered that I will take care of this!
>>>>>>>>>
>>>>>>>>> Uwe
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> Uwe Schindler
>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>> http://www.thetaphi.de
>>>>>>>>> eMail: u...@thetaphi.de
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Shai Erera [mailto:ser...@gmail.com]
>>>>>>>>>> Sent: Thursday, April 08, 2010 12:00 PM
>>>>>>>>>> To: java-u...@lucene.apache.org
>>>>>>>>>> Subject: Re: IndexWriter memory leak?
>>>>>>>>>>
>>>>>>>>>> Yes, that's the trimBuffer version I was thinking about, only
>>>>>>>>>> this guy created a reset(Reader, int) and does both ops
>>>>>>>>>> (resetting + trim) in one method call. More convenient. Can you
>>>>>>>>>> please open an issue to track that? People will have a chance to
>>>>>>>>>> comment on whether we (Lucene) should handle that, or whether it
>>>>>>>>>> should be a JFlex fix. Based on the number of replies this guy
>>>>>>>>>> received (0!), I doubt JFlex would consider it a problem. But we
>>>>>>>>>> can do some small service to our user base by protecting against
>>>>>>>>>> such problems.
>>>>>>>>>>
>>>>>>>>>> And while you're opening the issue, if you want to take a stab
>>>>>>>>>> at fixing it and post a patch, it'd be great :).
>>>>>>>>>>
>>>>>>>>>> Shai
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
>>>>>>>>>> <ruben.lag...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I was investigating this a little further, and in the JFlex
>>>>>>>>>>> mailing list I found [1].
>>>>>>>>>>>
>>>>>>>>>>> I don't know much about flex/JFlex, but it seems that this guy
>>>>>>>>>>> resets the zzBuffer to 16384 chars or less when setting the
>>>>>>>>>>> input for the lexer.
>>>>>>>>>>>
>>>>>>>>>>> Quoted from shef <she...@ya...>:
>>>>>>>>>>>
>>>>>>>>>>> I set
>>>>>>>>>>>
>>>>>>>>>>> %buffer 0
>>>>>>>>>>>
>>>>>>>>>>> in the options section, and then added this method to the lexer:
>>>>>>>>>>>
>>>>>>>>>>> /**
>>>>>>>>>>>  * Set the input for the lexer. The size parameter really
>>>>>>>>>>>  * speeds things up, because by default the lexer allocates an
>>>>>>>>>>>  * internal buffer of 16k. For most strings, this is
>>>>>>>>>>>  * unnecessarily large. If the size param is 0 or greater than
>>>>>>>>>>>  * 16k, then the buffer is set to 16k. If the size param is
>>>>>>>>>>>  * smaller, then the buf will be set to the exact size.
>>>>>>>>>>>  * @param r the reader that provides the data
>>>>>>>>>>>  * @param size the size of the data in the reader
>>>>>>>>>>>  */
>>>>>>>>>>> public void reset(Reader r, int size) {
>>>>>>>>>>>   if (size == 0 || size > 16384)
>>>>>>>>>>>     size = 16384;
>>>>>>>>>>>   zzBuffer = new char[size];
>>>>>>>>>>>   yyreset(r);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> So maybe there is a way to trim the zzBuffer this way (?).
>>>>>>>>>>>
>>>>>>>>>>> BTW, I will try to find out which is the "big token" in my
>>>>>>>>>>> dataset this afternoon. Thanks for the help.
>>>>>>>>>>>
>>>>>>>>>>> I actually worked around this memory problem for the time being
>>>>>>>>>>> by wrapping the IndexWriter in a class that periodically closes
>>>>>>>>>>> the IndexWriter and creates a new one, allowing the old one to
>>>>>>>>>>> be GCed, but it would be really good if either JFlex or Lucene
>>>>>>>>>>> could take care of this zzBuffer going berserk.
>>>>>>>>>>>
>>>>>>>>>>> Again, thanks for the quick response. /Rubén
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If we could change the Flex file so that yyreset(Reader) would
>>>>>>>>>>>> check the size of zzBuffer, we could trim it when it gets too
>>>>>>>>>>>> big.
>>>>>>>>>>>> But I don't think we have such control when writing the flex
>>>>>>>>>>>> syntax... yyreset is generated by JFlex, and that's the only
>>>>>>>>>>>> place I can think of to trim the buffer down when it exceeds a
>>>>>>>>>>>> predefined threshold...
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe what we can do is create our own method, which will be
>>>>>>>>>>>> called by StandardTokenizer after yyreset is called, something
>>>>>>>>>>>> like trimBufferIfTooBig(int threshold), which will reallocate
>>>>>>>>>>>> zzBuffer if it exceeded the threshold. We can decide on a
>>>>>>>>>>>> reasonable 64 KB threshold or something, or simply always cut
>>>>>>>>>>>> back to 16 KB. As far as I understand, that buffer should never
>>>>>>>>>>>> grow that much. I.e., in zzRefill, which is the only place
>>>>>>>>>>>> where the buffer gets resized, there is an attempt to first
>>>>>>>>>>>> move back characters that were already consumed, and only then
>>>>>>>>>>>> allocate a bigger buffer. Which means the buffer gets expanded
>>>>>>>>>>>> only if there is a token whose size is larger than 16 KB (!?).
>>>>>>>>>>>>
>>>>>>>>>>>> A trimBuffer method might not be that bad... as a protective
>>>>>>>>>>>> measure. What do you think?
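[Editorial note: the trimBufferIfTooBig idea discussed above could look
roughly like the sketch below. The class name, the growFor stand-in for
zzRefill's doubling behavior, and the 16 KB constant are assumptions modeled
on JFlex-generated scanners; this is not actual Lucene or JFlex code.]

```java
// Sketch of the trimBufferIfTooBig idea from the discussion above.
// ZZ_BUFFERSIZE, zzBuffer, and the method names are assumptions modeled
// on JFlex-generated scanners, not actual Lucene/JFlex code.
public class ScannerBufferTrimSketch {
  private static final int ZZ_BUFFERSIZE = 16384; // JFlex default buffer size
  private char[] zzBuffer = new char[ZZ_BUFFERSIZE];

  /** Shrink zzBuffer back to the default size if a huge token grew it. */
  public void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
      zzBuffer = new char[ZZ_BUFFERSIZE];
    }
  }

  public int bufferLength() {
    return zzBuffer.length;
  }

  /** Simulates zzRefill doubling the buffer for an oversized token. */
  public void growFor(int tokenLength) {
    while (zzBuffer.length < tokenLength) {
      char[] newBuffer = new char[zzBuffer.length * 2];
      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
      zzBuffer = newBuffer;
    }
  }
}
```

Called once per reset (after yyreset), this keeps a single pathological
document from pinning a multi-megabyte buffer for the lifetime of the
reused tokenizer.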
>>>>>>>>>>>> Of course, JFlex can fix it on their own... but until that
>>>>>>>>>>>> happens...
>>>>>>>>>>>>
>>>>>>>>>>>> Shai
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to identify also the problematic document. I
>>>>>>>>>>>>>> have 10000, so what would be the best way of identifying the
>>>>>>>>>>>>>> one that is making zzBuffer grow without control?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Don't index your documents, but instead pass them directly to
>>>>>>>>>>>>> the analyzer and consume the token stream manually. Then visit
>>>>>>>>>>>>> TermAttribute.termLength() for each Token.
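[Editorial note: Uwe's suggestion above can be sketched as follows, assuming
the Lucene 2.9/3.0-era analysis API that the TermAttribute.termLength() call
implies. WhitespaceAnalyzer, the field name "content", and the helper class
are placeholders; substitute whatever analyzer your IndexWriter actually
uses.]

```java
// Sketch: instead of indexing, run each document's text through the
// analyzer yourself and record the longest token it produces. The
// document whose max token length approaches zzBuffer's size is the
// one blowing up the buffer.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLength {
  /** Returns the length of the longest token the analyzer produces. */
  public static int maxTokenLength(Analyzer analyzer, String text)
      throws IOException {
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    int max = 0;
    while (ts.incrementToken()) {
      max = Math.max(max, termAtt.termLength());
    }
    ts.close();
    return max;
  }

  public static void main(String[] args) throws IOException {
    // Run this over each of the 10000 documents and print the worst case.
    Analyzer analyzer = new WhitespaceAnalyzer();
    System.out.println(maxTokenLength(analyzer, "a bb ccc"));
  }
}
```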
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> /Rubén
>>>>>>>
>>>>>>> --
>>>>>>> /Rubén
>>>>>
>>>>> --
>>>>> /Rubén
>>>
>>> --
>>> /Rubén
>
> --
> /Rubén

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org