But the Readers I'm talking about are not held by the Tokenizer (at least
not *only* by it); they are held by the DocFieldProcessorPerThread:

IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
Field (fieldsData)

And it's not only one Reader; there are several (one per thread, I suppose;
in my heap dump there are 25 Readers that should otherwise have been GCed).
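To make the retention concrete, here is a minimal sketch (hypothetical field
name and Reader; Lucene 3.0-era API) of how a Field built from a Reader keeps
that Reader strongly reachable through fieldsData:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReaderRetentionSketch {
    public static Document makeDoc() {
        // stand-in for a heavyweight Tika ParsingReader
        Reader parsingReader = new StringReader("...extracted text...");
        Document doc = new Document();
        // Field(String, Reader) stores the Reader itself in fieldsData, so
        // the Reader stays strongly reachable for as long as the Field does;
        // and the Field is what DocFieldProcessorPerField keeps per thread.
        doc.add(new Field("content", parsingReader));
        return doc;
    }
}

If the chain above keeps the Field alive after addDocument(), the Reader at
the end of that chain cannot be collected either.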
Best regards
/Ruben

On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler <[email protected]> wrote:
> There is one possibility that could be fixed:
>
> As Tokenizers are reused, the analyzer holds a reference to the last used
> Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If
> this is the case for you, that may be easy to do. Tokenizer.close() would
> then look like this:
>
> /** By default, closes the input Reader. */
> @Override
> public void close() throws IOException {
>   input.close();
>   input = null; // <-- new!
> }
>
> Any comments from other committers?
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
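For context, the reuse Uwe describes can be sketched roughly like this (a
simplified illustration assuming the Lucene 2.9/3.0-era reusableTokenStream
API; WhitespaceTokenizer stands in for any Tokenizer):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ReuseSketchAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WhitespaceTokenizer(reader);
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer cached = (Tokenizer) getPreviousTokenStream();
        if (cached == null) {
            cached = new WhitespaceTokenizer(reader);
            setPreviousTokenStream(cached); // cached per thread, long-lived
        } else {
            cached.reset(reader); // only here is the previous Reader dropped
        }
        // Until the next reset(reader), the cached Tokenizer's input field
        // pins the last Reader; the close() fix above nulls it out instead.
        return cached;
    }
}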
> > -----Original Message-----
> > From: Ruben Laguna [mailto:[email protected]]
> > Sent: Thursday, April 08, 2010 2:50 PM
> > To: [email protected]
> > Subject: Re: IndexWriter memory leak?
> >
> > I will double-check the heapdump.hprof in the afternoon. But I think that
> > *some* Readers are indeed held by
> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
> > as shown in [1] (this heap dump contains only live objects). The heap
> > dump was taken after IndexWriter.commit() / IndexWriter.optimize(), and
> > all the Documents were already indexed and GCed (I will double-check).
> >
> > So that would mean that the Reader is retained in memory by the following
> > chain of references:
> >
> > DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
> > -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
> >
> > I'll double-check with Eclipse MAT, as I said, that this chain is actually
> > made of hard references only (no SoftReferences, WeakReferences, etc.). I
> > will also double-check that there is no "live" Document that is
> > referencing the Reader via the Field.
> >
> > [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> >
> > On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <[email protected]> wrote:
> > > Readers are not held. If you indexed the document and GCed the document
> > > instance, the Readers are gone.
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: [email protected]
> > >
> > > > -----Original Message-----
> > > > From: Ruben Laguna [mailto:[email protected]]
> > > > Sent: Thursday, April 08, 2010 1:28 PM
> > > > To: [email protected]
> > > > Subject: Re: IndexWriter memory leak?
> > > >
> > > > Now that the zzBuffer issue is solved...
> > > >
> > > > What about the references to the Readers held by docWriter? Tika's
> > > > ParsingReaders are quite heavyweight, so retaining those in memory
> > > > unnecessarily is also a "hidden" memory leak. Should I open a bug
> > > > report on that one?
> > > >
> > > > /Rubén
> > > >
> > > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <[email protected]> wrote:
> > > > > Guess we were replying at the same time :).
> > > > >
> > > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <[email protected]> wrote:
> > > > > > I already answered that I will take care of this!
> > > > > >
> > > > > > Uwe
> > > > > >
> > > > > > -----
> > > > > > Uwe Schindler
> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > http://www.thetaphi.de
> > > > > > eMail: [email protected]
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Shai Erera [mailto:[email protected]]
> > > > > > > Sent: Thursday, April 08, 2010 12:00 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: IndexWriter memory leak?
> > > > > > >
> > > > > > > Yes, that's the trimBuffer version I was thinking about, only
> > > > > > > this guy created a reset(Reader, int) that does both ops
> > > > > > > (resetting + trimming) in one method call. More convenient. Can
> > > > > > > you please open an issue to track that? People will have a chance
> > > > > > > to comment on whether we (Lucene) should handle it, or whether it
> > > > > > > should be a JFlex fix. Based on the number of replies this guy
> > > > > > > received (0!), I doubt JFlex would consider it a problem. But we
> > > > > > > can do a small service to our user base by protecting against
> > > > > > > such problems.
> > > > > > >
> > > > > > > And while you're opening the issue, if you want to take a stab at
> > > > > > > fixing it and post a patch, it'd be great :).
> > > > > > >
> > > > > > > Shai
> > > > > > >
> > > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna <[email protected]> wrote:
> > > > > > > > I was investigating this a little further, and in the JFlex
> > > > > > > > mailing list I found [1].
> > > > > > > >
> > > > > > > > I don't know much about flex / JFlex, but it seems that this
> > > > > > > > guy resets the zzBuffer to 16384 chars or less when setting
> > > > > > > > the input for the lexer.
> > > > > > > >
> > > > > > > > Quoted from shef <she...@ya...>:
> > > > > > > >
> > > > > > > > I set
> > > > > > > >
> > > > > > > > %buffer 0
> > > > > > > >
> > > > > > > > in the options section, and then added this method to the lexer:
> > > > > > > >
> > > > > > > > /**
> > > > > > > >  * Set the input for the lexer. The size parameter really
> > > > > > > >  * speeds things up, because by default the lexer allocates an
> > > > > > > >  * internal buffer of 16k. For most strings, this is
> > > > > > > >  * unnecessarily large. If the size param is 0 or greater than
> > > > > > > >  * 16k, then the buffer is set to 16k. If the size param is
> > > > > > > >  * smaller, then the buffer will be set to the exact size.
> > > > > > > >  * @param r the reader that provides the data
> > > > > > > >  * @param size the size of the data in the reader.
> > > > > > > >  */
> > > > > > > > public void reset(Reader r, int size) {
> > > > > > > >     if (size == 0 || size > 16384)
> > > > > > > >         size = 16384;
> > > > > > > >     zzBuffer = new char[size];
> > > > > > > >     yyreset(r);
> > > > > > > > }
> > > > > > > >
> > > > > > > > So maybe there is a way to trim the zzBuffer this way (?).
> > > > > > > >
> > > > > > > > BTW, I will try to find out which is the "big token" in my
> > > > > > > > dataset this afternoon. Thanks for the help.
> > > > > > > >
> > > > > > > > I actually work around this memory problem for the time being
> > > > > > > > by wrapping the IndexWriter in a class that periodically closes
> > > > > > > > the IndexWriter and creates a new one, allowing the old one to
> > > > > > > > be GCed, but it would be really good if either JFlex or Lucene
> > > > > > > > could take care of this zzBuffer going berserk.
> > > > > > > >
> > > > > > > > Again, thanks for the quick response. /Rubén
> > > > > > > >
> > > > > > > > [1] https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
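The periodic close-and-reopen workaround Ruben describes might look roughly
like this (a hypothetical wrapper class; Lucene 3.0-era IndexWriter API, and
the batch size is arbitrary):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class RecyclingIndexWriter {
    private final Directory dir;
    private final Analyzer analyzer;
    private final int docsPerWriter; // e.g. 1000, tune to taste
    private IndexWriter writer;
    private int count;

    public RecyclingIndexWriter(Directory dir, Analyzer analyzer,
                                int docsPerWriter) throws IOException {
        this.dir = dir;
        this.analyzer = analyzer;
        this.docsPerWriter = docsPerWriter;
        this.writer = new IndexWriter(dir, analyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public synchronized void addDocument(Document doc) throws IOException {
        writer.addDocument(doc);
        if (++count >= docsPerWriter) {
            writer.close(); // releases per-thread state and grown buffers
            writer = new IndexWriter(dir, analyzer,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            count = 0; // the old writer is now garbage-collectable
        }
    }

    public synchronized void close() throws IOException {
        writer.close();
    }
}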
> > > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <[email protected]> wrote:
> > > > > > > > > If we could change the flex file so that yyreset(Reader)
> > > > > > > > > would check the size of zzBuffer, we could trim it when it
> > > > > > > > > gets too big. But I don't think we have such control when
> > > > > > > > > writing the flex syntax ... yyreset is generated by JFlex,
> > > > > > > > > and that's the only place I can think of to trim the buffer
> > > > > > > > > down when it exceeds a predefined threshold.
> > > > > > > > >
> > > > > > > > > Maybe what we can do is create our own method, called by
> > > > > > > > > StandardTokenizer after yyreset, something like
> > > > > > > > > trimBufferIfTooBig(int threshold), which will reallocate
> > > > > > > > > zzBuffer if it exceeded the threshold. We can decide on a
> > > > > > > > > reasonable 64 KB threshold or something, or simply always cut
> > > > > > > > > back to 16 KB. As far as I understand, that buffer should
> > > > > > > > > never grow that much. I.e., in zzRefill, which is the only
> > > > > > > > > place where the buffer gets resized, there is an attempt to
> > > > > > > > > first move back characters that were already consumed, and
> > > > > > > > > only then allocate a bigger buffer. Which means the buffer
> > > > > > > > > gets expanded only if there is a token whose size is larger
> > > > > > > > > than 16 KB (!?).
> > > > > > > > >
> > > > > > > > > A trimBuffer method might not be that bad, as a protective
> > > > > > > > > measure. What do you think? Of course, JFlex can fix it on
> > > > > > > > > their own ... but until that happens ...
> > > > > > > > >
> > > > > > > > > Shai
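The method Shai has in mind might look roughly like this (a hypothetical
sketch, not existing Lucene or JFlex code; zzBuffer is the internal char
buffer of the JFlex-generated StandardTokenizerImpl):

// To be added to the generated scanner and called by StandardTokenizer
// right after yyreset(Reader); at that point no buffered characters need
// to be preserved, so the buffer can simply be replaced.
void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
        zzBuffer = new char[threshold]; // e.g. cut back to 16 KB
    }
}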
> > > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <[email protected]> wrote:
> > > > > > > > > > > I would like to identify the problematic document as
> > > > > > > > > > > well. I have 10,000, so what would be the best way of
> > > > > > > > > > > identifying the one that is making zzBuffer grow without
> > > > > > > > > > > control?
> > > > > > > > > >
> > > > > > > > > > Don't index your documents, but instead pass them directly
> > > > > > > > > > to the analyzer and consume the token stream manually. Then
> > > > > > > > > > visit TermAttribute.termLength() for each token.
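A minimal sketch of the check Uwe suggests above (assuming the Lucene
2.9/3.0-era attribute API; the class and method names here are made up):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLength {
    // Returns the longest token produced from this document's text, so the
    // documents that blow up zzBuffer can be identified without indexing.
    public static int of(Analyzer analyzer, String field, Reader reader)
            throws IOException {
        TokenStream ts = analyzer.tokenStream(field, reader);
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        int max = 0;
        while (ts.incrementToken()) {
            max = Math.max(max, term.termLength());
        }
        ts.close();
        return max;
    }
}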
--
/Rubén