There is one possibility that could be fixed:

As Tokenizers are reused, the analyzer holds a reference to the last used 
Reader. The easy fix would be to unset the Reader in Tokenizer.close(), so 
the reference is released as soon as the stream is closed. Tokenizer.close() 
would then look like this:

  /** By default, closes the input Reader. */
  @Override
  public void close() throws IOException {
    input.close();
    input = null; // <-- new! releases the Reader so it can be GCed
  }
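
To see why the null-out matters: with reusable token streams, the cached
Tokenizer pins whatever Reader it saw last. A minimal sketch of the retention
(assuming the 3.0 reusableTokenStream API; class name, field name and contents
are made up for illustration):

  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.util.Version;

  public class ReaderRetentionSketch {
    public static void main(String[] args) throws Exception {
      Analyzer a = new StandardAnalyzer(Version.LUCENE_30);
      // The analyzer creates a Tokenizer on first use and caches it per thread.
      TokenStream ts = a.reusableTokenStream("body", new StringReader("first doc"));
      ts.reset();
      while (ts.incrementToken()) { /* consume the document */ }
      ts.end();
      ts.close();
      // Without "input = null" in Tokenizer.close(), the cached Tokenizer
      // still references the StringReader until the next reusableTokenStream()
      // call swaps in a new Reader. With a heavyweight Reader (e.g. Tika's
      // ParsingReader) that retention is exactly the leak described below.
    }
  }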

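For the zzBuffer growth discussed further down, Shai's trimBufferIfTooBig()
idea could look roughly like this (only a sketch: the method would live in the
JFlex-generated StandardTokenizerImpl, zzBuffer and ZZ_BUFFERSIZE are fields
the generator emits, and the 64 KB threshold is just the value floated in the
thread):

  // Hypothetical addition to the generated StandardTokenizerImpl:
  final void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
      // Safe right after yyreset(Reader): no unconsumed chars to preserve.
      zzBuffer = new char[ZZ_BUFFERSIZE]; // back to the 16 KB default
    }
  }

  // StandardTokenizer.reset(Reader) would then call, after the usual
  // scanner.yyreset(input):
  //   scanner.trimBufferIfTooBig(64 * 1024);
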
Any comments from other committers?
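
P.S.: For Ruben's question about finding the document with the huge token,
this is roughly what I mean by consuming the token stream manually (a sketch
against the 2.9/3.0 TermAttribute API; the class and method names are made up,
and the analyzer/field should be whatever you index with):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public class BigTokenFinder {
    /** Returns the longest term length the analyzer produces for one document. */
    static int maxTokenLength(Analyzer a, String field, Reader content) throws Exception {
      TokenStream ts = a.reusableTokenStream(field, content);
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      int max = 0;
      ts.reset();
      while (ts.incrementToken()) {
        max = Math.max(max, term.termLength());
      }
      ts.end();
      ts.close();
      return max;
    }
  }

Running this over all 10000 documents and logging any document whose longest
term exceeds, say, 16384 chars should point at the one blowing up zzBuffer.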

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
> Sent: Thursday, April 08, 2010 2:50 PM
> To: java-u...@lucene.apache.org
> Subject: Re: IndexWriter memory leak?
> 
> I will double check in the afternoon the heapdump.hprof. But I think that
> *some* readers are indeed held by
> docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
> as shown in [1] (this heapdump contains only live objects). The heapdump
> was taken after IndexWriter.commit() / IndexWriter.optimize(), and all the
> Documents were already indexed and GCed (I will double check).
> 
> So that would mean that the Reader is retained in memory by the following
> chain of references:
> 
> DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
> -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
> 
> I'll double check with Eclipse MAT, as I said, that this chain is actually
> made of hard references only (no SoftReferences, WeakReferences, etc.). I
> will also double check that there is no "live" Document that is referencing
> the Reader via the Field.
> 
> 
> [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> 
> On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> 
> > Readers are not held. If you indexed the document and GCed the document
> > instance, the readers are gone.
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> > > -----Original Message-----
> > > From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
> > > Sent: Thursday, April 08, 2010 1:28 PM
> > > To: java-u...@lucene.apache.org
> > > Subject: Re: IndexWriter memory leak?
> > >
> > > Now that the zzBuffer issue is solved...
> > >
> > > What about the references to the Readers held by docWriter? Tika's
> > > ParsingReaders are quite heavyweight, so retaining those in memory
> > > unnecessarily is also a "hidden" memory leak. Should I open a bug
> > > report on that one?
> > >
> > > /Rubén
> > >
> > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <ser...@gmail.com> wrote:
> > >
> > > > Guess we were replying at the same time :).
> > > >
> > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > >
> > > > > I already answered that I will take care of this!
> > > > >
> > > > > Uwe
> > > > >
> > > > > -----
> > > > > Uwe Schindler
> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > http://www.thetaphi.de
> > > > > eMail: u...@thetaphi.de
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Shai Erera [mailto:ser...@gmail.com]
> > > > > > Sent: Thursday, April 08, 2010 12:00 PM
> > > > > > To: java-u...@lucene.apache.org
> > > > > > Subject: Re: IndexWriter memory leak?
> > > > > >
> > > > > > Yes, that's the trimBuffer version I was thinking about, only this
> > > > > > guy created a reset(Reader, int) and does both ops (resetting +
> > > > > > trim) in one method call. More convenient. Can you please open an
> > > > > > issue to track that? People will have a chance to comment on
> > > > > > whether we (Lucene) should handle that, or whether it should be a
> > > > > > JFlex fix. Based on the number of replies this guy received (0!),
> > > > > > I doubt JFlex would consider it a problem. But we can do some
> > > > > > small service to our user base by protecting against such
> > > > > > problems.
> > > > > >
> > > > > > And while you're opening the issue, if you want to take a stab at
> > > > > > fixing it and post a patch, it'd be great :).
> > > > > >
> > > > > > Shai
> > > > > >
> > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna <ruben.lag...@gmail.com> wrote:
> > > > > >
> > > > > > > I was investigating this a little further, and in the JFlex
> > > > > > > mailing list I found [1].
> > > > > > >
> > > > > > > I don't know much about flex / JFlex, but it seems that this guy
> > > > > > > resets the zzBuffer to 16384 or less when setting the input for
> > > > > > > the lexer:
> > > > > > >
> > > > > > > Quoted from shef <she...@ya...>:
> > > > > > >
> > > > > > >
> > > > > > > I set
> > > > > > >
> > > > > > > %buffer 0
> > > > > > >
> > > > > > > in the options section, and then added this method to the lexer:
> > > > > > >
> > > > > > >    /**
> > > > > > >     * Set the input for the lexer. The size parameter really
> > > > > > >     * speeds things up, because by default, the lexer allocates
> > > > > > >     * an internal buffer of 16k. For most strings, this is
> > > > > > >     * unnecessarily large. If the size param is 0 or greater
> > > > > > >     * than 16k, then the buffer is set to 16k. If the size param
> > > > > > >     * is smaller, then the buf will be set to the exact size.
> > > > > > >     * @param r the reader that provides the data
> > > > > > >     * @param size the size of the data in the reader.
> > > > > > >     */
> > > > > > >    public void reset(Reader r, int size) {
> > > > > > >        if (size == 0 || size > 16384)
> > > > > > >            size = 16384;
> > > > > > >        zzBuffer = new char[size];
> > > > > > >        yyreset(r);
> > > > > > >    }
> > > > > > >
> > > > > > >
> > > > > > > So maybe there is a way to trim the zzBuffer this way (?).
> > > > > > >
> > > > > > > BTW, I will try to find out which is the "big token" in my
> > > > > > > dataset this afternoon. Thanks for the help.
> > > > > > >
> > > > > > > I actually worked around this memory problem for the time being
> > > > > > > by wrapping the IndexWriter in a class that periodically closes
> > > > > > > the IndexWriter and creates a new one, allowing the old to be
> > > > > > > GCed, but it would be really good if either JFlex or Lucene
> > > > > > > could take care of this zzBuffer going berserk.
> > > > > > >
> > > > > > > Again thanks for the quick response. /Rubén
> > > > > > >
> > > > > > >
> > > > > > > [1]
> > > > > > > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
> > > > > > >
> > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com> wrote:
> > > > > > >
> > > > > > > > If we could change the Flex file so that yyreset(Reader)
> > > > > > > > would check the size of zzBuffer, we could trim it when it
> > > > > > > > gets too big. But I don't think we have such control when
> > > > > > > > writing the flex syntax ... yyreset is generated by JFlex,
> > > > > > > > and that's the only place I can think of to trim the buffer
> > > > > > > > down when it exceeds a predefined threshold.
> > > > > > > >
> > > > > > > > Maybe what we can do is create our own method which will be
> > > > > > > > called by StandardTokenizer after yyreset is called, something
> > > > > > > > like trimBufferIfTooBig(int threshold), which will reallocate
> > > > > > > > zzBuffer if it exceeded the threshold. We can decide on a
> > > > > > > > reasonable 64 KB threshold or something, or simply always cut
> > > > > > > > back to 16 KB. As far as I understand, that buffer should
> > > > > > > > never grow that much. I.e., in zzRefill, which is the only
> > > > > > > > place where the buffer gets resized, there is an attempt to
> > > > > > > > first move back characters that were already consumed and only
> > > > > > > > then allocate a bigger buffer. Which means the buffer gets
> > > > > > > > expanded only if there is a token whose size is larger than
> > > > > > > > 16 KB (!?).
> > > > > > > >
> > > > > > > > A trimBuffer method might not be that bad as a protective
> > > > > > > > measure. What do you think? Of course, JFlex can fix it on
> > > > > > > > their own ... but until that happens ...
> > > > > > > >
> > > > > > > > Shai
> > > > > > > >
> > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > > > > >
> > > > > > > > > > I would also like to identify the problematic document.
> > > > > > > > > > I have 10000, so what would be the best way of identifying
> > > > > > > > > > the one that is making zzBuffer grow without control?
> > > > > > > > >
> > > > > > > > > Don't index your documents, but instead pass them directly
> > > > > > > > > to the analyzer and consume the token stream manually. Then
> > > > > > > > > visit TermAttribute.termLength() for each Token.
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > /Rubén
> > > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > /Rubén
> >
> >
> 
> 
> --
> /Rubén