Re: IndexWriter memory leak?

Ruben Laguna Thu, 08 Apr 2010 04:55:16 -0700

And by the way, when is Lucene 3.1 coming?

On Thu, Apr 8, 2010 at 1:27 PM, Ruben Laguna <ruben.lag...@gmail.com> wrote:


> Now that the zzBuffer issue is solved...
>
> what about the references to the Readers held by docWriter. Tika´s
> ParsingReaders are quite heavyweight so retaining those in memory
> unnecesarily is also a "hidden" memory leak. Should I open a bug report on
> that one?
>
> /Rubén
>
> On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <ser...@gmail.com> wrote:
>
>> Guess we were replying at the same time :).
>>
>> On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>> > I already answered, that I will take care of this!
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >
>> > > -----Original Message-----
>> > > From: Shai Erera [mailto:ser...@gmail.com]
>> > > Sent: Thursday, April 08, 2010 12:00 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: IndexWriter memory leak?
>> > >
>> > > Yes, that's the trimBuffer version I was thinking about, only this guy
>> > > created a reset(Reader, int) and does both ops (resetting + trim) in
>> > > one
>> > > method call. More convenient. Can you please open an issue to track
>> > > that?
>> > > People will have a chance to comment on whether we (Lucene) should
>> > > handle
>> > > that, or it should be a JFlex fix. Based on the number of replies this
>> > > guy
>> > > received (0 !), I doubt JFlex would consider it a problem. But we can
>> > > do
>> > > some small service to our users base by protecting against such
>> > > problems.
>> > >
>> > > And while you're opening the issue, if you want to take a stab at
>> > > fixing it
>> > > and post a patch, it'd be great :).
>> > >
>> > > Shai
>> > >
>> > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
>> > > <ruben.lag...@gmail.com>wrote:
>> > >
>> > > > I was investigating this a little further and in the JFlex mailing
>> > > list I
>> > > > found [1]
>> > > >
>> > > > I don't know much about flex / JFlex but it seems that this guy
>> > > resets the
>> > > > zzBuffer to 16384 or less when setting the input for the lexer
>> > > >
>> > > >
>> > > > Quoted from  shef <she...@ya...>
>> > > >
>> > > >
>> > > > I set
>> > > >
>> > > > %buffer 0
>> > > >
>> > > > in the options section, and then added this method to the lexer:
>> > > >
>> > > >    /**
>> > > >     * Set the input for the lexer. The size parameter really speeds
>> > > things
>> > > > up,
>> > > >     * because by default, the lexer allocates an internal buffer of
>> > > 16k.
>> > > > For
>> > > >     * most strings, this is unnecessarily large. If the size param
>> is
>> > > > 0 or greater
>> > > >     * than 16k, then the buffer is set to 16k. If the size param is
>> > > > smaller, then
>> > > >     * the buf will be set to the exact size.
>> > > >     * @param r the reader that provides the data
>> > > >     * @param the size of the data in the reader.
>> > > >     */
>> > > >    public void reset(Reader r, int size) {
>> > > >        if (size == 0 || size > 16384)
>> > > >            size = 16384;
>> > > >        zzBuffer = new char[size];
>> > > >        yyreset(r);
>> > > >    }
>> > > >
>> > > >
>> > > > So maybe there is a way to trim the zzBuffer this way (?).
>> > > >
>> > > > BTW, I will try to find out which is the "big token" in my dataset
>> > > this
>> > > > afternoon. Thanks for the help.
>> > > >
>> > > > I actually workaround this memory problem for the time being by
>> > > wrapping
>> > > > the
>> > > > IndexWriter in a class that periodically closes the IndexWriter and
>> > > creates
>> > > > a new one, allowing the old to be GCed, but I would be really good
>> if
>> > > > either
>> > > > JFlex or Lucene can take care of this zzBuffer going berserk.
>> > > >
>> > > >
>> > > > Again thanks for the quick response. /Rubén
>> > > >
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@
>> > > web38901.mail.mud.yahoo.com
>> > > >
>> > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com>
>> wrote:
>> > > >
>> > > > > If we could change the Flex file so that yyreset(Reader) would
>> > > check the
>> > > > > size of zzBuffer, we could trim it when it gets too big. But I
>> > > don't
>> > > > think
>> > > > > we have such control when writing the flex syntax ... yyreset is
>> > > > generated
>> > > > > by JFlex and that's the only place I can think of to trim the
>> > > buffer down
>> > > > > when it exceeds a predefined threshold ....
>> > > > >
>> > > > > Maybe what we can do is create our own method which will be called
>> > > by
>> > > > > StandardTokenizer after yyreset is called, something like
>> > > > > trimBufferIfTooBig(int threshold) which will reallocate zzBuffer
>> if
>> > > it
>> > > > > exceeded the threshold. We can decide on a reasonable 64K
>> threshold
>> > > or
>> > > > > something, or simply always cut back to 16 KB. As far as I
>> > > understand,
>> > > > that
>> > > > > buffer should never grow that much. I.e. in zzRefill, which is the
>> > > only
>> > > > > place where the buffer gets resized, there is an attempt to first
>> > > move
>> > > > back
>> > > > > characters that were already consumed and only then allocate a
>> > > bigger
>> > > > > buffer. Which means only if there is a token whose size is larger
>> > > than
>> > > > 16KB
>> > > > > (!?), will this buffer get expanded.
>> > > > >
>> > > > > A trimBuffer method might not be that bad .. as a protective
>> > > measure.
>> > > > What
>> > > > > do you think? Of course, JFlex can fix it on their own ... but
>> > > until that
>> > > > > happens ...
>> > > > >
>> > > > > Shai
>> > > > >
>> > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de>
>> > > wrote:
>> > > > >
>> > > > > > > I would like to identify also the problematic document I have
>> > > 10000
>> > > > so,
>> > > > > > > what
>> > > > > > > would be the best way of identifying the one that it making
>> > > zzBuffer
>> > > > to
>> > > > > > > grow
>> > > > > > > without control?
>> > > > > >
>> > > > > > Dont index your documents, but instead pass them directly to the
>> > > > analyzer
>> > > > > > and consume the tokenstream manually. Then visit
>> > > > > TermAttribute.termLength()
>> > > > > > for each Token.
>> > > > > >
>> > > > > >
>> > > > > >
>> -----------------------------------------------------------------
>> > > ----
>> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > > For additional commands, e-mail:
>> java-user-h...@lucene.apache.org
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > /Rubén
>> > > >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >
>> >
>>
>
>
>
> --
> /Rubén
>



-- 
/Rubén

Re: IndexWriter memory leak?

Reply via email to