And by the way, when is Lucene 3.1 coming?

On Thu, Apr 8, 2010 at 1:27 PM, Ruben Laguna <ruben.lag...@gmail.com> wrote:
> Now that the zzBuffer issue is solved...
>
> what about the references to the Readers held by docWriter? Tika's
> ParsingReaders are quite heavyweight, so retaining those in memory
> unnecessarily is also a "hidden" memory leak. Should I open a bug report on
> that one?
>
> /Rubén
>
> On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <ser...@gmail.com> wrote:
>
>> Guess we were replying at the same time :).
>>
>> On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>
>> > I already answered that I will take care of this!
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> >
>> > > -----Original Message-----
>> > > From: Shai Erera [mailto:ser...@gmail.com]
>> > > Sent: Thursday, April 08, 2010 12:00 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Re: IndexWriter memory leak?
>> > >
>> > > Yes, that's the trimBuffer version I was thinking about, only this guy
>> > > created a reset(Reader, int) and does both ops (resetting + trim) in
>> > > one method call. More convenient. Can you please open an issue to
>> > > track that? People will have a chance to comment on whether we
>> > > (Lucene) should handle that, or it should be a JFlex fix. Based on the
>> > > number of replies this guy received (0 !), I doubt JFlex would
>> > > consider it a problem. But we can do some small service to our user
>> > > base by protecting against such problems.
>> > >
>> > > And while you're opening the issue, if you want to take a stab at
>> > > fixing it and post a patch, it'd be great :).
>> > >
>> > > Shai
>> > >
>> > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna <ruben.lag...@gmail.com> wrote:
>> > >
>> > > > I was investigating this a little further and in the JFlex mailing
>> > > > list I found [1]
>> > > >
>> > > > I don't know much about flex / JFlex but it seems that this guy
>> > > > resets the zzBuffer to 16384 or less when setting the input for the
>> > > > lexer
>> > > >
>> > > >
>> > > > Quoted from shef <she...@ya...>
>> > > >
>> > > >
>> > > > I set
>> > > >
>> > > > %buffer 0
>> > > >
>> > > > in the options section, and then added this method to the lexer:
>> > > >
>> > > >     /**
>> > > >      * Set the input for the lexer. The size parameter really speeds
>> > > >      * things up, because by default, the lexer allocates an internal
>> > > >      * buffer of 16k. For most strings, this is unnecessarily large.
>> > > >      * If the size param is 0 or greater than 16k, then the buffer is
>> > > >      * set to 16k. If the size param is smaller, then the buf will be
>> > > >      * set to the exact size.
>> > > >      * @param r the reader that provides the data
>> > > >      * @param size the size of the data in the reader.
>> > > >      */
>> > > >     public void reset(Reader r, int size) {
>> > > >         if (size == 0 || size > 16384)
>> > > >             size = 16384;
>> > > >         zzBuffer = new char[size];
>> > > >         yyreset(r);
>> > > >     }
>> > > >
>> > > >
>> > > > So maybe there is a way to trim the zzBuffer this way (?).
>> > > >
>> > > > BTW, I will try to find out which is the "big token" in my dataset
>> > > > this afternoon. Thanks for the help.
>> > > >
>> > > > I actually work around this memory problem for the time being by
>> > > > wrapping the IndexWriter in a class that periodically closes the
>> > > > IndexWriter and creates a new one, allowing the old one to be GCed,
>> > > > but it would be really good if either JFlex or Lucene could take
>> > > > care of this zzBuffer going berserk.
>> > > >
>> > > >
>> > > > Again thanks for the quick response. /Rubén
>> > > >
>> > > >
>> > > > [1]
>> > > > https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
>> > > >
>> > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com> wrote:
>> > > >
>> > > > > If we could change the Flex file so that yyreset(Reader) would
>> > > > > check the size of zzBuffer, we could trim it when it gets too
>> > > > > big. But I don't think we have such control when writing the flex
>> > > > > syntax ... yyreset is generated by JFlex and that's the only
>> > > > > place I can think of to trim the buffer down when it exceeds a
>> > > > > predefined threshold ....
>> > > > >
>> > > > > Maybe what we can do is create our own method which will be
>> > > > > called by StandardTokenizer after yyreset is called, something
>> > > > > like trimBufferIfTooBig(int threshold) which will reallocate
>> > > > > zzBuffer if it exceeded the threshold. We can decide on a
>> > > > > reasonable 64K threshold or something, or simply always cut back
>> > > > > to 16 KB. As far as I understand, that buffer should never grow
>> > > > > that much. I.e. in zzRefill, which is the only place where the
>> > > > > buffer gets resized, there is an attempt to first move back
>> > > > > characters that were already consumed and only then allocate a
>> > > > > bigger buffer. Which means only if there is a token whose size is
>> > > > > larger than 16KB (!?), will this buffer get expanded.
>> > > > >
>> > > > > A trimBuffer method might not be that bad .. as a protective
>> > > > > measure. What do you think? Of course, JFlex can fix it on their
>> > > > > own ... but until that happens ...
>> > > > >
>> > > > > Shai
>> > > > >
>> > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de> wrote:
>> > > > >
>> > > > > > > I would also like to identify the problematic document. I
>> > > > > > > have 10000, so what would be the best way of identifying the
>> > > > > > > one that is making zzBuffer grow without control?
>> > > > > >
>> > > > > > Don't index your documents, but instead pass them directly to
>> > > > > > the analyzer and consume the tokenstream manually. Then visit
>> > > > > > TermAttribute.termLength() for each Token.
>> > > > > >
>> > > > > >
>> > > > > > ---------------------------------------------------------------------
>> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
>> > > > > >
>> > > > >
>> > > >
>> > > > --
>> > > > /Rubén
>> > > >
>> >
>
> --
> /Rubén
>

--
/Rubén
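
For reference, the trimBufferIfTooBig(int threshold) idea from the thread could look roughly like the sketch below. This is only a stand-alone illustration, not Lucene or JFlex code: TrimmableScanner, simulateLargeToken, reset(Reader, int) and bufferSize are hypothetical names, the growth step is a simplified stand-in for what zzRefill does, and only zzBuffer and the 16 KB / 64K figures come from the messages above.

```java
import java.io.Reader;

/**
 * Hypothetical stand-in for a JFlex-generated scanner. Only the buffer
 * handling is modeled; nothing here is actual Lucene or JFlex API.
 */
class TrimmableScanner {
    /** 16 KB, the default buffer size mentioned in the thread. */
    static final int INITIAL_SIZE = 16 * 1024;

    private char[] zzBuffer = new char[INITIAL_SIZE];
    private Reader zzReader;

    /**
     * Simplified stand-in for zzRefill: when a single token does not fit,
     * the buffer capacity is doubled until it does.
     */
    void simulateLargeToken(int tokenLength) {
        int newSize = zzBuffer.length;
        while (newSize < tokenLength) {
            newSize *= 2;
        }
        if (newSize != zzBuffer.length) {
            zzBuffer = new char[newSize];
        }
    }

    /**
     * Switch to a new input, then shrink the buffer if a previous document
     * made it grow past the threshold. Trimming here is safe because the
     * old buffer contents are discarded on reset anyway.
     */
    void reset(Reader reader, int threshold) {
        zzReader = reader; // stands in for the generated yyreset(Reader)
        trimBufferIfTooBig(threshold);
    }

    /** Reallocate the buffer at its initial size if it exceeds threshold. */
    void trimBufferIfTooBig(int threshold) {
        if (zzBuffer.length > threshold) {
            zzBuffer = new char[INITIAL_SIZE];
        }
    }

    int bufferSize() {
        return zzBuffer.length;
    }
}
```

With a 64K threshold, a buffer that grew to hold a one-million-character token would be cut back to 16 KB on the next reset, while a buffer that only grew to 32 KB would be left alone.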