Thank you for isolating & raising the issue!! Mike
On Sat, Apr 10, 2010 at 1:28 AM, Ruben Laguna <ruben.lag...@gmail.com> wrote:
> I just tried the changes that you committed; it works beautifully. The
> readers are GCed. Thanks for both LUCENE-2387 and LUCENE-2384. Those make
> a big difference in my app!
>
> On Fri, Apr 9, 2010 at 12:32 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> I agree IW should not hold refs to the Field instances from the last
>> doc indexed... I put a patch on LUCENE-2387 to null the reference as
>> we go. Can you confirm this lets GC reclaim?
>>
>> Mike
>>
>> On Fri, Apr 9, 2010 at 12:54 AM, Ruben Laguna <ruben.lag...@gmail.com>
>> wrote:
>>> But the Readers I'm talking about are not held by the Tokenizer (at
>>> least not *only* by it); they are held by the
>>> DocFieldProcessorPerThread...
>>>
>>> IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
>>> DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
>>> Field (fieldsData)
>>>
>>> And it's not only one Reader; there are several (one per thread, I
>>> suppose; in my heap dump there are 25 Readers that should have been
>>> GCed otherwise).
>>>
>>> Best regards/Ruben
>>>
>>> On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>
>>>> There is one possibility that could be fixed:
>>>>
>>>> As Tokenizers are reused, the analyzer holds a reference to the last
>>>> used Reader. The easy fix would be to unset the Reader in
>>>> Tokenizer.close(). If this is the case for you, that may be easy to
>>>> do. So Tokenizer.close() looks like this:
>>>>
>>>> /** By default, closes the input Reader. */
>>>> @Override
>>>> public void close() throws IOException {
>>>>   input.close();
>>>>   input = null; // <-- new!
>>>> }
>>>>
>>>> Any comments from other committers?
>>>>
>>>> -----
>>>> Uwe Schindler
>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>> http://www.thetaphi.de
>>>> eMail: u...@thetaphi.de
>>>>
>>>>> -----Original Message-----
>>>>> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
>>>>> Sent: Thursday, April 08, 2010 2:50 PM
>>>>> To: java-u...@lucene.apache.org
>>>>> Subject: Re: IndexWriter memory leak?
>>>>>
>>>>> I will double check the heapdump.hprof in the afternoon. But I think
>>>>> that *some* readers are indeed held by
>>>>> docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
>>>>> as shown in [1] (this heap dump contains only live objects). The heap
>>>>> dump was taken after IndexWriter.commit()/IndexWriter.optimize(), and
>>>>> all the Documents were already indexed and GCed (I will double check).
>>>>>
>>>>> So that would mean that the Reader is retained in memory by the
>>>>> following chain of references:
>>>>>
>>>>> DocumentsWriter -> DocumentsWriterThreadState ->
>>>>> DocFieldProcessorPerThread -> DocFieldProcessorPerField ->
>>>>> Fieldable -> Field (fieldsData)
>>>>>
>>>>> I'll double check with Eclipse MAT, as I said, that this chain is
>>>>> actually made of hard references only (no SoftReferences,
>>>>> WeakReferences, etc.). I will also double check that there is no
>>>>> "live" Document that is referencing the Reader via the Field.
>>>>>
>>>>> [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
>>>>>
>>>>> On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <u...@thetaphi.de> wrote:
>>>>>
>>>>>> Readers are not held. If you indexed the document and GCed the
>>>>>> document instance, the readers are gone.
>>>>>>
>>>>>> -----
>>>>>> Uwe Schindler
>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>> http://www.thetaphi.de
>>>>>> eMail: u...@thetaphi.de
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Ruben Laguna [mailto:ruben.lag...@gmail.com]
>>>>>>> Sent: Thursday, April 08, 2010 1:28 PM
>>>>>>> To: java-u...@lucene.apache.org
>>>>>>> Subject: Re: IndexWriter memory leak?
>>>>>>>
>>>>>>> Now that the zzBuffer issue is solved...
>>>>>>>
>>>>>>> What about the references to the Readers held by docWriter? Tika's
>>>>>>> ParsingReaders are quite heavyweight, so retaining those in memory
>>>>>>> unnecessarily is also a "hidden" memory leak. Should I open a bug
>>>>>>> report on that one?
>>>>>>>
>>>>>>> /Rubén
>>>>>>>
>>>>>>> On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <ser...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Guess we were replying at the same time :).
>>>>>>>>
>>>>>>>> On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <u...@thetaphi.de>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I already answered that I will take care of this!
>>>>>>>>>
>>>>>>>>> Uwe
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> Uwe Schindler
>>>>>>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>>>>>>>> http://www.thetaphi.de
>>>>>>>>> eMail: u...@thetaphi.de
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Shai Erera [mailto:ser...@gmail.com]
>>>>>>>>>> Sent: Thursday, April 08, 2010 12:00 PM
>>>>>>>>>> To: java-u...@lucene.apache.org
>>>>>>>>>> Subject: Re: IndexWriter memory leak?
>>>>>>>>>>
>>>>>>>>>> Yes, that's the trimBuffer version I was thinking about, only
>>>>>>>>>> this guy created a reset(Reader, int) and does both ops
>>>>>>>>>> (resetting + trim) in one method call. More convenient. Can you
>>>>>>>>>> please open an issue to track that? People will have a chance to
>>>>>>>>>> comment on whether we (Lucene) should handle that, or whether it
>>>>>>>>>> should be a JFlex fix. Based on the number of replies this guy
>>>>>>>>>> received (0!), I doubt JFlex would consider it a problem. But we
>>>>>>>>>> can do some small service to our user base by protecting against
>>>>>>>>>> such problems.
>>>>>>>>>>
>>>>>>>>>> And while you're opening the issue, if you want to take a stab
>>>>>>>>>> at fixing it and post a patch, it'd be great :).
>>>>>>>>>>
>>>>>>>>>> Shai
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna
>>>>>>>>>> <ruben.lag...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I was investigating this a little further, and in the JFlex
>>>>>>>>>>> mailing list I found [1].
>>>>>>>>>>>
>>>>>>>>>>> I don't know much about flex/JFlex, but it seems that this guy
>>>>>>>>>>> resets the zzBuffer to 16384 chars or less when setting the
>>>>>>>>>>> input for the lexer.
>>>>>>>>>>>
>>>>>>>>>>> Quoted from shef <she...@ya...>:
>>>>>>>>>>>
>>>>>>>>>>> I set
>>>>>>>>>>>
>>>>>>>>>>> %buffer 0
>>>>>>>>>>>
>>>>>>>>>>> in the options section, and then added this method to the lexer:
>>>>>>>>>>>
>>>>>>>>>>> /**
>>>>>>>>>>>  * Set the input for the lexer. The size parameter really
>>>>>>>>>>>  * speeds things up, because by default the lexer allocates an
>>>>>>>>>>>  * internal buffer of 16k. For most strings, this is
>>>>>>>>>>>  * unnecessarily large. If the size param is 0 or greater than
>>>>>>>>>>>  * 16k, then the buffer is set to 16k. If the size param is
>>>>>>>>>>>  * smaller, then the buf will be set to the exact size.
>>>>>>>>>>>  * @param r the reader that provides the data
>>>>>>>>>>>  * @param size the size of the data in the reader
>>>>>>>>>>>  */
>>>>>>>>>>> public void reset(Reader r, int size) {
>>>>>>>>>>>   if (size == 0 || size > 16384)
>>>>>>>>>>>     size = 16384;
>>>>>>>>>>>   zzBuffer = new char[size];
>>>>>>>>>>>   yyreset(r);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> So maybe there is a way to trim the zzBuffer this way (?).
>>>>>>>>>>>
>>>>>>>>>>> BTW, I will try to find out which is the "big token" in my
>>>>>>>>>>> dataset this afternoon. Thanks for the help.
>>>>>>>>>>>
>>>>>>>>>>> I actually worked around this memory problem for the time being
>>>>>>>>>>> by wrapping the IndexWriter in a class that periodically closes
>>>>>>>>>>> the IndexWriter and creates a new one, allowing the old one to
>>>>>>>>>>> be GCed, but it would be really good if either JFlex or Lucene
>>>>>>>>>>> could take care of this zzBuffer going berserk.
>>>>>>>>>>>
>>>>>>>>>>> Again, thanks for the quick response. /Rubén
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <ser...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If we could change the Flex file so that yyreset(Reader) would
>>>>>>>>>>>> check the size of zzBuffer, we could trim it when it gets too
>>>>>>>>>>>> big.
>>>>>>>>>>>> But I don't think we have such control when writing the flex
>>>>>>>>>>>> syntax... yyreset is generated by JFlex, and that's the only
>>>>>>>>>>>> place I can think of to trim the buffer down when it exceeds a
>>>>>>>>>>>> predefined threshold...
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe what we can do is create our own method, which will be
>>>>>>>>>>>> called by StandardTokenizer after yyreset is called, something
>>>>>>>>>>>> like trimBufferIfTooBig(int threshold), which will reallocate
>>>>>>>>>>>> zzBuffer if it exceeded the threshold. We can decide on a
>>>>>>>>>>>> reasonable 64 KB threshold or something, or simply always cut
>>>>>>>>>>>> back to 16 KB. As far as I understand, that buffer should never
>>>>>>>>>>>> grow that much. I.e., in zzRefill, which is the only place
>>>>>>>>>>>> where the buffer gets resized, there is an attempt to first
>>>>>>>>>>>> move back characters that were already consumed, and only then
>>>>>>>>>>>> allocate a bigger buffer. Which means the buffer gets expanded
>>>>>>>>>>>> only if there is a token whose size is larger than 16 KB (!?).
>>>>>>>>>>>>
>>>>>>>>>>>> A trimBuffer method might not be that bad... as a protective
>>>>>>>>>>>> measure. What do you think?
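[Editorial note: the trimBufferIfTooBig idea discussed above could look
roughly like the sketch below. The class name, the growFor stand-in for
zzRefill's doubling behavior, and the 16 KB constant are assumptions modeled
on JFlex-generated scanners; this is not actual Lucene or JFlex code.]

```java
// Sketch of the trimBufferIfTooBig idea from the discussion above.
// ZZ_BUFFERSIZE, zzBuffer, and the method names are assumptions modeled
// on JFlex-generated scanners, not actual Lucene/JFlex code.
public class ScannerBufferTrimSketch {
  private static final int ZZ_BUFFERSIZE = 16384; // JFlex default buffer size
  private char[] zzBuffer = new char[ZZ_BUFFERSIZE];

  /** Shrink zzBuffer back to the default size if a huge token grew it. */
  public void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
      zzBuffer = new char[ZZ_BUFFERSIZE];
    }
  }

  public int bufferLength() {
    return zzBuffer.length;
  }

  /** Simulates zzRefill doubling the buffer for an oversized token. */
  public void growFor(int tokenLength) {
    while (zzBuffer.length < tokenLength) {
      char[] newBuffer = new char[zzBuffer.length * 2];
      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
      zzBuffer = newBuffer;
    }
  }
}
```

Called once per reset (after yyreset), this keeps a single pathological
document from pinning a multi-megabyte buffer for the lifetime of the
reused tokenizer.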
>>>>>>>>>>>> Of course, JFlex can fix it on their own... but until that
>>>>>>>>>>>> happens...
>>>>>>>>>>>>
>>>>>>>>>>>> Shai
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <u...@thetaphi.de>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to identify also the problematic document. I
>>>>>>>>>>>>>> have 10000, so what would be the best way of identifying the
>>>>>>>>>>>>>> one that is making zzBuffer grow without control?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Don't index your documents, but instead pass them directly to
>>>>>>>>>>>>> the analyzer and consume the token stream manually. Then visit
>>>>>>>>>>>>> TermAttribute.termLength() for each Token.
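[Editorial note: Uwe's suggestion above can be sketched as follows, assuming
the Lucene 2.9/3.0-era analysis API that the TermAttribute.termLength() call
implies. WhitespaceAnalyzer, the field name "content", and the helper class
are placeholders; substitute whatever analyzer your IndexWriter actually
uses.]

```java
// Sketch: instead of indexing, run each document's text through the
// analyzer yourself and record the longest token it produces. The
// document whose max token length approaches zzBuffer's size is the
// one blowing up the buffer.
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLength {
  /** Returns the length of the longest token the analyzer produces. */
  public static int maxTokenLength(Analyzer analyzer, String text)
      throws IOException {
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    int max = 0;
    while (ts.incrementToken()) {
      max = Math.max(max, termAtt.termLength());
    }
    ts.close();
    return max;
  }

  public static void main(String[] args) throws IOException {
    // Run this over each of the 10000 documents and print the worst case.
    Analyzer analyzer = new WhitespaceAnalyzer();
    System.out.println(maxTokenLength(analyzer, "a bb ccc"));
  }
}
```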
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> /Rubén
>>>>>>>
>>>>>>> --
>>>>>>> /Rubén
>>>>>
>>>>> --
>>>>> /Rubén
>>>
>>> --
>>> /Rubén
>
> --
> /Rubén

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org