But the Readers I'm talking about are not held by the Tokenizer (at least
not *only* by it); they are held by the DocFieldProcessorPerThread:

IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState ->
DocFieldProcessorPerThread -> DocFieldProcessorPerField -> Fieldable ->
Field (fieldsData)

And it's not only one Reader; there are several (one per thread, I suppose;
in my heap dump there are 25 Readers that should otherwise have been GCed).
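To make the retention concrete, here is a minimal sketch (hypothetical field
name and Reader; Lucene 3.0-era API) of how a Field built from a Reader keeps
that Reader strongly reachable through fieldsData:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReaderRetentionSketch {
    public static Document makeDoc() {
        // stand-in for a heavyweight Tika ParsingReader
        Reader parsingReader = new StringReader("...extracted text...");
        Document doc = new Document();
        // Field(String, Reader) stores the Reader itself in fieldsData, so
        // the Reader stays strongly reachable for as long as the Field does;
        // and the Field is what DocFieldProcessorPerField keeps per thread.
        doc.add(new Field("content", parsingReader));
        return doc;
    }
}

If the chain above keeps the Field alive after addDocument(), the Reader at
the end of that chain cannot be collected either.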
Best regards
/Ruben

On Thu, Apr 8, 2010 at 11:49 PM, Uwe Schindler <[email protected]> wrote:
> There is one possibility that could be fixed:
>
> As Tokenizers are reused, the analyzer holds a reference to the last used
> Reader. The easy fix would be to unset the Reader in Tokenizer.close(). If
> this is the case for you, that may be easy to do. Tokenizer.close() would
> then look like this:
>
> /** By default, closes the input Reader. */
> @Override
> public void close() throws IOException {
>   input.close();
>   input = null; // <-- new!
> }
>
> Any comments from other committers?
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
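For context, the reuse Uwe describes can be sketched roughly like this (a
simplified illustration assuming the Lucene 2.9/3.0-era reusableTokenStream
API; WhitespaceTokenizer stands in for any Tokenizer):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class ReuseSketchAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WhitespaceTokenizer(reader);
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer cached = (Tokenizer) getPreviousTokenStream();
        if (cached == null) {
            cached = new WhitespaceTokenizer(reader);
            setPreviousTokenStream(cached); // cached per thread, long-lived
        } else {
            cached.reset(reader); // only here is the previous Reader dropped
        }
        // Until the next reset(reader), the cached Tokenizer's input field
        // pins the last Reader; the close() fix above nulls it out instead.
        return cached;
    }
}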
> > -----Original Message-----
> > From: Ruben Laguna [mailto:[email protected]]
> > Sent: Thursday, April 08, 2010 2:50 PM
> > To: [email protected]
> > Subject: Re: IndexWriter memory leak?
> >
> > I will double-check the heapdump.hprof in the afternoon. But I think that
> > *some* Readers are indeed held by
> > docWriter.threadStates[0].consumer.fieldHash[1].fields[xxxx],
> > as shown in [1] (this heap dump contains only live objects). The heap
> > dump was taken after IndexWriter.commit() / IndexWriter.optimize(), and
> > all the Documents were already indexed and GCed (I will double-check).
> >
> > So that would mean that the Reader is retained in memory by the following
> > chain of references:
> >
> > DocumentsWriter -> DocumentsWriterThreadState -> DocFieldProcessorPerThread
> > -> DocFieldProcessorPerField -> Fieldable -> Field (fieldsData)
> >
> > I'll double-check with Eclipse MAT, as I said, that this chain is actually
> > made of hard references only (no SoftReferences, WeakReferences, etc.). I
> > will also double-check that there is no "live" Document that is
> > referencing the Reader via the Field.
> >
> > [1] http://img.skitch.com/20100407-b86irkp7e4uif2wq1dd4t899qb.jpg
> >
> > On Thu, Apr 8, 2010 at 2:16 PM, Uwe Schindler <[email protected]> wrote:
> > > Readers are not held. If you indexed the document and GCed the document
> > > instance, the Readers are gone.
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: [email protected]
> > >
> > > > -----Original Message-----
> > > > From: Ruben Laguna [mailto:[email protected]]
> > > > Sent: Thursday, April 08, 2010 1:28 PM
> > > > To: [email protected]
> > > > Subject: Re: IndexWriter memory leak?
> > > >
> > > > Now that the zzBuffer issue is solved...
> > > >
> > > > What about the references to the Readers held by docWriter? Tika's
> > > > ParsingReaders are quite heavyweight, so retaining those in memory
> > > > unnecessarily is also a "hidden" memory leak. Should I open a bug
> > > > report on that one?
> > > >
> > > > /Rubén
> > > >
> > > > On Thu, Apr 8, 2010 at 12:11 PM, Shai Erera <[email protected]> wrote:
> > > > > Guess we were replying at the same time :).
> > > > >
> > > > > On Thu, Apr 8, 2010 at 1:04 PM, Uwe Schindler <[email protected]> wrote:
> > > > > > I already answered that I will take care of this!
> > > > > >
> > > > > > Uwe
> > > > > >
> > > > > > -----
> > > > > > Uwe Schindler
> > > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > > http://www.thetaphi.de
> > > > > > eMail: [email protected]
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Shai Erera [mailto:[email protected]]
> > > > > > > Sent: Thursday, April 08, 2010 12:00 PM
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: IndexWriter memory leak?
> > > > > > >
> > > > > > > Yes, that's the trimBuffer version I was thinking about, only
> > > > > > > this guy created a reset(Reader, int) that does both ops
> > > > > > > (resetting + trimming) in one method call. More convenient. Can
> > > > > > > you please open an issue to track that? People will have a chance
> > > > > > > to comment on whether we (Lucene) should handle it, or whether it
> > > > > > > should be a JFlex fix. Based on the number of replies this guy
> > > > > > > received (0!), I doubt JFlex would consider it a problem. But we
> > > > > > > can do a small service to our user base by protecting against
> > > > > > > such problems.
> > > > > > >
> > > > > > > And while you're opening the issue, if you want to take a stab at
> > > > > > > fixing it and post a patch, it'd be great :).
> > > > > > >
> > > > > > > Shai
> > > > > > >
> > > > > > > On Thu, Apr 8, 2010 at 12:51 PM, Ruben Laguna <[email protected]> wrote:
> > > > > > > > I was investigating this a little further, and in the JFlex
> > > > > > > > mailing list I found [1].
> > > > > > > >
> > > > > > > > I don't know much about flex / JFlex, but it seems that this
> > > > > > > > guy resets the zzBuffer to 16384 chars or less when setting
> > > > > > > > the input for the lexer.
> > > > > > > >
> > > > > > > > Quoted from shef <she...@ya...>:
> > > > > > > >
> > > > > > > > I set
> > > > > > > >
> > > > > > > > %buffer 0
> > > > > > > >
> > > > > > > > in the options section, and then added this method to the lexer:
> > > > > > > >
> > > > > > > > /**
> > > > > > > >  * Set the input for the lexer. The size parameter really
> > > > > > > >  * speeds things up, because by default the lexer allocates an
> > > > > > > >  * internal buffer of 16k. For most strings, this is
> > > > > > > >  * unnecessarily large. If the size param is 0 or greater than
> > > > > > > >  * 16k, then the buffer is set to 16k. If the size param is
> > > > > > > >  * smaller, then the buffer will be set to the exact size.
> > > > > > > >  * @param r the reader that provides the data
> > > > > > > >  * @param size the size of the data in the reader.
> > > > > > > >  */
> > > > > > > > public void reset(Reader r, int size) {
> > > > > > > >     if (size == 0 || size > 16384)
> > > > > > > >         size = 16384;
> > > > > > > >     zzBuffer = new char[size];
> > > > > > > >     yyreset(r);
> > > > > > > > }
> > > > > > > >
> > > > > > > > So maybe there is a way to trim the zzBuffer this way (?).
> > > > > > > >
> > > > > > > > BTW, I will try to find out which is the "big token" in my
> > > > > > > > dataset this afternoon. Thanks for the help.
> > > > > > > >
> > > > > > > > I actually work around this memory problem for the time being
> > > > > > > > by wrapping the IndexWriter in a class that periodically closes
> > > > > > > > the IndexWriter and creates a new one, allowing the old one to
> > > > > > > > be GCed, but it would be really good if either JFlex or Lucene
> > > > > > > > could take care of this zzBuffer going berserk.
> > > > > > > >
> > > > > > > > Again, thanks for the quick response. /Rubén
> > > > > > > >
> > > > > > > > [1] https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com
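The periodic close-and-reopen workaround Ruben describes might look roughly
like this (a hypothetical wrapper class; Lucene 3.0-era IndexWriter API, and
the batch size is arbitrary):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class RecyclingIndexWriter {
    private final Directory dir;
    private final Analyzer analyzer;
    private final int docsPerWriter; // e.g. 1000, tune to taste
    private IndexWriter writer;
    private int count;

    public RecyclingIndexWriter(Directory dir, Analyzer analyzer,
                                int docsPerWriter) throws IOException {
        this.dir = dir;
        this.analyzer = analyzer;
        this.docsPerWriter = docsPerWriter;
        this.writer = new IndexWriter(dir, analyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public synchronized void addDocument(Document doc) throws IOException {
        writer.addDocument(doc);
        if (++count >= docsPerWriter) {
            writer.close(); // releases per-thread state and grown buffers
            writer = new IndexWriter(dir, analyzer,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            count = 0; // the old writer is now garbage-collectable
        }
    }

    public synchronized void close() throws IOException {
        writer.close();
    }
}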
> > > > > > > > On Thu, Apr 8, 2010 at 11:32 AM, Shai Erera <[email protected]> wrote:
> > > > > > > > > If we could change the flex file so that yyreset(Reader)
> > > > > > > > > would check the size of zzBuffer, we could trim it when it
> > > > > > > > > gets too big. But I don't think we have such control when
> > > > > > > > > writing the flex syntax ... yyreset is generated by JFlex,
> > > > > > > > > and that's the only place I can think of to trim the buffer
> > > > > > > > > down when it exceeds a predefined threshold.
> > > > > > > > >
> > > > > > > > > Maybe what we can do is create our own method, called by
> > > > > > > > > StandardTokenizer after yyreset, something like
> > > > > > > > > trimBufferIfTooBig(int threshold), which will reallocate
> > > > > > > > > zzBuffer if it exceeded the threshold. We can decide on a
> > > > > > > > > reasonable 64 KB threshold or something, or simply always cut
> > > > > > > > > back to 16 KB. As far as I understand, that buffer should
> > > > > > > > > never grow that much. I.e., in zzRefill, which is the only
> > > > > > > > > place where the buffer gets resized, there is an attempt to
> > > > > > > > > first move back characters that were already consumed, and
> > > > > > > > > only then allocate a bigger buffer. Which means the buffer
> > > > > > > > > gets expanded only if there is a token whose size is larger
> > > > > > > > > than 16 KB (!?).
> > > > > > > > >
> > > > > > > > > A trimBuffer method might not be that bad, as a protective
> > > > > > > > > measure. What do you think? Of course, JFlex can fix it on
> > > > > > > > > their own ... but until that happens ...
> > > > > > > > >
> > > > > > > > > Shai
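The method Shai has in mind might look roughly like this (a hypothetical
sketch, not existing Lucene or JFlex code; zzBuffer is the internal char
buffer of the JFlex-generated StandardTokenizerImpl):

// To be added to the generated scanner and called by StandardTokenizer
// right after yyreset(Reader); at that point no buffered characters need
// to be preserved, so the buffer can simply be replaced.
void trimBufferIfTooBig(int threshold) {
    if (zzBuffer.length > threshold) {
        zzBuffer = new char[threshold]; // e.g. cut back to 16 KB
    }
}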
> > > > > > > > > On Thu, Apr 8, 2010 at 10:35 AM, Uwe Schindler <[email protected]> wrote:
> > > > > > > > > > > I would like to identify the problematic document as
> > > > > > > > > > > well. I have 10,000, so what would be the best way of
> > > > > > > > > > > identifying the one that is making zzBuffer grow without
> > > > > > > > > > > control?
> > > > > > > > > >
> > > > > > > > > > Don't index your documents, but instead pass them directly
> > > > > > > > > > to the analyzer and consume the token stream manually. Then
> > > > > > > > > > visit TermAttribute.termLength() for each token.
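A minimal sketch of the check Uwe suggests above (assuming the Lucene
2.9/3.0-era attribute API; the class and method names here are made up):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLength {
    // Returns the longest token produced from this document's text, so the
    // documents that blow up zzBuffer can be identified without indexing.
    public static int of(Analyzer analyzer, String field, Reader reader)
            throws IOException {
        TokenStream ts = analyzer.tokenStream(field, reader);
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        int max = 0;
        while (ts.incrementToken()) {
            max = Math.max(max, term.termLength());
        }
        ts.close();
        return max;
    }
}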
--
/Rubén