: Setting writer.setMaxFieldLength(5000) (default is 10000)
: seems to eliminate the risk for an OutOfMemoryError,

that's because it now gives up after parsing 5000 tokens.
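
For reference, here's a bare-bones sketch of the setup as i understand it
from your description (Lucene 2.x-era API; the index path, analyzer, and
file name are just placeholders i made up):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MaxFieldLengthSketch {
      public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
        // indexing of any one field stops after the first 5000 tokens;
        // tokens past that point are ignored
        writer.setMaxFieldLength(5000);

        Document doc = new Document();
        doc.add(new Field("content",
            new InputStreamReader(new FileInputStream("big.txt"),
                                  "ISO-8859-1")));
        writer.addDocument(doc);
        writer.close();
      }
    }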

: To me, it appears that simply calling
:    new Field("content", new InputStreamReader(in, "ISO-8859-1"))
: on a plain text file causes Lucene to buffer it *all*.

Looking at this purely from an outside-in perspective: how could that
be true?  If it were, then why would calling setMaxFieldLength(5000) solve
your problem -- limiting the number of tokens wouldn't matter if the
problem occurred because Lucene was buffering the entire Reader.
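
One cheap way to sanity check that assumption would be to wrap your Reader
in a counting wrapper and see how many characters Lucene has actually
pulled off of it before things blow up.  A rough sketch (the class name is
mine):

    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;

    // counts how many chars have been pulled through the Reader, so you
    // can see whether Lucene really reads the whole file up front
    public class CountingReader extends FilterReader {
      private long count = 0;

      public CountingReader(Reader in) { super(in); }

      public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) count += n;
        return n;
      }

      public int read() throws IOException {
        int c = super.read();
        if (c >= 0) count++;
        return c;
      }

      public long getCount() { return count; }
    }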


It definitely seems like there is some room for improvement here ... it
sounds almost like maybe there is a [HAND WAVEY AIR QUOTES] memory/object
leakish [/HAND WAVEY AIR QUOTES] situation where even after a Token is
read off the TokenStream, the Token isn't being GCed.

Per: perhaps you could open a Jira issue and attach a unit test
demonstrating the problem?  Maybe something with an artificial Reader
that just churns out a repeating sequence of characters forever?
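
Something along these lines maybe (totally untested sketch, written as a
standalone main() rather than a proper test case; class names are made up,
and setMaxFieldLength is bumped way up so the token cap doesn't kick in
first):

    import java.io.Reader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // a Reader that never runs dry: it just repeats "lorem ipsum " forever
    class EndlessReader extends Reader {
      private final char[] words = "lorem ipsum ".toCharArray();
      private int pos = 0;

      public int read(char[] cbuf, int off, int len) {
        for (int i = 0; i < len; i++) {
          cbuf[off + i] = words[pos];
          pos = (pos + 1) % words.length;
        }
        return len;  // never returns -1
      }

      public void close() {}
    }

    public class EndlessReaderDemo {
      public static void main(String[] args) throws Exception {
        IndexWriter writer =
            new IndexWriter("/tmp/endless-index", new StandardAnalyzer(),
                            true);
        // lift the token cap so it can't mask the memory behavior
        writer.setMaxFieldLength(Integer.MAX_VALUE);

        Document doc = new Document();
        doc.add(new Field("content", new EndlessReader()));
        // watch memory usage while this runs -- with an endless Reader it
        // will keep going until it hits OOM or you kill it
        writer.addDocument(doc);
        writer.close();
      }
    }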




-Hoss

