tika-user  

Re: JavaHeapSpace - Parsing 4GB of data recursively

Jukka Zitting
Tue, 24 Nov 2009 05:42:42 -0800

Hi,

On Tue, Nov 24, 2009 at 1:17 PM, Daniel Knapp
<daniel.kn...@mni.fh-giessen.de> wrote:
> i'm trying to parse about 4GB of data. With the following code it always
> results in an JavaHeapSpace Error. I think there must be a better way to do
> this, but i don't know how.
> Has anybody a hint for me how to solve this problem? I think increasing the
> HeapSpace in Eclipse should not be the solution.
> [...]
> StringWriter textBuffer = new StringWriter();

Instead of buffering the text in memory, you can stream it to a file
or some other place. Where are you planning to put the parse
result?

With Tika 0.5 you could do something as simple as this:

    import java.io.Reader;
    import org.apache.tika.Tika;

    // parse() returns a Reader over the extracted text
    Reader reader = new Tika().parse(file);

You can then read the parse result incrementally from the reader
object, or pass the reader for example to a Lucene Document for
indexing.
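
As a rough sketch of the "stream it to a file" option: the reader can
be drained through a small fixed-size buffer, so the full text never
sits in memory at once. The class name ReaderToFile, the copy() helper
and the temp file are made up for illustration, and the StringReader
only stands in for the reader you would get from Tika. It assumes
Java 7 or later for try-with-resources and java.nio.file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika.
class ReaderToFile {

    // Copies everything from the reader to the writer through a
    // fixed-size buffer, so memory use does not grow with the input.
    // Returns the number of characters copied.
    static long copy(Reader reader, Writer writer) throws IOException {
        char[] buffer = new char[8192];
        long total = 0;
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Tika reader. With Tika on the classpath this
        // would be: Reader reader = new Tika().parse(file);
        Path out = Files.createTempFile("extracted", ".txt");
        try (Reader reader = new StringReader("extracted text");
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            long chars = copy(reader, writer);
            System.out.println("Wrote " + chars + " characters to " + out);
        }
    }
}
```

When parsing many files recursively, the same copy() call could be
repeated per file, closing each reader before opening the next one.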

BR,

Jukka Zitting