tika-user  

Re: JavaHeapSpace - Parsing 4GB of data recursively

Daniel Knapp
Tue, 24 Nov 2009 05:52:24 -0800

On 24.11.2009 at 14:41, Jukka Zitting wrote:

> Hi,
> 
> On Tue, Nov 24, 2009 at 1:17 PM, Daniel Knapp
> <daniel.kn...@mni.fh-giessen.de> wrote:
>> I'm trying to parse about 4GB of data. With the following code it always
>> results in a Java heap space error. I think there must be a better way to do
>> this, but I don't know how.
>> Does anybody have a hint for me on how to solve this problem? I think
>> increasing the heap space in Eclipse should not be the solution.
>> [...]
>> StringWriter textBuffer = new StringWriter();
> 
> Instead of buffering the text in memory, you can stream it to a file
> or some other place. Where are you planning to put the parse
> result?

I want to send the results to a Solr server (the handler integrated into Solr
is not an option for me, because the files are on another server).

> 
> With Tika 0.5 you could do something as simple as this:
> 
>    import org.apache.tika.Tika;
> 
>    Reader reader = new Tika().parse(file);
> 
> You can then read the parse result incrementally from the reader
> object, or pass the reader for example to a Lucene Document for
> indexing.

I've read about that, but I don't know how to detect when the end of a file
has been reached, or how to merge the result with the related metadata.
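
On the end-of-file part: a java.io.Reader signals the end of its input by
returning -1 from read(), so the reader returned by parse() can be drained in
small chunks without ever holding the whole text in memory. Below is a minimal
sketch of that loop, not a tested Tika integration. The class name ReaderDrain
and the method drain are made up for illustration, and a StringReader stands in
for the reader that Tika would return. Merging the text with the metadata is
not covered here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, not part of Tika: copies a Reader to a Writer in chunks.
class ReaderDrain {

    // Copies all characters from in to out using a fixed-size buffer and
    // returns the number of characters copied.
    static long drain(Reader in, Writer out, int bufferSize) throws IOException {
        char[] buffer = new char[bufferSize];
        long total = 0;
        int n;
        // read() returns -1 once the end of the input has been reached.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use, "in" would be the Reader obtained from Tika
        // and "out" a FileWriter or another streaming destination.
        try (Reader in = new StringReader("extracted text");
             Writer out = new StringWriter()) {
            long copied = drain(in, out, 4);
            System.out.println(copied + " characters copied");
        }
    }
}
```

Only one buffer of bufferSize characters is held at a time, so memory use
stays constant regardless of how large the parsed file is, as long as the
destination writer does not buffer everything itself.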

> 
> BR,
> 
> Jukka Zitting
