tika-user  

Re: JavaHeapSpace - Parsing 4GB of data recursively

Jukka Zitting
Tue, 24 Nov 2009 06:08:31 -0800

Hi,

On Tue, Nov 24, 2009 at 2:51 PM, Daniel Knapp
<daniel.kn...@mni.fh-giessen.de> wrote:
> On 24.11.2009 at 14:41, Jukka Zitting wrote:
>> Instead of buffering the text in memory, you can stream it to a file
>> or some other place. Where are you planning to put the parse
>> result?
>
> I want to send the results to a Solr server (the integrated handler in Solr is
> not an option for me, since the files are on another server).

You should be able to stream the extracted text as a part of the
request that you post to Solr, but I'm not sure how easy that is, i.e.
whether for example SolrJ supports that. You may want to ask the
solr-user@ list about that.

>> With Tika 0.5 you could do something as simple as this:
>>
>>    import org.apache.tika.Tika;
>>
>>    Reader reader = new Tika().parse(file);
>>
>> You can then read the parse result incrementally from the reader
>> object, or pass the reader for example to a Lucene Document for
>> indexing.
>
> I've read about that. But I don't know how to check when the end of a
> file is reached, or how to merge the result with the related metadata.

You could also do the following:

    import org.apache.tika.metadata.Metadata;

    // Pass the file name along as a hint for type detection
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    Reader reader =
        new Tika().parse(new FileInputStream(file), metadata);

Most of the extracted metadata will be available as soon as the
parse() method returns so you don't need to wait until you've read the
entire stream first.

The read() methods of the reader will return -1 when you've reached
the end of the extracted text. Note also that unlike with the
Parser.parse() call, the InputStream you pass to Tika.parse() will get
closed when you call the close() method on the returned Reader.
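
The read loop could look roughly like the sketch below. It only uses
java.io, so a StringReader stands in for the Reader returned by
Tika.parse(), and the ReaderDrain class and its drain() method are
names made up for this illustration, not part of Tika. The Writer can
be a FileWriter or anything else that doesn't keep the text in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class ReaderDrain {

    // Copies everything from the reader to the writer in fixed-size
    // chunks, so only one buffer of text is held in memory at a time.
    // Returns the number of characters copied.
    static long drain(Reader reader, Writer writer) throws IOException {
        char[] buffer = new char[8192];
        long total = 0;
        // try-with-resources closes the reader when the loop is done;
        // for a Reader from Tika.parse() that also closes the stream.
        try (Reader in = reader) {
            int n;
            // read() returns -1 once the end of the text is reached
            while ((n = in.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
                total += n;
            }
        }
        writer.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for: new Tika().parse(new FileInputStream(file), metadata)
        Reader reader = new StringReader("extracted text");
        StringWriter out = new StringWriter();
        long copied = drain(reader, out);
        System.out.println(copied + " chars: " + out);
    }
}
```

Since the metadata is mostly filled in once parse() returns, you can
send it along before or after draining the reader, whichever suits the
receiving end better.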

BR,

Jukka Zitting