tika-user  

Re: JavaHeapSpace - Parsing 4GB of data recursively

Daniel Knapp
Thu, 26 Nov 2009 04:08:35 -0800

>>> With Tika 0.5 you could do something as simple as this:
>>> 
>>>    import org.apache.tika.Tika;
>>> 
>>>    Reader reader = new Tika().parse(file);
>>> 
>>> You can then read the parse result incrementally from the reader
>>> object, or pass the reader for example to a Lucene Document for
>>> indexing.
>> 
>> I've read about that. But I don't know how to detect when the end of a
>> file is reached, or how to merge the result with the related Metadata.
> 
> You could also do the following:
> 
>    Metadata metadata = new Metadata();
>    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
>    Reader reader =
>        new Tika().parse(new FileInputStream(file), metadata);
> 
> Most of the extracted metadata will be available as soon as the
> parse() method returns so you don't need to wait until you've read the
> entire stream first.
> 
> The read() methods of the reader will return -1 when you've reached
> the end of the file. Note also that unlike with the Parser.parse()
> call, the InputStream you pass to Tika.parse() will get closed when
> you call the close() method on the returned Reader.

Okay, thanks for that hint. But how should I deal with the extracted content?
Currently I'm using the following code:

        Reader read = tik.parse(input, md);
        System.out.println(o + " - " + file.getAbsolutePath());
        System.out.println("Content-Type: " + md.get("Content-Type") + "\n");

        // Collect the extracted text line by line, then print it
        BufferedReader br = new BufferedReader(read);
        String tmp;
        StringBuilder sb = new StringBuilder();
        while ((tmp = br.readLine()) != null) {
            // readLine() strips the line terminator, so add it back
            sb.append(tmp).append('\n');
        }
        System.out.print(sb.toString());
        br.close();  // also closes the wrapped reader

But I think this can't be the best solution: memory usage keeps growing with
every file. Is my use case that exotic, or am I overlooking a detail?
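
A possible variation, as a rough sketch only: the StringBuilder above holds
the complete extracted text of a file before anything is printed, so copying
from the reader to the output in fixed-size chunks should keep the memory
needed per file roughly constant. StreamCopy and copy() are hypothetical
names, not part of Tika; the sketch assumes nothing beyond the standard
java.io Reader and Writer, and it relies on read() returning -1 at the end
of the stream as described in the quoted mail.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, not part of Tika: streams characters from a Reader
// to a Writer without ever holding the whole text in memory.
final class StreamCopy {
    private StreamCopy() {}

    // Copies everything from 'in' to 'out' through a fixed-size buffer
    // and returns the number of characters copied. Neither stream is
    // closed here; the caller stays responsible for that.
    static long copy(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192];
        long total = 0;
        int n;
        // read() returns -1 once the end of the stream is reached
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

With the code above this might be called as
StreamCopy.copy(read, new PrintWriter(System.out)), followed by read.close().
Whether this alone stops the memory growth is not verified here.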

> 
> BR,
> 
> Jukka Zitting
