Hi All,
I am using Lucene in an embedded environment and I need to keep memory
use under control. While investigating a problem with big PDF files (a few
MB), I noticed that Parser.parse takes an InputStream as a parameter, but
PDFParser then has the following code:

TikaInputStream tstream = TikaInputStream.cast(stream);
if (tstream != null && tstream.hasFile()) {
    // File based, take that as a cue to use a temporary file
    RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile);
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
    }
} else {
    // Go for the normal, stream based in-memory parsing
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer());
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
    }
}

I am not sure tstream.hasFile() can ever be true; from my understanding of
the code it can only be false. The "else" branch therefore runs and the
stream is handled in memory. I suspect this means the stream (or a good
part of it) is read into memory somewhere along the way, potentially using
a lot of memory.

I then tried a different approach, adding a version of parse() that
accepts a File instead of a stream. The code above then becomes:

TikaInputStream tstream = TikaInputStream.get(file);
if (tstream != null && tstream.hasFile()) {
    // File based, take that as a cue to use a temporary file
    RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(tstream), scratchFile);
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(tstream), scratchFile, true);
    }
} else {
    // Go for the normal, stream based in-memory parsing
    if (localConfig.getUseNonSequentialParser() == true) {
        pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(tstream), new RandomAccessBuffer());
    } else {
        pdfDocument = PDDocument.load(new CloseShieldInputStream(tstream), true);
    }
}

(but do we really need the && in the if?)

This is much friendlier on memory: with the first version of the method I
could not parse a 4.3 MB file with the JVM limited to 16 MB, while the
second approach parsed it successfully.
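
For callers that only have an InputStream, one possible workaround is to
spool the stream to a temporary file first and then take the file-based
path (for example by passing the file to TikaInputStream.get(file) as in
the second snippet). The sketch below uses only the JDK; the class name
SpoolToTempFile and the method spool() are hypothetical names of mine, not
part of Tika or PDFBox, and I have not checked how this interacts with
Tika's own temporary file handling.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of Tika): copies a stream to disk so that
// a file-based parsing path can be used instead of an in-memory one.
public class SpoolToTempFile {

    /**
     * Copies the whole stream into a new temporary file and returns its path.
     * Files.copy works through a small internal buffer, so heap use does not
     * grow with the size of the input. The caller must delete the file.
     */
    public static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".pdf");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real PDF stream: 64 KiB of zero bytes.
        byte[] data = new byte[64 * 1024];
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            System.out.println("Spooled " + Files.size(tmp) + " bytes to " + tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

This trades heap for disk I/O, so it only helps where temporary storage
is available and cheaper than memory.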

What do you think about extending the Parser interface accordingly? Would
you be interested in a patch that does it?

Ste
