Re: Possible to read text without loading PDF to ram?

Andreas Lehmkühler Mon, 19 Oct 2009 23:24:57 -0700

Hi,

Christian Ortolf wrote:
> Hello,
> 
> is there any way to read the text in a PDF without loading
> the whole document into RAM?
There is no dedicated option for that, but there may be a workaround.
Try extracting the text one page at a time, so that each step needs
fewer resources than extracting the whole document at once.
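
A rough Java sketch of the page-by-page idea, to show the shape of the
loop rather than a finished solution. PageSource and extractTo are
hypothetical names made up for this example and are not part of PDFBox.
With PDFBox, textOfPage could presumably be backed by a PDFTextStripper
on which setStartPage(n) and setEndPage(n) are set before calling
getText(document). Each page's text is written out right away, so only
one page of extracted text is held at a time. How much memory the parser
itself needs for the document is a separate matter that this does not
address.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class PageByPageExtraction {

    /** Hypothetical abstraction: returns the text of one page (1-based). */
    public interface PageSource {
        int pageCount();
        String textOfPage(int page) throws IOException;
    }

    /**
     * Writes the text of each page to out, one page at a time, and
     * returns the total number of characters written.
     */
    public static long extractTo(PageSource source, Writer out) throws IOException {
        long written = 0;
        for (int page = 1; page <= source.pageCount(); page++) {
            // Only the text of the current page is referenced here.
            String text = source.textOfPage(page);
            out.write(text);
            written += text.length();
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stub standing in for a real PDF; replace with a PDFBox-backed source.
        final String[] fakePages = { "first page\n", "second page\n" };
        PageSource stub = new PageSource() {
            public int pageCount() { return fakePages.length; }
            public String textOfPage(int page) { return fakePages[page - 1]; }
        };
        StringWriter out = new StringWriter();
        long n = extractTo(stub, out);
        System.out.println(n + " characters written");
        System.out.print(out);
    }
}
```

In real use the Writer would be a file or some other sink that does not
accumulate everything in memory, unlike the StringWriter used for the
demonstration.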


> I have the problem that some documents cause OutOfMemory errors, and
> increasing the heap size is not an option...
Hmmm, on the other hand, there could be an issue in PDFBox itself. Could
you provide us with a sample document that fails with an
OutOfMemoryError? If so, please create an issue in JIRA [1] and attach
the PDF to it.

> So would it somehow be possible to read in the text of a PDF either
> sequentially, or maybe load the PDF without images so that the size
> would be restricted?
During text extraction, all operators that aren't needed for the
extraction itself should be skipped. See [2] for details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
[2] http://svn.apache.org/viewvc/incubator/pdfbox/trunk/src/main/resources/Resources/PDFTextStripper.properties?view=log
