Hi,

Christian Ortolf wrote:
> Hello,
>
> is there any possibility to read the text in a pdf without loading
> the whole document to RAM?

There is no special option for that, but there may be a workaround: try extracting one page after the other, so that each step needs fewer resources.

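To illustrate the page-by-page idea, here is a minimal sketch. It is not tested against a real PDF, and the names `PageWiseExtraction`, `PageTextSource` and `extractPageByPage` are made up for this example; they are not PDFBox classes. The assumption is that the per-page text comes from a `PDFTextStripper` restricted to a single page via `setStartPage`/`setEndPage`, as noted in the comment. Each page's text is written out immediately instead of being collected in one big string.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class PageWiseExtraction {

    /**
     * Hypothetical abstraction, NOT a PDFBox type. With PDFBox, textOfPage(n)
     * could presumably be backed by a PDFTextStripper limited to one page:
     *   stripper.setStartPage(n);
     *   stripper.setEndPage(n);
     *   return stripper.getText(document);
     */
    public interface PageTextSource {
        int pageCount() throws IOException;

        /** Returns the text of one page; page numbers are 1-based. */
        String textOfPage(int pageNumber) throws IOException;
    }

    /**
     * Extracts one page at a time and writes it to the sink right away,
     * so only a single page's text is held in memory at any moment.
     * Returns the number of pages written.
     */
    public static int extractPageByPage(PageTextSource source, Writer out)
            throws IOException {
        int pages = source.pageCount();
        for (int page = 1; page <= pages; page++) {
            String text = source.textOfPage(page);
            out.write(text);
            out.flush(); // hand the text on before touching the next page
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real PDF: two fake pages, for demonstration only.
        final String[] fakePages = { "first page\n", "second page\n" };
        PageTextSource demo = new PageTextSource() {
            public int pageCount() { return fakePages.length; }
            public String textOfPage(int pageNumber) { return fakePages[pageNumber - 1]; }
        };

        StringWriter sink = new StringWriter();
        int written = extractPageByPage(demo, sink);
        System.out.print(sink);
        System.out.println("pages written: " + written);
    }
}
```

In real use the `StringWriter` would be replaced by a file or stream writer. Note that this only limits how much extracted text is held at once; it does not change how the document itself is loaded.
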
> I have the problem that some documents cause OutOfMemory errors. And
> increasing the heapsize is not an option...

Hmm, on the other hand this could be an issue in PDFBox itself. Could you provide us with a sample document that triggers the OutOfMemory error? If so, please create an issue in JIRA [1] and attach the PDF to it.

> So would it somehow be possible to read in the text of a pdf either
> sequentially.. or may be load the PDF without images so size would be
> restricted.

During text extraction, all operators that aren't needed for the extraction itself should be skipped. See [2] for details.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX
[2] http://svn.apache.org/viewvc/incubator/pdfbox/trunk/src/main/resources/Resources/PDFTextStripper.properties?view=log