[
https://issues.apache.org/jira/browse/PDFBOX-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194552#comment-13194552
]
Timo Boehme commented on PDFBOX-847:
------------------------------------
Regarding determining corrupt streams:
in the case of FlateFilter, PDFBox relies on Java's ZIP implementation. If that
is broken there is not much we can do about it (besides notifying Oracle).
Alternative Java ZIP implementations are slower (though they may perform better
in multi-threaded environments).
Regarding OutOfMemoryErrors:
the reasons for not catching them are explained in the above comments. If the
OOM is simply caused by a large image, in most cases it should be enough to call
PDDocument.load with a RandomAccessFile instead of a RandomAccessBuffer.
However, there are two places where a filter (e.g. FlateFilter) will write to
memory in every case:
PDStream.getPartiallyFilteredStream and PDInlinedImage.createImage.
In both cases the filter writes the result into a ByteArrayOutputStream. Here
a configurable maximum size, which might be enforced by a wrapper class around
ByteArrayOutputStream, could help to prevent OOM. However, this was not the
reason for PDFBOX-453. Thus we would first need a test case showing that this
is a real problem.
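A minimal sketch of such a size-limiting wrapper, for illustration only. The
class name SizeLimitedOutputStream and the maxBytes parameter are hypothetical
and not part of PDFBox. It wraps the target stream rather than subclassing
ByteArrayOutputStream, because ByteArrayOutputStream.write does not declare
IOException and so could not report the overflow as a checked exception.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical helper: refuses to pass on more than maxBytes bytes in total.
class SizeLimitedOutputStream extends FilterOutputStream
{
    private final long maxBytes;
    private long written = 0;

    SizeLimitedOutputStream(OutputStream target, long maxBytes)
    {
        super(target);
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException
    {
        checkLimit(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException
    {
        // Check before writing so a rejected chunk never reaches the target.
        checkLimit(len);
        out.write(b, off, len);
        written += len;
    }

    private void checkLimit(int additional) throws IOException
    {
        if (written + additional > maxBytes)
        {
            throw new IOException(
                "Decoded stream exceeds the limit of " + maxBytes + " bytes");
        }
    }
}
```

A filter would then write into new SizeLimitedOutputStream(new
ByteArrayOutputStream(), limit) and fail with an IOException instead of growing
the buffer until the heap is exhausted.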
Furthermore, an OOM while decompressing a stream does not necessarily
mean that the stream is corrupt. It could also be that you have a memory leak
or a memory-intensive parallel task, and only by accident no memory was left
while reading the stream.
Regarding skipping pages:
with the current parser the PDF is processed sequentially. Thus streams will be
read (but not decoded) in every case. If you configure PDFBox to not handle
images there should be no problem with broken streams (at least if the stream
is correctly closed by 'endstream'). Again: use RandomAccessFile instead of
RandomAccessBuffer if you have memory problems.
With the work on a new parser (PDFBOX-1000) it will be possible to only touch
needed objects and thus image streams can be ignored completely.
Regarding detection of picture PDFs (text as scanned image):
you could have an image handler which only stores/provides image dimensions and
deduce from a whole-page image that it could be scanned text. However, with
journals, for example, you might have large background images, so that alone is
not a reliable indicator. You also need to count the characters on the page to
decide whether it contains only scanned text. In my tests even picture PDF
documents contained up to 60-80 characters of text because the header/footer
was printed as normal text. Thus this is quite a tricky task.
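The combination of both signals could be sketched roughly as follows. This is
not PDFBox code: the class, the method and both thresholds are assumptions
chosen for illustration (the character limit is set slightly above the 60-80
characters seen in the tests mentioned above), and the inputs are expected to
come from whatever image handler and text extraction you already have.

```java
// Hypothetical heuristic: a page is treated as scanned text only if one image
// covers (nearly) the whole page AND the page carries very little real text.
class ScannedPageHeuristic
{
    // Assumed threshold: tolerate a short header/footer printed as normal text.
    static final int MAX_TEXT_CHARS = 100;
    // Assumed threshold: fraction of the page area the image must cover.
    static final double MIN_IMAGE_COVERAGE = 0.9;

    static boolean looksScanned(int charCount,
                                double imageWidth, double imageHeight,
                                double pageWidth, double pageHeight)
    {
        if (pageWidth <= 0 || pageHeight <= 0)
        {
            return false; // no usable page size, so no decision
        }
        double coverage = (imageWidth * imageHeight) / (pageWidth * pageHeight);
        return coverage >= MIN_IMAGE_COVERAGE && charCount <= MAX_TEXT_CHARS;
    }
}
```

A journal page with a full-page background image but a few thousand characters
of text is rejected by the character count, which is the case the image
dimensions alone would get wrong.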
> FlateFilter.java swallows Exceptions (should rethrow)
> -----------------------------------------------------
>
> Key: PDFBOX-847
> URL: https://issues.apache.org/jira/browse/PDFBOX-847
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.2.1
> Reporter: Andreas Wollschlaeger
> Assignee: Andreas Lehmkühler
> Fix For: 1.7.0
>
>
> I just re-discovered an issue in FlateFilter.java, which I mentioned quite a
> while ago on the mailing list, and which was agreed to be a misfeature :-)
> In FlateFilter.java, at lines 115ff, we find this piece of code:
>     try
>     {
>         // decoding not needed
>         while ((amountRead = decompressor.read(buffer, 0,
>                 Math.min(mayRead,BUFFER_SIZE))) != -1)
>         {
>             result.write(buffer, 0, amountRead);
>         }
>     }
>     catch (OutOfMemoryError exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (ZipException exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (EOFException exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
> which means these Exceptions are discarded and not reported upstream to the
> caller. This is very unfortunate, as the caller has no means to discover that
> text extraction is incomplete. I discovered this while troubleshooting Alfresco
> DMS, which uses PDFBox for indexing PDF documents - apart from an innocuous log
> message, Alfresco does not know that conversion has failed.
> The proposed solution is to re-throw all 3 Exceptions and let the caller
> handle them.
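
For illustration, a read loop that lets these failures reach the caller might
look like the sketch below. It is not the actual FlateFilter code or the fix
that was committed: the class name PropagatingInflate, the inflate method and
the buffer size are assumptions. ZipException and EOFException are both
IOExceptions, so simply not catching them is enough to propagate them, and an
OutOfMemoryError propagates the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: inflates a zlib stream and reports corruption upstream.
class PropagatingInflate
{
    private static final int BUFFER_SIZE = 4096; // illustrative value

    static void inflate(InputStream compressed, OutputStream result)
        throws IOException
    {
        Inflater inflater = new Inflater();
        try
        {
            InflaterInputStream decompressor =
                new InflaterInputStream(compressed, inflater);
            byte[] buffer = new byte[BUFFER_SIZE];
            int amountRead;
            // No catch block: a ZipException (bad data) or EOFException
            // (truncated data) is thrown to the caller.
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1)
            {
                result.write(buffer, 0, amountRead);
            }
        }
        finally
        {
            // Release the inflater without closing the caller's streams.
            inflater.end();
        }
    }
}
```

A caller such as an indexer can then catch IOException around the extraction
and mark the document as only partially converted.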