[jira] [Commented] (PDFBOX-847) FlateFilter.java swallows Exceptions (should rethrow)

Timo Boehme (Commented) (JIRA) Fri, 27 Jan 2012 00:52:26 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194552#comment-13194552
 ]


Timo Boehme commented on PDFBOX-847:
------------------------------------

Regarding detecting corrupt streams:
For FlateFilter, PDFBox relies on Java's built-in ZIP implementation. If that 
implementation is broken, there is not much we can do about it (besides 
notifying Oracle). Alternative Java ZIP implementations are slower (although 
they may perform better in multi-threaded environments).

Regarding OutOfMemoryErrors:
The reasons for not catching them are explained in the comments above. If the 
OOM is simply caused by a large image, in most cases it should be enough to 
call PDDocument.load with a RandomAccessFile instead of a RandomAccessBuffer.
However, there are two places where a filter (e.g. FlateFilter) always writes 
to memory:
PDStream.getPartiallyFilteredStream and PDInlinedImage.createImage.
In both cases the filter writes its result into a ByteArrayOutputStream. Here 
a configurable maximum size, enforced by a wrapper class around 
ByteArrayOutputStream, could help to prevent OOM. However, this was not the 
cause of PDFBOX-453, so we would first need a test case showing that this 
is a real problem.
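
A minimal sketch of such a size-limiting wrapper, to make the idea concrete. 
The class name SizeLimitedOutputStream and its constructor parameters are 
hypothetical and not part of PDFBox. It wraps the target stream rather than 
subclassing ByteArrayOutputStream, because the write methods of 
ByteArrayOutputStream cannot throw a checked IOException.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical helper (not part of PDFBox): forwards writes to the wrapped
 * stream, but fails with an IOException once more than maxBytes would be
 * written in total.
 */
class SizeLimitedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long written;

    SizeLimitedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException {
        checkLimit(1);
        out.write(b);
        written++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        checkLimit(len);
        // Delegate the whole chunk; FilterOutputStream would write byte by byte.
        out.write(b, off, len);
        written += len;
    }

    /** Rejects the write before any byte of it reaches the wrapped stream. */
    private void checkLimit(int len) throws IOException {
        if (len > maxBytes - written) {
            throw new IOException(
                    "Decoded data exceeds the configured limit of " + maxBytes + " bytes");
        }
    }

    long getBytesWritten() {
        return written;
    }
}
```

A filter would then write into new SizeLimitedOutputStream(new 
ByteArrayOutputStream(), limit) and the caller would get an IOException 
instead of an OOM when the limit is exceeded. How the limit would be 
configured is left open here.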

Furthermore, an OOM while decompressing a stream does not necessarily 
mean that the stream is corrupt. You could also have a memory leak 
or a memory-intensive parallel task, so that it is only by accident that no 
memory was left while reading the stream.

Regarding skipping pages:
With the current parser the PDF is processed sequentially, so streams will be 
read (but not decoded) in every case. If you configure PDFBox not to handle 
images, there should be no problem with broken streams (at least if the stream 
is correctly closed by 'endstream'). Again: use RandomAccessFile instead of 
RandomAccessBuffer if you have memory problems.
With the work on a new parser (PDFBOX-1000) it will be possible to touch only 
the needed objects, so image streams can be ignored completely.

Regarding detection of picture PDFs (text as scanned image):
You could have an image handler which only stores/provides image dimensions and 
deduce from a whole-page image that the page could be scanned text. However, 
e.g. with journals you might have large background images, so that alone is not 
a reliable indicator. You also need to count the characters on the page to 
decide whether the page contains only scanned text. In my tests even picture 
PDF documents contained up to 60-80 characters of text because the 
header/footer was printed as normal text. Thus this is quite a tricky task.
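
As a rough illustration of combining the two signals, here is a sketch that 
is not PDFBox code. The class ScannedPageHeuristic, its thresholds, and the 
assumption that image and page dimensions are given in the same units are all 
hypothetical. A threshold of about 80 characters only mirrors the 60-80 
characters seen in the tests mentioned above.

```java
/**
 * Hypothetical heuristic (not part of PDFBox): a page is treated as
 * "probably scanned text" if one image covers most of the page AND the
 * page carries only a small amount of real text.
 */
final class ScannedPageHeuristic {
    private final double minCoverage;  // assumed fraction of page area, e.g. 0.9
    private final int maxCharacters;   // assumed tolerance for header/footer text, e.g. 80

    ScannedPageHeuristic(double minCoverage, int maxCharacters) {
        this.minCoverage = minCoverage;
        this.maxCharacters = maxCharacters;
    }

    /**
     * Page and image dimensions must use the same units (an assumption of
     * this sketch); characterCount is the number of extracted text characters.
     */
    boolean looksScanned(double pageWidth, double pageHeight,
                         double imageWidth, double imageHeight,
                         int characterCount) {
        if (pageWidth <= 0 || pageHeight <= 0) {
            return false;
        }
        double coverage = (imageWidth * imageHeight) / (pageWidth * pageHeight);
        // Both conditions are required: a large background image alone is
        // not enough, and a few characters alone are not enough either.
        return coverage >= minCoverage && characterCount <= maxCharacters;
    }
}
```

This remains a heuristic. A journal page with a large background image and 
little text would still be misclassified.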
                
> FlateFilter.java swallows Exceptions (should rethrow)
> -----------------------------------------------------
>
>                 Key: PDFBOX-847
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-847
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Andreas Wollschlaeger
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>
> I just re-discovered an issue in FlateFilter.java, which I mentioned quite a 
> while ago on the mailing list, and which was agreed to be a misfeature :-)
> In FlateFilter.java, at lines 115ff, we find this piece of code:
>     try
>     {
>         // decoding not needed
>         while ((amountRead = decompressor.read(buffer, 0,
>                 Math.min(mayRead, BUFFER_SIZE))) != -1)
>         {
>             result.write(buffer, 0, amountRead);
>         }
>     }
>     catch (OutOfMemoryError exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (ZipException exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (EOFException exception)
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
> which means these Exceptions are discarded and not reported upstream to the 
> caller. This is very unfortunate, as the caller has no means to discover that 
> text extraction is incomplete. I discovered this while troubleshooting 
> Alfresco DMS, which uses PDFBox for indexing PDF documents - apart from an 
> innocuous log message, Alfresco does not know that conversion has failed.
> The proposed solution is to re-throw all 3 Exceptions and let the caller 
> handle them.
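
To make the proposal concrete, here is a minimal standalone sketch of a decode 
loop that reports corruption to the caller instead of only logging it. It is 
not the actual FlateFilter code: the class FlateDecodeSketch, the decode method 
and the buffer size are assumptions for illustration. It uses only 
java.util.zip.InflaterInputStream from the JDK. OutOfMemoryError is 
deliberately not caught, so it propagates unchanged.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

/** Hypothetical sketch, not the real FlateFilter implementation. */
class FlateDecodeSketch {
    private static final int BUFFER_SIZE = 16384; // assumed buffer size

    static void decode(InputStream compressed, OutputStream result) throws IOException {
        InflaterInputStream decompressor = new InflaterInputStream(compressed);
        byte[] buffer = new byte[BUFFER_SIZE];
        int amountRead;
        try {
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1) {
                result.write(buffer, 0, amountRead);
            }
        } catch (ZipException | EOFException e) {
            // Do not swallow the failure: add context and hand it to the
            // caller, which can then tell that the output is incomplete.
            throw new IOException("Stop reading corrupt stream", e);
        }
    }
}
```

Since ZipException and EOFException are both subclasses of IOException, simply 
removing the catch blocks would also let them reach the caller. Wrapping them 
only adds context to the message.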

