[ 
https://issues.apache.org/jira/browse/PDFBOX-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194510#comment-13194510
 ] 

Mahesh Yadav commented on PDFBOX-847:
-------------------------------------

Will this not reopen the PDFBOX-453 bug?

I am getting the log message "FlateFilter: stop reading corrupt stream", 
and it crashes my application.

Users are uploading scanned documents saved as PDFs, ranging from 20 to 80 MB. 
Is there no mechanism by which we can determine that the incoming stream is 
corrupt?

Or else, in my case at least, would it be possible to find out whether a PDF 
page (or the whole PDF) contains a scanned image, so that I can skip text 
extraction for that page? Would that help?


Any help would be appreciated.

Thanks
Mahesh
                
> FlateFilter.java swallows Exceptions (should rethrow)
> -----------------------------------------------------
>
>                 Key: PDFBOX-847
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-847
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Andreas Wollschlaeger
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>
> I just rediscovered an issue in FlateFilter.java, which I mentioned quite a 
> while ago on the mailing list, and which was agreed to be a misfeature :-)
> In FlateFilter.java, at lines 115ff, we find this piece of code:
>                     try 
>                     {
>                         // decoding not needed
>                         while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead,BUFFER_SIZE))) != -1)
>                         {
>                             result.write(buffer, 0, amountRead);
>                         }
>                     }
>                     catch (OutOfMemoryError exception) 
>                     {
>                         // if the stream is corrupt an OutOfMemoryError may occur
>                         log.error("Stop reading corrupt stream");
>                     }
>                     catch (ZipException exception) 
>                     {
>                         // if the stream is corrupt an OutOfMemoryError may occur
>                         log.error("Stop reading corrupt stream");
>                     }
>                     catch (EOFException exception) 
>                     {
>                         // if the stream is corrupt an OutOfMemoryError may occur
>                         log.error("Stop reading corrupt stream");
>                     }
> which means these exceptions are discarded and not reported upstream to the 
> caller. This is very unfortunate, as the caller has no means to discover that 
> text extraction is incomplete. I discovered this while troubleshooting Alfresco 
> DMS, which uses PDFBox for indexing PDF documents: apart from an innocent log 
> message, Alfresco does not know that the conversion has failed.
> The proposed solution is to rethrow all three exceptions and let the caller 
> handle them.
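
To illustrate the proposal above, here is a minimal sketch of a decode loop 
that lets stream errors reach the caller instead of logging and swallowing 
them. It uses only java.util.zip and is not the PDFBox FlateFilter code: the 
class name FlateRethrowSketch and the helpers inflate() and deflate() are 
hypothetical, and the mayRead bookkeeping from the quoted snippet is omitted. 
ZipException and EOFException are both subclasses of IOException, so declaring 
"throws IOException" is enough for the caller to see them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch, not the PDFBox implementation.
public class FlateRethrowSketch {
    private static final int BUFFER_SIZE = 16384;

    // Decompresses zlib/Flate data. No catch block here: a ZipException or
    // EOFException raised on a corrupt stream propagates to the caller.
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream decompressor =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[BUFFER_SIZE];
            int amountRead;
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1) {
                result.write(buffer, 0, amountRead);
            }
        }
        return result.toByteArray();
    }

    // Helper used only to build sample input for the demonstration.
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] good = deflate("some page content".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inflate(good), StandardCharsets.UTF_8));

        // Simulate a corrupt stream by cutting the compressed data in half.
        byte[] truncated = Arrays.copyOf(good, good.length / 2);
        try {
            inflate(truncated);
        } catch (IOException e) {
            // The caller can now skip the page, flag the document, and so on.
            System.out.println("corrupt stream reported to caller: " + e);
        }
    }
}
```

With this shape, a caller such as an indexer can decide for itself whether to 
abort, skip the affected page, or mark the extraction as incomplete. How an 
OutOfMemoryError should be treated is a separate question that this sketch 
leaves aside.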


