[ https://issues.apache.org/jira/browse/PDFBOX-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194577#comment-13194577 ]

Mahesh Yadav commented on PDFBOX-847:
-------------------------------------

Thanks Timo, I appreciate your quick response.

We have heavy DMS usage and we are using Jackrabbit as the repository. Our 
server crashed when some users uploaded scanned PDF documents of around 
50-80 MB (this time there was no FlateFilter message). I am looking forward 
to trying your suggestion (RandomAccessFile instead of RandomAccessBuffer).

We use Jackrabbit; the only difference on our side is that we have our own 
custom parser (not provided by Jackrabbit) for parsing PDFs, and we interact 
with PDFBox as shown below.

PDFParser parser = new PDFParser(new BufferedInputStream(stream));
parser.parse();
PDDocument document = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setLineSeparator("\n");
stripper.writeText(document, writer);

I think we need to change the above approach and use PDDocument.load with a 
RandomAccessFile scratch file.
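A minimal sketch of that change, assuming the PDFBox 1.x PDFParser(InputStream, RandomAccess) constructor and org.apache.pdfbox.io.RandomAccessFile; the temp-file handling and the ScratchFileExtraction class name are illustrative, not from the issue:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.InputStream;
import java.io.Writer;

import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class ScratchFileExtraction
{
    // Parse a large PDF with a file-backed scratch buffer instead of the
    // in-memory RandomAccessBuffer, so 50-80 MB uploads do not exhaust heap.
    public static void extract(InputStream stream, Writer writer) throws Exception
    {
        File scratch = File.createTempFile("pdfbox-scratch", ".tmp");
        PDDocument document = null;
        try
        {
            PDFParser parser = new PDFParser(new BufferedInputStream(stream),
                    new RandomAccessFile(scratch, "rw"));
            parser.parse();
            document = parser.getPDDocument();
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setLineSeparator("\n");
            stripper.writeText(document, writer);
        }
        finally
        {
            if (document != null)
            {
                document.close();
            }
            scratch.delete();
        }
    }
}
```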

Thanks  
Mahesh

> FlateFilter.java swallows Exceptions (should rethrow)
> -----------------------------------------------------
>
>                 Key: PDFBOX-847
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-847
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.1
>            Reporter: Andreas Wollschlaeger
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>
> I just re-discovered an issue in FlateFilter.java, which I mentioned quite a 
> while ago on the mailing list, and which was agreed to be a misfeature :-)
> In FlateFilter.java, at lines 115ff, we find this piece of code:
>     try 
>     {
>         // decoding not needed
>         while ((amountRead = decompressor.read(buffer, 0, Math.min(mayRead, BUFFER_SIZE))) != -1)
>         {
>             result.write(buffer, 0, amountRead);
>         }
>     }
>     catch (OutOfMemoryError exception) 
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (ZipException exception) 
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
>     catch (EOFException exception) 
>     {
>         // if the stream is corrupt an OutOfMemoryError may occur
>         log.error("Stop reading corrupt stream");
>     }
> which means these exceptions are discarded and not reported upstream to the 
> caller. This is very unfortunate, as the caller has no means to discover that 
> text extraction is incomplete. I discovered this while troubleshooting 
> Alfresco DMS, which uses PDFBox for indexing PDF documents - apart from an 
> innocuous log message, Alfresco does not know that conversion has failed.
> The proposed solution is to re-throw all three exceptions and let the caller 
> handle them.
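The re-throw idea can be sketched outside PDFBox with just java.util.zip; the RethrowSketch class and the wrapping of each failure in an IOException are illustrative assumptions, not the actual FlateFilter patch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

public class RethrowSketch
{
    // Decompress a flate (deflate) stream; instead of swallowing failures
    // with log.error, wrap them in an IOException so the caller can tell
    // that extraction is incomplete.
    static byte[] decode(byte[] compressed) throws IOException
    {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        InflaterInputStream decompressor =
                new InflaterInputStream(new ByteArrayInputStream(compressed));
        byte[] buffer = new byte[2048];
        int amountRead;
        try
        {
            while ((amountRead = decompressor.read(buffer)) != -1)
            {
                result.write(buffer, 0, amountRead);
            }
        }
        catch (OutOfMemoryError exception)
        {
            throw new IOException("Out of memory reading corrupt flate stream");
        }
        catch (ZipException exception)
        {
            throw new IOException("Corrupt flate stream", exception);
        }
        catch (EOFException exception)
        {
            throw new IOException("Truncated flate stream", exception);
        }
        return result.toByteArray();
    }

    public static void main(String[] args)
    {
        try
        {
            decode(new byte[] { 1, 2, 3, 4 }); // not valid deflate data
            System.out.println("no error");
        }
        catch (IOException expected)
        {
            System.out.println("rethrown: " + expected.getMessage());
        }
    }
}
```

With this shape, a caller such as an indexer can distinguish a complete extraction from a truncated one instead of silently indexing partial text.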

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

