FlateFilter.java swallows Exceptions (should rethrow)
-----------------------------------------------------
Key: PDFBOX-847
URL: https://issues.apache.org/jira/browse/PDFBOX-847
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.2.1
Reporter: Andreas Wollschlaeger
I just rediscovered an issue in FlateFilter.java that I mentioned quite a
while ago on the mailing list, and which was agreed to be a misfeature :-)
In FlateFilter.java, at lines 115ff, we find this piece of code:
    try
    {
        // decoding not needed
        while ((amountRead = decompressor.read(buffer, 0,
                Math.min(mayRead,BUFFER_SIZE))) != -1)
        {
            result.write(buffer, 0, amountRead);
        }
    }
    catch (OutOfMemoryError exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (ZipException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
    catch (EOFException exception)
    {
        // if the stream is corrupt an OutOfMemoryError may occur
        log.error("Stop reading corrupt stream");
    }
which means these exceptions are discarded and never reported upstream to the
caller. This is very unfortunate, as the caller has no way to discover that
text extraction is incomplete. I discovered this while troubleshooting Alfresco
DMS, which uses PDFBox for indexing PDF documents: apart from an innocuous log
message, Alfresco has no way of knowing that the conversion has failed.
The proposed solution is to rethrow all three exceptions and let the caller
handle them.
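
A minimal sketch of what the proposed change could look like, not the actual
PDFBox code. The class name FlateDecodeSketch, the decode method, and the
BUFFER_SIZE value are hypothetical and chosen for illustration only; the
mayRead limit from the original loop is left out, and System.err stands in
for the logger so that the example compiles on its own. It also uses
multi-catch, which needs Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.InflaterInputStream;
import java.util.zip.ZipException;

// Hypothetical stand-in for the decoding loop in FlateFilter.java.
class FlateDecodeSketch
{
    // Assumed value; the real constant in FlateFilter.java may differ.
    private static final int BUFFER_SIZE = 16384;

    static void decode(InputStream compressedData, OutputStream result)
            throws IOException
    {
        InflaterInputStream decompressor = new InflaterInputStream(compressedData);
        byte[] buffer = new byte[BUFFER_SIZE];
        int amountRead;
        try
        {
            while ((amountRead = decompressor.read(buffer, 0, BUFFER_SIZE)) != -1)
            {
                result.write(buffer, 0, amountRead);
            }
        }
        catch (ZipException | EOFException exception)
        {
            // Keep the log message, but report the failure to the caller too.
            System.err.println("Stop reading corrupt stream");
            throw exception;
        }
        catch (OutOfMemoryError error)
        {
            // A corrupt stream may also exhaust memory; do not hide that either.
            System.err.println("Stop reading corrupt stream");
            throw error;
        }
    }

    public static void main(String[] args)
    {
        // These bytes are not a valid zlib stream, so decoding should fail.
        byte[] corrupt = {1, 2, 3, 4, 5};
        try
        {
            decode(new ByteArrayInputStream(corrupt), new ByteArrayOutputStream());
            System.out.println("decoded without error");
        }
        catch (IOException e)
        {
            // The caller can now tell that extraction is incomplete.
            System.out.println("decode failed: " + e);
        }
    }
}
```

Since ZipException and EOFException are both subclasses of IOException, a
caller that already handles IOException would not need a new catch clause
under this sketch.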