[
https://issues.apache.org/jira/browse/PDFBOX-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andreas Lehmkühler resolved PDFBOX-1744.
----------------------------------------
Resolution: Fixed
Fix Version/s: 2.0.0
Assignee: Andreas Lehmkühler
I applied the patch in revision 1536463 as proposed.
Thanks for the contribution!
> Be resilient to PDFs with missing version info
> ----------------------------------------------
>
> Key: PDFBOX-1744
> URL: https://issues.apache.org/jira/browse/PDFBOX-1744
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 1.8.2
> Environment: PDFBox 1.8.2, IntelliJ IDEA 12.1.6, Mac OS X 10.7.5,
> Java 1.7, Maven 2.2.1
> Reporter: Chris Bamford
> Assignee: Andreas Lehmkühler
> Priority: Minor
> Fix For: 1.8.3, 2.0.0
>
> Attachments: no_version.pdf, pdfbox.patch
>
>
> Proposed addition to 1.8.2 ->
> pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java ->
> parseHeader() to default the PDF version to 1.4 in situations where it is
> missing (yes, there really are docs out there like this!).
> This prevents an exception caused by a negative substring offset
> calculation: "String index out of range: -3"
> I have floated the question on the PDFBox mailing list (10th
> October 2013) and it was suggested I default the PDF version to 1.4 in this
> scenario. I have tested it locally and it works (apparently PDFBox doesn't
> take the version number into account anyway).
> Now over to you guys to decide if this is a good idea or not in the wider
> scope.
> Should you give the green light, I attach:
> 1) a sample file which causes the exception
> 2) a patch file
> 3) patching instructions.
> My goal is text extraction, even on broken files (if possible).
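
For readers who do not open the attachment, the idea described above can be sketched roughly as follows. This is not the attached pdfbox.patch and not the actual PDFParser.parseHeader() code; the class name HeaderVersionSketch, the method parseVersion and its constants are hypothetical, and the sketch assumes the header looks like "%PDF-1.4" when it is well formed. It only illustrates checking the length before calling substring and falling back to 1.4 when the version is missing.

```java
// Hypothetical illustration only: not the PDFBox implementation.
final class HeaderVersionSketch {
    // Assumed header marker for a well-formed file, e.g. "%PDF-1.4".
    static final String PDF_MARKER = "%PDF-";
    // Fallback suggested in the issue description above.
    static final float DEFAULT_VERSION = 1.4f;

    // Returns the version found in the header line, or 1.4 if it is
    // absent or unreadable, instead of letting substring() throw.
    static float parseVersion(String headerLine) {
        if (headerLine == null) {
            return DEFAULT_VERSION;
        }
        int start = headerLine.indexOf(PDF_MARKER);
        if (start < 0) {
            return DEFAULT_VERSION;
        }
        String rest = headerLine.substring(start + PDF_MARKER.length()).trim();
        // A version such as "1.4" needs three characters; guard against
        // shorter input so no out-of-range offset is ever computed.
        if (rest.length() < 3) {
            return DEFAULT_VERSION;
        }
        try {
            return Float.parseFloat(rest.substring(0, 3));
        } catch (NumberFormatException e) {
            return DEFAULT_VERSION;
        }
    }
}
```

With this sketch, a header of "%PDF-1.6" yields 1.6, while "%PDF-", an empty line, or a line with a non-numeric version all yield the 1.4 default rather than an exception.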