[
https://issues.apache.org/jira/browse/PDFBOX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883611#action_12883611
]
Adam Nichols commented on PDFBOX-569:
-------------------------------------
I am not in favor of #2. While this isn't compliant, I've seen many production
documents which have this issue. When some programs merge two documents
together, they'll just basically concatenate the PDFs. The PDFs I've looked
into which have this issue have all mentioned Acrobat Distiller, but I
can't confirm that is what created the original PDFs, or what merged them
together since I don't have that software.
PDFBOX-720 has some notes which are related to the issue of non-unique object
IDs. David recommended a proper fix (the whole parsing module needs to be
rewritten in accordance with the PDF spec) as well as a quick fix (use a
LinkedHashMap and update parseXrefStreams). Fixing this may eliminate this
issue as it'll be looking at the "correct" object. I don't have time to work
on PDFBOX-720 right now, but if someone has time to make a patch I can help
test and confirm whether or not it's working.
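To make the non-unique object ID problem concrete, here is a rough sketch. It is not David's patch from PDFBOX-720 and it is not PDFBox code; the class name, the helper and the sample data are made up for illustration. It assumes objects are indexed by "number generation" in file order, so when two concatenated PDFs both define "12 0", the later definition replaces the earlier one. A reference that expected a font can then resolve to a font descriptor, which looks a lot like the exception in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateObjectIdSketch {

    /**
     * Hypothetical helper: indexes object definitions in file order.
     * Each entry is {objectKey, body}. LinkedHashMap keeps the order in
     * which keys were first seen; put() on an existing key replaces the
     * value, so the last definition in the file wins.
     */
    public static Map<String, String> index(String[][] definitionsInFileOrder) {
        Map<String, String> objects = new LinkedHashMap<>();
        for (String[] def : definitionsInFileOrder) {
            objects.put(def[0], def[1]);
        }
        return objects;
    }

    public static void main(String[] args) {
        // Made-up data: two concatenated documents that both use ID "12 0".
        String[][] defs = {
            {"12 0", "/Type /Font"},           // from the first document
            {"13 0", "/Type /Page"},
            {"12 0", "/Type /FontDescriptor"}, // from the second document
        };
        Map<String, String> objects = index(defs);

        // A page in the first document refers to "12 0" as a font,
        // but the lookup now returns the second document's object.
        System.out.println("12 0 resolves to: " + objects.get("12 0"));
    }
}
```

Whichever definition counts as "correct" depends on which xref section the reference belongs to, so a real fix would have to track that rather than simply letting the last one win.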
If your PDF opens fine in Adobe Reader, then I don't think it should throw an
exception when meeting an invalid font. While I'd much rather go by the PDF
spec instead of what Adobe Reader does, my users are going to say my software
is inferior if it can't handle PDFs which work fine (in Adobe Reader). Since
software is made for the users, it'd probably be a good idea to take their
perspective into account, even if it means parsing out-of-spec docs. It might
be nice if we had some kind of flag we could check in addition to logging these
errors. So if the library detects that a PDF is non-compliant, it can log a
detailed error message, and set a boolean to true. That way the calling
program can handle this in a special way, such as printing a warning message to
the user.
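A minimal sketch of the flag idea follows. None of these names exist in PDFBox; ParseResult, recordNonCompliance and checkFontType are hypothetical and only show the shape of the API I have in mind: log the detailed message, remember that the document was non-compliant, and let the caller decide what to do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

public class LenientParseSketch {

    private static final Logger LOG =
            Logger.getLogger(LenientParseSketch.class.getName());

    /** Hypothetical result object carrying the non-compliance flag. */
    public static final class ParseResult {
        private final List<String> warnings = new ArrayList<>();

        /** Logs a detailed message and remembers it for the caller. */
        void recordNonCompliance(String detail) {
            LOG.warning(detail);
            warnings.add(detail);
        }

        /** True if anything out of spec was seen while parsing. */
        public boolean isNonCompliant() {
            return !warnings.isEmpty();
        }

        public List<String> getWarnings() {
            return Collections.unmodifiableList(warnings);
        }
    }

    /**
     * Hypothetical check standing in for the font factory: instead of
     * throwing when /Type is not /Font, record the problem and carry on.
     */
    public static ParseResult checkFontType(String actualType) {
        ParseResult result = new ParseResult();
        if (!"Font".equals(actualType)) {
            result.recordNonCompliance(
                "Expected /Type /Font but found /" + actualType);
        }
        return result;
    }

    public static void main(String[] args) {
        ParseResult result = checkFontType("FontDescriptor");
        if (result.isNonCompliant()) {
            // The calling program chooses how to surface this to the user.
            System.out.println("Warning: this PDF is not spec-compliant; "
                    + "extracted text may be incomplete.");
        }
    }
}
```

Keeping a list of messages rather than a bare boolean costs little and lets the caller show the details if it wants to.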
> Text-Extraction of PDF fails
> ----------------------------
>
> Key: PDFBOX-569
> URL: https://issues.apache.org/jira/browse/PDFBOX-569
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Environment: 1.6.0_11
> Reporter: Stephan Götter
> Priority: Blocker
> Attachments: b820GL0204.pdf
>
>
> Using trunk this Exception occurs when extracting text of attached PDF.
> [WARN] PDFParser - invalid xref line: 0
> java.io.IOException: Cannot create font if /Type is not /Font. Actual=COSName{FontDescriptor}
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:95)
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:68)
> at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:117)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)