[jira] Commented: (PDFBOX-569) Text-Extraction of PDF fails

Adam Nichols (JIRA) Tue, 29 Jun 2010 09:55:17 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883611#action_12883611
 ]


Adam Nichols commented on PDFBOX-569:
-------------------------------------

I am not in favor of #2.  While this isn't compliant, I've seen many production 
documents which have this issue.  When some programs merge two documents 
together, they basically just concatenate the PDFs.  The PDFs I've looked 
into which have this issue have all mentioned Acrobat Distiller, but I 
can't confirm that it created the original PDFs, or what merged them 
together, since I don't have that software.

PDFBOX-720 has some notes which are related to the issue of non-unique object 
IDs.  David recommended a proper fix (the whole parsing module needs to be 
rewritten in accordance with the PDF spec) as well as a quick fix (use a 
LinkedHashMap and update parseXrefStreams).  Fixing that may eliminate this 
issue, since the parser would then be looking at the "correct" object.  I don't have time to work 
on PDFBOX-720 right now, but if someone has time to make a patch I can help 
test and confirm whether or not it's working.
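
To make the quick fix a bit more concrete, here is a rough sketch of the idea 
as I understand it.  This is not the PDFBOX-720 patch and not PDFBox code: the 
class and method names are made up for illustration, and it assumes that when 
an object ID is defined twice, the later definition is the one that should win 
(as with incremental updates).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in PDFBox.
final class ObjectTable {
    // Key is "objectNumber generation", value is the byte offset of the object.
    // A LinkedHashMap iterates in first-insertion order, so walking the table
    // is predictable instead of depending on hash order.
    private final Map<String, Long> offsets = new LinkedHashMap<String, Long>();

    // Record an object definition. Re-defining an existing ID replaces the
    // stored offset (assumption: the later definition is the "correct" one).
    public void define(int objectNumber, int generation, long offset) {
        offsets.put(objectNumber + " " + generation, Long.valueOf(offset));
    }

    // Returns the offset currently recorded for the ID, or null if unknown.
    public Long offsetOf(int objectNumber, int generation) {
        return offsets.get(objectNumber + " " + generation);
    }

    public int size() {
        return offsets.size();
    }
}
```

With two concatenated PDFs that both define object "5 0", the table ends up 
with a single entry for that ID pointing at the second definition, rather than 
whichever one happened to be found first.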

If your PDF opens fine in Adobe Reader, then I don't think PDFBox should throw 
an exception when it encounters an invalid font.  While I'd much rather go by 
the PDF spec instead of what Adobe Reader does, my users are going to say my 
software is inferior if it can't handle PDFs which work fine (in Adobe Reader).  
Since software is made for the users, it'd probably be a good idea to take their 
perspective into account, even if it means parsing out-of-spec docs.  It might 
be nice if we had some kind of flag we could check in addition to logging these 
errors.  So if the library detects that a PDF is non-compliant, it can log a 
detailed error message and set a boolean to true.  That way the calling 
program can handle this in a special way, such as printing a warning message to 
the user.
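
A minimal sketch of what I mean by the flag, assuming a small helper object 
that the parser would carry around.  ComplianceReport and its methods are 
hypothetical, nothing like this exists in PDFBox today, and I've used 
java.util.logging only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper; not part of PDFBox.
final class ComplianceReport {
    private static final Logger LOG =
            Logger.getLogger(ComplianceReport.class.getName());

    private final List<String> problems = new ArrayList<String>();

    // Called by the parser instead of throwing: logs the detailed message
    // and remembers it so the caller can inspect it afterwards.
    public void report(String detail) {
        LOG.warning(detail);
        problems.add(detail);
    }

    // The boolean flag: true once at least one problem has been reported.
    public boolean isNonCompliant() {
        return !problems.isEmpty();
    }

    // Read-only view of the recorded messages.
    public List<String> getProblems() {
        return Collections.unmodifiableList(problems);
    }
}
```

The calling program would then check isNonCompliant() after extraction and, if 
it is true, show the user a warning along the lines of "this PDF is damaged 
but was read anyway" instead of failing outright.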

> Text-Extraction of PDF fails
> ----------------------------
>
>                 Key: PDFBOX-569
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-569
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: 1.6.0_11
>            Reporter: Stephan Götter
>            Priority: Blocker
>         Attachments: b820GL0204.pdf
>
>
> Using trunk this Exception occurs when extracting text of attached PDF.
> [WARN] PDFParser - invalid xref line: 0
> java.io.IOException: Cannot create font if /Type is not /Font.  Actual=COSName{FontDescriptor}
>       at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:95)
>       at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:68)
>       at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:117)
>       at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
>       at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
