[jira] Commented: (PDFBOX-917) Read non-conforming PDFs (attached) without throwing java.io.IOException: expected='endobj' org.apache.pdfbox.io.PushBackInputStream

Martijn Brinkers (JIRA) Fri, 10 Dec 2010 02:42:28 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970123#action_12970123
 ]


Martijn Brinkers commented on PDFBOX-917:
-----------------------------------------

The problem is caused by an endobj that is either missing or corrupt. The 
current code in trunk that tries to handle this situation seems more 
complicated than necessary, and it still fails to handle a missing endobj in 
many cases. I have attached a patch that seems to handle non-conforming PDFs 
with a missing endobj (which happens quite often) better in most cases 
(checked against a large number of PDF ebooks).
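
To illustrate the general idea of tolerating a missing endobj, here is a 
minimal sketch. It is not the attached patch and it does not use PDFBox 
classes: the class LenientEndObj, its methods, and the 16-byte token limit are 
all hypothetical names and values chosen for this example. It only relies on 
java.io.PushbackInputStream from the JDK. If the next token is not endobj, the 
token is pushed back so that parsing can continue with the next object instead 
of throwing an IOException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of PDFBox) that tolerates a missing 'endobj'. */
final class LenientEndObj {

    /** Assumed upper bound on the token length we are willing to push back. */
    private static final int MAX_TOKEN = 16;

    private LenientEndObj() {}

    /**
     * Consumes 'endobj' if it is the next token. Otherwise the token is pushed
     * back so the caller can resume parsing from it.
     * The stream needs a pushback buffer of at least MAX_TOKEN + 1 bytes.
     *
     * @return true if 'endobj' was found, false if it was missing
     */
    static boolean skipEndObj(PushbackInputStream in) throws IOException {
        skipWhitespace(in);
        byte[] token = readToken(in);
        if ("endobj".equals(new String(token, StandardCharsets.ISO_8859_1))) {
            return true;
        }
        in.unread(token); // leave the unexpected token for the next parse step
        return false;
    }

    /** PDF whitespace characters: NUL, TAB, LF, FF, CR and SPACE. */
    private static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    private static void skipWhitespace(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!isWhitespace(c)) {
                in.unread(c);
                return;
            }
        }
    }

    /** Reads up to MAX_TOKEN bytes, stopping before whitespace or at EOF. */
    private static byte[] readToken(PushbackInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int c;
        while (buf.size() < MAX_TOKEN && (c = in.read()) != -1) {
            if (isWhitespace(c)) {
                in.unread(c);
                break;
            }
            buf.write(c);
        }
        return buf.toByteArray();
    }

    /** Convenience for experiments: returns what is left unparsed after the check. */
    static String remainderAfterEndObj(String fragment) {
        byte[] data = fragment.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), MAX_TOKEN + 1)) {
            skipEndObj(in);
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                rest.append((char) c);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, a fragment such as "\n2 0 obj" (endobj missing) leaves 
"2 0 obj" in the stream, so the next object header is still available to the 
parser. A real parser would also have to deal with a corrupt endobj and with 
tokens that are not separated by whitespace, which this sketch ignores.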

> Read non-conforming PDFs (attached) without throwing java.io.IOException: 
> expected='endobj' org.apache.pdfbox.io.PushBackInputStream
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-917
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-917
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.3.1
>         Environment: Used through Apache Tika 0.8
>            Reporter: Alex Rodriguez Lopez
>         Attachments: 2010001615.pdf, PDFBOX-917.patch
>
>
> This happened using the following PDF (~2MB): 
> http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
> When reading non-conforming PDFs like the one above the following exception 
> is thrown and the text extraction partially fails:
> WARN - Parsing Error, Skipping Object
> java.io.IOException: expected='endobj' firstReadAttempt='' 
> secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
