Read non-conforming PDFs (attached) without throwing java.io.IOException: 
expected='endobj' org.apache.pdfbox.io.PushBackInputStream
------------------------------------------------------------------------------------------------------------------------------------

                 Key: PDFBOX-917
                 URL: https://issues.apache.org/jira/browse/PDFBOX-917
             Project: PDFBox
          Issue Type: Improvement
          Components: Parsing
    Affects Versions: 1.3.1
         Environment: Used through Apache Tika 0.8
            Reporter: Alex Rodriguez Lopez


This happened using the following PDF (~2MB): 
http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf

When reading non-conforming PDFs like the one above, the following exception is 
thrown and text extraction partially fails:

WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84) 
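
To illustrate the kind of leniency being requested, here is a minimal, 
self-contained sketch of a scanner that tolerates a missing 'endobj' by 
ending the object at the next "N G obj" header (or at end of input) and 
recording a warning instead of throwing. This is not PDFBox code: the class 
LenientObjectScanner, its scan method and the warning text are all 
hypothetical, and it works on a plain String rather than a real PDF stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; this is not how PDFBox parses objects. */
public class LenientObjectScanner {

    // Matches an indirect-object header such as "12 0 obj".
    private static final Pattern HEADER =
            Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    /**
     * Returns one "num gen: body" entry per object found in data.
     * A missing 'endobj' is reported through warnings, not an exception.
     */
    public static List<String> scan(String data, List<String> warnings) {
        List<String> objects = new ArrayList<>();
        Matcher header = HEADER.matcher(data);
        int pos = 0;
        while (header.find(pos)) {
            int bodyStart = header.end();
            int endObj = data.indexOf("endobj", bodyStart);

            // Locate the next object header, if any, after this one.
            Matcher next = HEADER.matcher(data);
            int nextHeader = next.find(bodyStart) ? next.start() : -1;

            int bodyEnd;
            if (endObj >= 0 && (nextHeader < 0 || endObj < nextHeader)) {
                // Conforming case: 'endobj' closes this object.
                bodyEnd = endObj;
                pos = endObj + "endobj".length();
            } else {
                // Non-conforming case: resynchronize at the next header
                // (or at end of input) and keep what was read so far.
                bodyEnd = nextHeader >= 0 ? nextHeader : data.length();
                pos = bodyEnd;
                warnings.add("missing endobj for object "
                        + header.group(1) + " " + header.group(2));
            }
            objects.add(header.group(1) + " " + header.group(2) + ": "
                    + data.substring(bodyStart, bodyEnd).trim());
        }
        return objects;
    }
}
```

With this approach the content of the malformed object is still returned to 
the caller, so text extraction would not lose it; whether that is safe for 
every kind of damaged file is a separate question.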
