[ https://issues.apache.org/jira/browse/PDFBOX-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Rodriguez Lopez updated PDFBOX-917:
----------------------------------------
Attachment: 2010001615.pdf
This is a non-conforming PDF which, when parsed, throws:
WARN - Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)
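
To make the "Skipping Object" warning concrete, here is a toy sketch of how a tolerant scanner might treat an object whose 'endobj' keyword is missing: it records the object as skipped and carries on with the next object header instead of aborting. This is not PDFBox code; the class LenientObjectScanner, its Result type, and the header regular expression are all hypothetical, and real PDF parsing (streams, xref tables, comments) is far more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: NOT PDFBox code. All names here are hypothetical.
class LenientObjectScanner {
    // Matches an object header such as "12 0 obj" (object number, generation, keyword).
    private static final Pattern HEADER =
            Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    static final class Result {
        final List<String> parsed = new ArrayList<>();   // objects closed by 'endobj'
        final List<String> skipped = new ArrayList<>();  // objects missing 'endobj'
    }

    static Result scan(String data) {
        Result result = new Result();
        Matcher m = HEADER.matcher(data);
        int pos = 0;
        while (m.find(pos)) {
            String id = m.group(1) + " " + m.group(2);
            int bodyStart = m.end();
            // The object body ends where the next header starts, or at end of input.
            int nextHeader = m.find(bodyStart) ? m.start() : data.length();
            int endobj = data.indexOf("endobj", bodyStart);
            if (endobj >= 0 && endobj < nextHeader) {
                result.parsed.add(id);
            } else {
                // Tolerant path: note the problem and continue rather than throw.
                result.skipped.add(id);
            }
            pos = nextHeader;
        }
        return result;
    }
}
```

Given input in which object "2 0" has no 'endobj', scan() lists "2 0" under skipped and still reports the objects before and after it under parsed.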
> Read non-conforming PDFs (attached) without throwing java.io.IOException:
> expected='endobj' org.apache.pdfbox.io.PushBackInputStream
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-917
> URL: https://issues.apache.org/jira/browse/PDFBOX-917
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 1.3.1
> Environment: Used through Apache Tika 0.8
> Reporter: Alex Rodriguez Lopez
> Attachments: 2010001615.pdf
>
>
> This happened using the following PDF (~2MB):
> http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
> When reading non-conforming PDFs like the one above, the following exception
> is thrown and text extraction partially fails:
> WARN - Parsing Error, Skipping Object
> java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.pushbackinputstr...@53ab04
>     at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84)