[
https://issues.apache.org/jira/browse/PDFBOX-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988924#comment-13988924
]
Tilman Hausherr commented on PDFBOX-1122:
-----------------------------------------
I was able to parse that file with my own application, using both load() and
loadNonSeq(), after setting -Xmx3g in the 2.0 version. With the current 1.8
version, I could do it without modifications. I then downloaded the 1.8.4 app
and ran the PDFReader command, and that also worked.
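When raising the heap limit as described above, it can help to confirm that the JVM actually picked up the setting. The following is only a minimal sketch using the standard library; the class name HeapCheck is made up for illustration and is not part of PDFBox.

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        // maxMemory() reports the limit set by -Xmx (or the JVM default).
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as `java -Xmx3g HeapCheck` should report a value close to 3072 MiB; the exact figure depends on the JVM and garbage collector.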
How do you know that Apache Nutch is using 1.8.4? A look at their change log
shows this:
https://www.apache.org/dist/nutch/2.2.1/CHANGES-2.2.1.txt
"Upgrade to PDFBox 0.7.3".
And in NUTCH-1770, you write that it fails on all PDFs.
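One way to answer the version question from inside the running application is to ask the JVM where the PDFBox classes were loaded from. This is a hedged sketch using only the standard library: the class name WhichPdfBox is hypothetical, the version string is only available if the JAR manifest declares an Implementation-Version (otherwise it is null), and the org.pdfbox package name for the pre-Apache 0.7.x releases is an assumption to verify.

```java
import java.security.CodeSource;

public class WhichPdfBox {
    /** Reports the manifest version and JAR location of a class, if it is loadable. */
    static String describe(String className) {
        try {
            // Load without initializing, so static initializers are not run.
            Class<?> c = Class.forName(className, false,
                    WhichPdfBox.class.getClassLoader());
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            CodeSource src = c.getProtectionDomain().getCodeSource();
            String location = (src == null)
                    ? "unknown"
                    : String.valueOf(src.getLocation());
            return className + ": version=" + version
                    + ", loaded from " + location;
        } catch (ClassNotFoundException e) {
            return className + ": not on classpath";
        }
    }

    public static void main(String[] args) {
        // Apache-era releases (1.x and later).
        System.out.println(describe("org.apache.pdfbox.pdmodel.PDDocument"));
        // Assumed package name of the pre-Apache 0.7.x releases.
        System.out.println(describe("org.pdfbox.pdmodel.PDDocument"));
    }
}
```

Run with the same classpath as the crawler: the printed JAR location usually carries the version in its file name even when the manifest does not.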
> Parsing Error, Skipping Object
> ------------------------------
>
> Key: PDFBOX-1122
> URL: https://issues.apache.org/jira/browse/PDFBOX-1122
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.6.0
> Environment: Working with Windows 7 in eclipse.
> Reporter: Raihan Jamal
> Assignee: Andreas Lehmkühler
> Labels: pdfbox
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Parsing Error, Skipping Object
> java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@38011d45
> at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:439)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1088)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1053)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.Tika.parseToString(Tika.java:357)
> at edu.uci.ics.crawler4j.crawler.BinaryParser.parse(BinaryParser.java:37)
> at edu.uci.ics.crawler4j.crawler.WebCrawler.handleBinary(WebCrawler.java:223)
> at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:462)
> at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:129)
> at java.lang.Thread.run(Thread.java:662)
> Did not found XRef object at specified startxref position 0
> This is the sample URL where I am seeing this problem:
> http://www.qualcomm.com/documents/files/rev-b-enhanced-mobile-broadband-for-all.pdf
> Any suggestions as to why this is happening? Or is it a bug?
--
This message was sent by Atlassian JIRA
(v6.2#6252)