[
https://issues.apache.org/jira/browse/PDFBOX-982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson updated PDFBOX-982:
-------------------------------
Component/s: (was: Text extraction)
> Unable to convert valid pdf to html
> -----------------------------------
>
> Key: PDFBOX-982
> URL: https://issues.apache.org/jira/browse/PDFBOX-982
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.5.0
> Environment: Ubuntu X86_64, Win7 X86_64
> Reporter: Varun Bhansaly
> Attachments: team21_devel.pdf
>
>
> Encountered an exception while converting a pdf to HTML/ text using
> pdfbox-app-1.5.0.
> The file in this case is "team21_devel.pdf", please note this seems to be a
> valid PDF as it gets opened in adobe reader.
> I have used the command line utility as
> java -jar pdfbox-app-1.5.0.jar ExtractText -html team21_devel.pdf
> The Exception :
> ExtractText failed with the following exception:
> java.io.IOException: Expected='null' actual='nullnullnull'
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1025)
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:802)
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1011)
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:179)
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:292)
> at
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1000)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:533)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:180)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:881)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:846)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:771)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:179)
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
> Related markmail -
> http://pdfbox-users.markmail.org/message/4hu3awc4a775fczz?q=type:users+list:org%2Eapache%2Epdfbox%2Eusers&page=1
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)