[ 
https://issues.apache.org/jira/browse/PDFBOX-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952604#comment-15952604
 ] 

ASF subversion and git services commented on PDFBOX-3742:
---------------------------------------------------------

Commit 1789869 from [~tilman] in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1789869 ]

PDFBOX-3742: assume that valid non-blank sequence after EI is Q or EMC

> Unknown dir object c='>' cInt=62 peek='>' peekInt=62
> ----------------------------------------------------
>
>                 Key: PDFBOX-3742
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3742
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.5
>         Environment: Based on Tika Docker image: 
> logicalspark/docker-tikaserver
>            Reporter: Igor Santos
>         Attachments: buggy.pdf, screenshot_002.png
>
>
> This was originally stumbled upon when running a 69-page long PDF through 
> Tika. I could isolate the issue to in-between those two pages. Tika ends up 
> responding with a faulty XML, as the attached screenshot shows - together 
> with a stacktrace on the logs that includes the PDFBox exception, shown below 
> as reproduced from the standalone CLI tool.
> I'm using Tika 1.1.4, although I'm not exactly sure what version of PDFBox it 
> uses. Here's the base 
> [Dockerfile|https://github.com/LogicalSpark/docker-tikaserver/blob/master/Dockerfile].
> {code}
> $ java -jar pdfbox-app-2.0.5.jar ExtractText buggy.pdf 
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans' for 'ArialMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 150196
> Exception in thread "main" java.io.IOException: Unknown dir object c='>' 
> cInt=62 peek='>' peekInt=62 at offset 150196
>       at 
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:954)
>       at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:654)
>       at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>       at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>       at 
> org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
>       at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>       at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}
> Seems related to PDFBOX-1327.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to