[ 
https://issues.apache.org/jira/browse/PDFBOX-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952588#comment-15952588
 ] 

Tilman Hausherr commented on PDFBOX-3742:
-----------------------------------------

The issue is in the content stream of the second page: the end of the inline 
image is not recognized properly. This sequence appears near offset 150196:
{code}
#HH!EI DJ CK"CN$
{code}
An inline image ends with "EI" followed by a whitespace character. An inline 
image should also be 4 KB or less; yours is much larger.

The problem is that here, EI is part of the data and not the end of the image. 
I added a lot of heuristics to catch such "false ends", but yours has an 
additional space after "DJ", so PDFBox thought this was a two-character PDF 
operator and that the image ends there. Of course there is no "DJ" PDF operator...
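
To illustrate the kind of "false end" check described above, here is a minimal 
sketch. It is not the PDFBox code; the class name, the method names and the 
short operator list are all assumptions made for this example. It accepts an 
"EI" only if it is followed by whitespace and the next token is a recognized 
operator, a number, a name, or the end of the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not PDFBox's actual implementation.
class InlineImageEndSketch {
    // A small subset of real content stream operators (assumption: a real
    // check would cover the complete operator list).
    private static final Set<String> KNOWN_OPERATORS = new HashSet<>(Arrays.asList(
            "q", "Q", "cm", "Do", "BI", "BT", "ET", "Tf", "Tj", "TJ",
            "re", "f", "S", "n", "W", "BMC", "BDC", "EMC"));

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // True if the token after "EI" could plausibly start the next instruction.
    static boolean plausibleFollower(String token) {
        if (token.isEmpty() || KNOWN_OPERATORS.contains(token)) {
            return true; // end of data, or a recognized operator
        }
        // Operands of the next operator: a number or a name such as /Im1.
        return token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)") || token.startsWith("/");
    }

    // Returns the offset of the 'E' of the first plausible "EI", or -1.
    static int findImageEnd(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] != 'E' || data[i + 1] != 'I') {
                continue;
            }
            int p = i + 2;
            if (p < data.length && !isWhitespace(data[p])) {
                continue; // "EI" must be followed by whitespace
            }
            while (p < data.length && isWhitespace(data[p])) {
                p++; // skip the whitespace after "EI"
            }
            int tokenStart = p;
            while (p < data.length && !isWhitespace(data[p])) {
                p++; // read the next token
            }
            String token = new String(data, tokenStart, p - tokenStart,
                    StandardCharsets.ISO_8859_1);
            if (plausibleFollower(token)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The bytes from the report, followed by a made-up genuine end.
        byte[] sample = "#HH!EI DJ CK\"CN$ more EI\nQ\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findImageEnd(sample));
    }
}
```

With this sample, the first "EI" is skipped because "DJ" is not a recognized 
operator, and the later "EI" followed by "Q" is reported instead. A check that 
only looks at the shape of the next token (two letters, then whitespace) would 
stop at the first one, which is the failure described above.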

> Unknown dir object c='>' cInt=62 peek='>' peekInt=62
> ----------------------------------------------------
>
>                 Key: PDFBOX-3742
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3742
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.5
>         Environment: Based on Tika Docker image: 
> logicalspark/docker-tikaserver
>            Reporter: Igor Santos
>         Attachments: buggy.pdf, screenshot_002.png
>
>
> This was originally stumbled upon when running a 69-page long PDF through 
> Tika. I could isolate the issue to in-between those two pages. Tika ends up 
> responding with a faulty XML, as the attached screenshot shows - together 
> with a stacktrace on the logs that includes the PDFBox exception, shown below 
> as reproduced from the standalone CLI tool.
> I'm using Tika 1.1.4, although I'm not exactly sure what version of PDFBox it 
> uses. Here's the base 
> [Dockerfile|https://github.com/LogicalSpark/docker-tikaserver/blob/master/Dockerfile].
> {code}
> $ java -jar pdfbox-app-2.0.5.jar ExtractText buggy.pdf 
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans-Bold' for 'Arial-BoldMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSans' for 'ArialMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
> WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
> Apr 01, 2017 10:08:44 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
> WARNING: Corrupt object reference at offset 150196
> Exception in thread "main" java.io.IOException: Unknown dir object c='>' 
> cInt=62 peek='>' peekInt=62 at offset 150196
>       at 
> org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:954)
>       at 
> org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:654)
>       at 
> org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:175)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
>       at 
> org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
>       at 
> org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
>       at 
> org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
>       at 
> org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:237)
>       at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>       at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}
> Seems related to PDFBOX-1327.



