[jira] [Commented] (PDFBOX-5152) Content Stream Appears Truncated in Specific File

Tilman Hausherr (Jira) Thu, 08 Apr 2021 10:34:05 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317373#comment-17317373 ]


Tilman Hausherr commented on PDFBOX-5152:
-----------------------------------------

Instead of overriding processStreamOperators, the "official" way is to override 
the handlers for the individual operators. We are usually reluctant to expose 
more internals because, once something is exposed, we can't change it anymore, 
and people will use it in surprising ways.
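
To illustrate the per-operator approach in isolation, here is a minimal, 
library-free sketch. It is a toy model and not PDFBox's API: the class 
OperatorDispatchSketch, its register/process methods and the invertRgb helper 
are hypothetical names, and the tokenizer only understands numbers, names and 
bare operators. The inversion shown assumes plain DeviceRGB components in the 
range 0..1, which does not hold for every colorspace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy model only: NOT PDFBox's API. It mimics the idea of attaching one
// handler per content stream operator instead of replacing the main loop.
final class OperatorDispatchSketch {

    // Hypothetical registry: operator name -> handler receiving its operands.
    private final Map<String, Consumer<List<String>>> handlers = new HashMap<>();

    void register(String operator, Consumer<List<String>> handler) {
        handlers.put(operator, handler);
    }

    // Splits on whitespace. Numbers and names (/Foo) are collected as
    // operands; any other token is treated as an operator. Strings, arrays,
    // dictionaries and inline images are deliberately ignored here.
    void process(String contentStream) {
        List<String> operands = new ArrayList<>();
        for (String token : contentStream.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (isOperand(token)) {
                operands.add(token);
            } else {
                Consumer<List<String>> handler = handlers.get(token);
                if (handler != null) {
                    handler.accept(new ArrayList<>(operands));
                }
                operands.clear(); // operands belong to exactly one operator
            }
        }
    }

    private static boolean isOperand(String token) {
        return token.startsWith("/")
                || token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)");
    }

    // Assumption: components are DeviceRGB values in 0..1.
    static double[] invertRgb(List<String> operands) {
        double[] inverted = new double[operands.size()];
        for (int i = 0; i < inverted.length; i++) {
            inverted[i] = 1.0 - Double.parseDouble(operands.get(i));
        }
        return inverted;
    }
}
```

Registering handlers for "rg" and "Do" and feeding in the page stream quoted 
below would call the first with the three color operands and the second with 
the name /Fm0, without the caller touching the dispatch loop.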

Feel free to explain your concept on the users mailing list. Be aware that 
there are many different types of colorspaces, and some are nested.

> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
>                 Key: PDFBOX-5152
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5152
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.23
>            Reporter: Steven Fontaine
>            Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert 
> the colors of a PDF file. An 
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised that 
> provided a [PDF 
> file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf] 
> which, when parsed by pdfbox, appears to give a truncated content stream. That 
> is, running the following code results in a substantially shorter content 
> stream than I would expect:
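> {code:java}
> // Needs java.io.File, java.nio.charset.StandardCharsets,
> // org.apache.pdfbox.io.IOUtils, org.apache.pdfbox.pdmodel.PDDocument and PDPage.
> try (PDDocument doc = PDDocument.load(new File("January.pdf"))) {
>   for (PDPage page : doc.getPages()) {
>     // getContents() yields only the page's own content stream
>     String stream = new String(IOUtils.toByteArray(page.getContents()), StandardCharsets.UTF_8);
>     System.out.println(stream);
>   }
> }
> {code}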
> The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q 
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content 
> streams, but I can fairly confidently say that more than this is required to 
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox 
> is making reference to additional data that isn't contained in the content 
> stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to 
> locate specific operations and their arguments. To do so, I [wrap 
> {{PDFStreamParser.parseNextToken()}} with queries to 
> {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> I do this to get the bounds of a token in the content stream without 
> having to parse it myself, letting {{parseNextToken}} do the work for 
> me. When I look at the bounds these queries give me, they extend 
> further than the length of the content stream returned by 
> {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other 
> words, the token parsed by {{parseNextToken}} corresponds to characters 
> 19-313 (inclusive, 0-based index) of the content stream. But the content 
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!
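
One plausible reading of the mismatch above, offered as an assumption rather 
than a confirmed diagnosis: the page stream ends in "/Fm0 Do", which draws a 
form XObject that has its own content stream, so token offsets recorded while 
that nested stream is being parsed are relative to it and not to the page 
stream. The library-free sketch below models this with plain strings. The 
class StreamOffsetSketch, the TokenSpan type and the stream names are 
hypothetical, and the tokenizer simply splits on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: NOT PDFBox's API. Streams are plain strings keyed by name.
final class StreamOffsetSketch {

    // Inclusive, 0-based bounds of one token, tagged with its owning stream.
    static final class TokenSpan {
        final String stream;
        final int start;
        final int end;
        final String text;

        TokenSpan(String stream, int start, int end, String text) {
            this.stream = stream;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    private static final Pattern TOKEN = Pattern.compile("\\S+");
    private static final int MAX_DEPTH = 8; // guard against runaway nesting

    static List<TokenSpan> collect(String name, Map<String, String> streams) {
        List<TokenSpan> out = new ArrayList<>();
        walk(name, streams, out, 0);
        return out;
    }

    private static void walk(String name, Map<String, String> streams,
                             List<TokenSpan> out, int depth) {
        String content = streams.get(name);
        if (content == null || depth > MAX_DEPTH) {
            return;
        }
        Matcher m = TOKEN.matcher(content);
        String previous = null;
        while (m.find()) {
            String token = m.group();
            // Offsets are relative to the stream currently being walked.
            out.add(new TokenSpan(name, m.start(), m.end() - 1, token));
            if (token.equals("Do") && previous != null && previous.startsWith("/")) {
                // "/Name Do" descends into the named stream, if it is known.
                walk(previous.substring(1), streams, out, depth + 1);
            }
            previous = token;
        }
    }
}
```

Under this model, a span is only meaningful together with the stream it 
belongs to, so substring lookups would need to be done against the nested 
stream's bytes rather than against the page's.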



