[jira] [Commented] (PDFBOX-5152) Content Stream Appears Truncated in Specific File

Tilman Hausherr (Jira) Mon, 05 Apr 2021 23:37:12 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315277#comment-17315277
 ]


Tilman Hausherr commented on PDFBOX-5152:
-----------------------------------------

"/Fm0 Do" does more than just include a nested content stream; see section 
8.10, "Form XObjects". The PDF specification is over 700 pages long, and that 
doesn't include a few thousand more pages about fonts. A good place to start 
is Annex A, "Operator Summary":
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
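
As a rough illustration of why the page stream looks so short: the "Do" operator takes a name operand, and the content that actually gets drawn lives in the XObject registered under that name in the page's resources, not in the page stream itself. The sketch below only picks the invoked names out of a content stream string. The class and method names are made up for this example, and the whitespace split is an assumption that ignores strings, comments and inline images, so it is not a substitute for a real parser such as PDFStreamParser.

```java
import java.util.ArrayList;
import java.util.List;

class FormXObjectCalls {
    // Returns the names passed to the "Do" operator, in order of appearance.
    // Assumption: tokens are separated by whitespace only. Strings, comments
    // and inline images are NOT handled, so this is not a real PDF parser.
    static List<String> findInvokedXObjects(String contentStream) {
        List<String> names = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // "Do" is preceded by exactly one operand: the XObject's name.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The page content stream quoted in the issue description.
        String page = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q ";
        System.out.println(findInvokedXObjects(page)); // prints [Fm0]
    }
}
```

Each name found this way would then have to be looked up in the page resources, and a form XObject's own content stream (which may itself contain further "Do" calls) processed in turn.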

To see how to change content streams, see the RemoveAllText example in the 
source code download. It makes a different kind of change, but it might still 
be useful.

Your project is tricky because there are many ways to set / influence colors.
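
To make the point concrete, here is a minimal sketch that inverts only the operands of the device gray and device RGB operators (g, G, rg, RG). Everything in it is an assumption for illustration: the class and method names are hypothetical, tokens are assumed to be whitespace-separated, and operands are assumed to be plain numbers. It deliberately does nothing about k/K, cs/CS with sc/SC/scn/SCN, shadings, images, graphics state dictionaries set with "gs", or anything drawn inside a form XObject, which is exactly why the task is tricky.

```java
import java.math.BigDecimal;
import java.util.Set;

class ColorOperatorSketch {
    // Only two of the many color-setting operator families are covered here.
    static final Set<String> GRAY = Set.of("g", "G");   // 1 operand
    static final Set<String> RGB = Set.of("rg", "RG");  // 3 operands

    // Replaces each operand c of g/G/rg/RG with 1 - c.
    // Assumption: whitespace-separated tokens, numeric operands; a
    // non-numeric operand makes BigDecimal throw NumberFormatException.
    static String invertDeviceColors(String contentStream) {
        String[] t = contentStream.trim().split("\\s+");
        for (int i = 0; i < t.length; i++) {
            int operands = GRAY.contains(t[i]) ? 1 : RGB.contains(t[i]) ? 3 : 0;
            if (operands > i) {
                continue; // not enough preceding tokens; leave untouched
            }
            for (int j = i - operands; j < i; j++) {
                t[j] = BigDecimal.ONE.subtract(new BigDecimal(t[j]))
                        .stripTrailingZeros().toPlainString();
            }
        }
        return String.join(" ", t);
    }

    public static void main(String[] args) {
        String page = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q ";
        // prints: q 1 1 1 rg 1 1 1 RG /GS0 gs /Fm0 Do Q
        System.out.println(invertDeviceColors(page));
    }
}
```

Note that on the page from the issue this would change only the two black color settings; whatever /Fm0 draws, and whatever /GS0 sets up, stays as it was.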

> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
>                 Key: PDFBOX-5152
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5152
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.23
>            Reporter: Steven Fontaine
>            Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert 
> the colors of a PDF file. An 
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised, which 
> provided a [PDF 
> file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf], 
> which when parsed by pdfbox, appears to give a truncated content stream. That 
> is, running the following code results in a substantially shorter content 
> stream than I would expect:
> {code:java}
> try (PDDocument doc = PDDocument.load(/* January.pdf */)) {
>   for (PDPage page: doc.getPages()) {
>     String stream = new String(IOUtils.toByteArray(page.getContents()),
>         StandardCharsets.UTF_8);
>     System.out.println(stream);
>   }
> }
> {code}
>  The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q 
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content 
> streams, but I can fairly confidently say that more than this is required to 
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox 
> is making reference to additional data that isn't contained in the content 
> stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to 
> locate specific operations and their arguments. To do so, I [wrap 
> {{PDFStreamParser.parseNextToken()}} with queries to 
> {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> I do so to get the bounds of a token in the content stream without having 
> to parse it myself, letting {{parseNextToken}} do the work for me. When I 
> look at the bounds these queries give me, they extend further than the 
> length of the content stream returned by {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other 
> words, the token parsed by {{parseNextToken}} corresponds to characters 
> 19-313 (inclusive, 0-based index) of the content stream. But the content 
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
