[jira] [Commented] (PDFBOX-5152) Content Stream Appears Truncated in Specific File

Steven Fontaine (Jira) Mon, 05 Apr 2021 23:28:12 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315270#comment-17315270
 ]


Steven Fontaine commented on PDFBOX-5152:
-----------------------------------------

Hi, Tilman. After digging much deeper than I apparently did the first time (the 
original issue is several months old, and my memory of it was fuzzy), I've 
come to the same conclusion as you: the bulk of the PDF data is in the {{/Fm0}} 
object, which the page content stream draws with {{/Fm0 Do}}. I was unaware that 
XObjects could be used to draw so much of a document. Perhaps mentioning this 
somewhere in the pdfbox documentation would help prevent confusion like this in 
the future, although educating users about the PDF standard is obviously far 
outside the scope of pdfbox. Another interesting solution would be generating an 
"effective content stream," which essentially concatenates all content streams 
(the page's own and those of the XObjects it draws) into a single, large stream. 
But again, that seems outside the scope and isn't what this issue is about. 
Either way, I'm currently working on a fix for this in my application. Thanks 
for your time.
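
To make the "effective content stream" idea concrete, here is a minimal, 
library-free sketch. It is not PDFBox code and not how PDFBox works internally: 
the class name, the {{expand}} method, and the name-to-stream map are all 
hypothetical, tokens are split on whitespace only (a real content stream needs 
a proper parser such as {{PDFStreamParser}}), and the contents given for 
{{Fm0}} are invented, since the real stream is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: inlines "/Name Do" invocations into one flat stream.
final class EffectiveStreamSketch {

    /**
     * Replaces every "/Name Do" pair whose name is a key of formStreams with
     * that form's (recursively expanded) content, wrapped in q ... Q.
     * All other tokens are copied unchanged.
     */
    public static String expand(String content, Map<String, String> formStreams, int depth) {
        if (depth > 16) {
            // Guard against forms that reference each other in a cycle.
            throw new IllegalStateException("form XObjects nested too deeply");
        }
        String[] tokens = content.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            boolean isInvocation = tokens[i].startsWith("/")
                    && i + 1 < tokens.length
                    && tokens[i + 1].equals("Do");
            String form = isInvocation ? formStreams.get(tokens[i].substring(1)) : null;
            if (form != null) {
                out.add("q"); // approximate the state save/restore done by Do
                out.add(expand(form, formStreams, depth + 1));
                out.add("Q");
                i++; // skip the "Do" operator itself
            } else {
                out.add(tokens[i]);
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Page stream as printed in the issue; the Fm0 content is made up.
        String page = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q";
        Map<String, String> forms = Collections.singletonMap("Fm0", "0 0 m 100 100 l S");
        System.out.println(expand(page, forms, 0));
    }
}
```

The sketch ignores everything that makes the real problem hard: a form 
XObject has its own resource dictionary (so names such as {{/GS0}} can mean 
different things inside and outside the form), as well as a matrix and a 
bounding box that {{Do}} applies before painting. In PDFBox the form's stream 
would be reached through the page resources rather than a plain map.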

> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
>                 Key: PDFBOX-5152
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5152
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.23
>            Reporter: Steven Fontaine
>            Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert 
> the colors of a PDF file. An 
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised, which 
> provided a [PDF 
> file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf], 
> which, when parsed by pdfbox, appears to give a truncated content stream. That 
> is, running the following code results in a substantially shorter content 
> stream than I would expect:
> {code:java}
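> import java.io.File;
> import java.nio.charset.StandardCharsets;
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> // PDFBox 2.0.x API; the path to January.pdf is a placeholder.
> try (PDDocument doc = PDDocument.load(new File("January.pdf"))) {
>   for (PDPage page : doc.getPages()) {
>     String stream = new String(IOUtils.toByteArray(page.getContents()),
>         StandardCharsets.UTF_8);
>     System.out.println(stream);
>   }
> }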
> {code}
> The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q 
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content 
> streams, but I can fairly confidently say that more than this is required to 
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox 
> is making reference to additional data that isn't contained in the content 
> stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to 
> locate specific operations and their arguments. To do so, I [wrap 
> {{PDFStreamParser.parseNextToken()}} with queries to 
> {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> This gives me the bounds of each token in the content stream without 
> having to parse it myself ({{parseNextToken}} does the work for 
> me). The bounds these queries return, however, extend 
> beyond the length of the content stream returned by 
> {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other 
> words, the token parsed by {{parseNextToken}} corresponds to characters 
> 19-313 (inclusive, 0-based index) of the content stream. But the content 
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!



