[
https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316921#comment-17316921
]
Steven Fontaine commented on PDFBOX-5152:
-----------------------------------------
I've fixed the problem in my application, so I'll close this issue. Since you
certainly know more about the PDF spec than I do, would you be willing to let
me briefly explain my solution, so that you can glance over it and point out
any glaring problems? Also, on an unrelated note, do you happen to know whether
there would be any opposition to changing the visibility of
[org.apache.pdfbox.contentstream.PDFStreamEngine#processStreamOperators|https://github.com/apache/pdfbox/blob/66cbed8227048fcc96a5ca4c51d59f0823abe153/pdfbox/src/main/java/org/apache/pdfbox/contentstream/PDFStreamEngine.java#L482]
to {{protected}}? If not, I'm interested in opening an Improvement/Wish
requesting that, so that subclasses of {{PDFStreamEngine}} can override that
method.
Either way, thanks for your help and feedback!
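To make the visibility request concrete, here is a minimal sketch in plain Java. The classes {{Engine}} and {{SlicingEngine}} and the method {{processOperators}} are hypothetical stand-ins, not PDFBox code; they only illustrate what a {{protected}} hook would allow a subclass to do.

```java
/** Illustrative only: stand-in classes, not the PDFBox API. */
public class VisibilityDemo {

    /** Plays the role of a base engine with an overridable operator loop. */
    static class Engine {
        /** Public entry point that delegates to the overridable hook. */
        public final String process(String stream) {
            return processOperators(stream);
        }

        /** If this were private, subclasses could not replace it. */
        protected String processOperators(String stream) {
            return "default:" + stream;
        }
    }

    /** A subclass that swaps in its own operator handling. */
    static class SlicingEngine extends Engine {
        @Override
        protected String processOperators(String stream) {
            return "sliced:" + stream;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Engine().process("q Q"));
        System.out.println(new SlicingEngine().process("q Q"));
    }
}
```

With a {{private}} hook, the {{@Override}} in the subclass would not compile, and the base class would keep calling its own implementation.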
> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
> Key: PDFBOX-5152
> URL: https://issues.apache.org/jira/browse/PDFBOX-5152
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.23
> Reporter: Steven Fontaine
> Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert
> the colors of a PDF file. An
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised that
> provided a [PDF
> file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf]
> which, when parsed by pdfbox, appears to yield a truncated content stream. That
> is, running the following code produces a substantially shorter content
> stream than I would expect:
> {code:java}
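> import java.io.File;
> import java.nio.charset.StandardCharsets;
> import org.apache.pdfbox.io.IOUtils;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
>
> // Assumes January.pdf (linked above) is in the working directory.
> try (PDDocument doc = PDDocument.load(new File("January.pdf"))) {
>     for (PDPage page : doc.getPages()) {
>         String stream = new String(IOUtils.toByteArray(page.getContents()),
>                 StandardCharsets.UTF_8);
>         System.out.println(stream);
>     }
> }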
> {code}
> The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content
> streams, but I can fairly confidently say that more than this is required to
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox
> is making reference to additional data that isn't contained in the content
> stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to
> locate specific operations and their arguments. To do so, I [wrap
> {{PDFStreamParser.parseNextToken()}} with queries to
> {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> I do so in order to get the bounds of a token in the content stream without
> having to parse it myself (letting {{parseNextToken}} do the work for
> me). When I look at the bounds that these queries give me, they extend
> further than the length of the content stream returned by
> {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other
> words, the token parsed by {{parseNextToken}} corresponds to characters
> 19-313 (inclusive, 0-based index) of the content stream. But the content
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!
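One possible reading of the output above: {{Do}} is the PDF operator that paints a named XObject, so {{/Fm0 Do}} would hand the drawing off to a separate stream referenced from the page resources. That assumes {{Fm0}} is a form XObject, which this thread does not confirm. The sketch below is plain Java with a naive whitespace tokenizer. It is a stand-in for {{PDFStreamParser}}, not its implementation, and the names {{TokenBounds}}, {{tokenize}}, and {{invokedXObjects}} are hypothetical. It records inclusive, 0-based token bounds and lists the names invoked with {{Do}}.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a whitespace tokenizer, not PDFBox's PDFStreamParser. */
public class TokenBounds {

    /** A token plus its inclusive, 0-based start and end offsets. */
    public static final class Token {
        public final String text;
        public final int start;
        public final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and remembers where each token sits. */
    public static List<Token> tokenize(String stream) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            if (Character.isWhitespace(stream.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < stream.length() && !Character.isWhitespace(stream.charAt(i))) {
                i++;
            }
            tokens.add(new Token(stream.substring(start, i), start, i - 1));
        }
        return tokens;
    }

    /** Returns the names painted with Do, so "/Fm0 Do" yields "Fm0". */
    public static List<String> invokedXObjects(List<Token> tokens) {
        List<String> names = new ArrayList<>();
        for (int k = 1; k < tokens.size(); k++) {
            String previous = tokens.get(k - 1).text;
            if (tokens.get(k).text.equals("Do") && previous.startsWith("/")) {
                names.add(previous.substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String page = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q";
        List<Token> tokens = tokenize(page);
        for (Token t : tokens) {
            System.out.println(t.start + "-" + t.end + " " + t.text);
        }
        System.out.println("stream length: " + page.length());
        System.out.println("XObjects invoked: " + invokedXObjects(tokens));
    }
}
```

Under that reading, any token bound that reaches past the end of the page stream would presumably come from a different stream, such as the one behind {{Fm0}}.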
--
This message was sent by Atlassian Jira
(v8.3.4#803005)