Steven Fontaine created PDFBOX-5152:
---------------------------------------
Summary: Content Stream Appears Truncated in Specific File
Key: PDFBOX-5152
URL: https://issues.apache.org/jira/browse/PDFBOX-5152
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 2.0.23
Reporter: Steven Fontaine
I'm working on a [utility|https://github.com/acid1103/PDFInverter] that inverts
the colors of a PDF file. An
[issue|https://github.com/acid1103/PDFInverter/issues/5] was raised against it
with a [PDF
file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf] attached.
When PDFBox parses that file, the page content stream appears to be truncated.
That is, running the following code prints a substantially shorter content
stream than I would expect:
{code:java}
try (PDDocument doc = PDDocument.load(/* January.pdf */)) {
    for (PDPage page : doc.getPages()) {
        String stream = new String(IOUtils.toByteArray(page.getContents()),
                StandardCharsets.UTF_8);
        System.out.println(stream);
    }
}
{code}
The code outputs the following:
{noformat}
q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q
{noformat}
I'll admit that I don't have the strongest of understandings of PDF content
streams, but I can fairly confidently say that more than this is required to
draw page 1 of the PDF.
Additionally, you can deduce from the linked issue that, internally, pdfbox is
making reference to additional data that isn't contained in the content stream
returned from {{page.getContents()}}.
In my program, I need to find specific substrings in the content stream to
locate specific operations and their arguments. To do so, I [wrap
{{PDFStreamParser.parseNextToken()}} with queries to
{{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
This gives me the bounds of each token in the content stream without having to
parse it myself ({{parseNextToken}} does the work for me). The bounds these
queries report extend beyond the length of the content stream returned by
{{page.getContents()}}.
Specifically, one set of these bounds is (19, 313), inclusive. In other words,
the token parsed by {{parseNextToken}} corresponds to characters 19-313
(inclusive, 0-based index) of the content stream. But the content stream
returned by {{page.getContents()}} doesn't contain 313 characters.
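To make the wrapping approach concrete, here is a minimal, self-contained sketch of the idea. It is not PDFBox code: {{Tokenizer}} is a hypothetical whitespace-only stand-in for {{PDFStreamParser}}, its {{getPosition()}} stands in for {{seqSource.getPosition()}}, and the names {{TokenBoundsSketch}}, {{Bounds}}, and {{scan}} are invented for illustration. It only shows how querying the position before and after each {{parseNextToken()}} call yields inclusive, 0-based token bounds.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenBoundsSketch {

    /** A token's text with inclusive, 0-based bounds in the source string. */
    public static final class Bounds {
        public final String text;
        public final int start;
        public final int end;

        Bounds(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for a parser that exposes its read position. */
    static final class Tokenizer {
        private final String src;
        private int pos = 0;

        Tokenizer(String src) { this.src = src; }

        int getPosition() { return pos; }

        /** Reads one whitespace-delimited token, or returns null at end of input. */
        String parseNextToken() {
            while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
            if (pos >= src.length()) return null;
            int start = pos;
            while (pos < src.length() && !Character.isWhitespace(src.charAt(pos))) pos++;
            return src.substring(start, pos);
        }
    }

    /** Wraps each parseNextToken() call with position queries to record token bounds. */
    public static List<Bounds> scan(String src) {
        Tokenizer parser = new Tokenizer(src);
        List<Bounds> result = new ArrayList<>();
        while (true) {
            int before = parser.getPosition();
            String token = parser.parseNextToken();
            if (token == null) break;
            int after = parser.getPosition();
            // The position before the call may sit on whitespace; skip past it.
            int start = before;
            while (Character.isWhitespace(src.charAt(start))) start++;
            result.add(new Bounds(token, start, after - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        // The content stream printed earlier in this report.
        for (Bounds b : scan("q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q")) {
            System.out.println(b.text + " [" + b.start + ", " + b.end + "]");
        }
    }
}
```

In this sketch every reported bound necessarily falls inside the string that was scanned. That is the property that does not hold in the behavior described above.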
Hopefully someone can shed some light on this issue for me. Thanks!