[
https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319774#comment-17319774
]
Steven Fontaine commented on PDFBOX-5152:
-----------------------------------------
Hi Micheal. The lossiness of converting to RGB is something I intend to fix.
It's issue #2 of the project. When I say my process is lossless, I mean
relative to most other programs that claim to invert PDF colors. The ones
I've found (for instance ImageMagick) just convert each page of the PDF to an
image and invert that image, losing any text, embedded images, etc. I should
clarify all of this in a readme (issue #1). That being said, I intend to fix
any blending or transparency issues in the future as well. PDFInverter isn't
a very mature project yet, and it's one I plan on refining over time.
> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
> Key: PDFBOX-5152
> URL: https://issues.apache.org/jira/browse/PDFBOX-5152
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 2.0.23
> Reporter: Steven Fontaine
> Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert
> the colors of a PDF file. An
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised, which
> provided a [PDF
> file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf],
> which, when parsed by PDFBox, appears to give a truncated content stream.
> That is, running the following code results in a substantially shorter
> content stream than I would expect:
> {code:java}
> try (PDDocument doc = PDDocument.load(/* January.pdf */)) {
>     for (PDPage page : doc.getPages()) {
>         String stream = new String(IOUtils.toByteArray(page.getContents()),
>                 StandardCharsets.UTF_8);
>         System.out.println(stream);
>     }
> }
> {code}
> The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content
> streams, but I can fairly confidently say that more than this is required to
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox
> is making reference to additional data that isn't contained in the content
> stream returned from {{page.getContents()}}.
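A note on the output above: in PDF content stream syntax, `Do` paints the XObject named by the operand before it, so `/Fm0 Do` hands drawing off to an XObject called `Fm0` in the page resources. If that is a form XObject, it carries a content stream of its own, which would explain why the page-level stream is so short. The sketch below only lists the names invoked by `Do`. It is a simplified illustration, not PDFBox code: the class and method names are made up for this example, and it assumes plain whitespace-separated tokens (no strings, comments, or inline images).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox or PDFInverter.
class DoOperatorScan {

    /**
     * Returns the names of XObjects painted via the "Do" operator.
     * Assumption: tokens are separated by whitespace, which holds for the
     * short stream quoted in this issue but not for content streams in general.
     */
    public static List<String> invokedXObjects(String contentStream) {
        List<String> names = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // "Do" takes exactly one operand: a name such as /Fm0.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String stream = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q";
        System.out.println(invokedXObjects(stream)); // prints [Fm0]
    }
}
```

With PDFBox 2.0, the next step would presumably be to look up each name with `page.getResources().getXObject(COSName.getPDFName(name))` and, when the result is a `PDFormXObject`, read its contents the same way as the page's.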
> In my program, I need to find specific substrings in the content stream to
> locate specific operations and their arguments. To do so, I [wrap
> {{PDFStreamParser.parseNextToken()}} with queries to
> {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> I do this to get the bounds of each token in the content stream without
> parsing it myself, letting {{parseNextToken}} do the work for me. When I
> look at the bounds these queries give me, they extend further than the
> length of the content stream returned by {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other
> words, the token parsed by {{parseNextToken}} corresponds to characters
> 19-313 (inclusive, 0-based index) of the content stream. But the content
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
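To make the mismatch concrete, here is a minimal sketch of the same idea (recording a 0-based start and inclusive end for every token) applied to the page-level stream printed earlier. It is a stand-in written for this message, not the `PDFStreamColorSlicer` code linked above and not PDFBox's tokenizer; `TokenBounds`, `Span`, `tokenize`, and `fitsIn` are hypothetical names. It also works on `String` character indices, whereas a parser's positions are more likely byte offsets, so treat it as an approximation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox or PDFInverter.
class TokenBounds {

    /** A token with its 0-based start offset and inclusive end offset. */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace and remembers where each token sits in the stream. */
    public static List<Span> tokenize(String stream) {
        List<Span> spans = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            if (Character.isWhitespace(stream.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < stream.length() && !Character.isWhitespace(stream.charAt(i))) {
                i++;
            }
            spans.add(new Span(stream.substring(start, i), start, i - 1));
        }
        return spans;
    }

    /** True if the inclusive bounds lie inside a stream of the given length. */
    public static boolean fitsIn(int start, int endInclusive, int streamLength) {
        return 0 <= start && start <= endInclusive && endInclusive < streamLength;
    }

    public static void main(String[] args) {
        String stream = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q";
        for (Span s : tokenize(stream)) {
            System.out.println(s.text + " (" + s.start + ", " + s.end + ")");
        }
        // The bounds reported in this issue cannot refer to this stream.
        System.out.println(fitsIn(19, 313, stream.length())); // prints false
    }
}
```

The page-level stream is only 37 characters long, so bounds of (19, 313) cannot be offsets into it. One possible reading is that the parser was positioned in a different, longer stream at that point, though nothing here confirms that.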
> Hopefully someone can shed some light on this issue for me. Thanks!