Emmeran Seehuber created PDFBOX-4341:
----------------------------------------
Summary: [Patch] PNGConverter: PNG bytes to PDImageXObject converter
Key: PDFBOX-4341
URL: https://issues.apache.org/jira/browse/PDFBOX-4341
Project: PDFBox
Issue Type: Improvement
Components: Writing
Affects Versions: 2.0.12
Reporter: Emmeran Seehuber
Attachments: pngconvert_testimg.zip, pngconvert_v1.patch
The attached patch implements a PNG bytes to PDImageXObject converter. It tries
to create a PDImageXObject from the chunks of a PNG image without
recompressing it. This makes it possible to embed optimally compressed images
produced by programs like pngcrush and friends. It's also much faster than
recompressing the image.
The class PNGConverter does this in three steps:
- Parsing the PNG chunk structure from the byte array
- Validating all relevant data chunks (i.e. checking the CRC). Chunks which
are not needed (e.g. text chunks) are not validated.
- Constructing a PDImageXObject from the chunks
If an error occurs in any of these steps, or the converter detects that it
cannot map the image, it bails out and returns null. In that case the image has
to be embedded the "normal" way, by reading it with ImageIO and compressing it
again.
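The first two steps can be illustrated with a minimal sketch that uses only the
JDK. The class and method names below are hypothetical and are not taken from
the attached patch; the sketch only shows the PNG chunk layout (length, type,
data, CRC) and the "return null on any problem" convention described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch; names are illustrative and not from the patch.
final class PngChunkSketch {
    private static final byte[] SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    static final class Chunk {
        final String type; // four-letter chunk type, e.g. "IHDR"
        final int start;   // offset of the chunk data within the PNG bytes
        final int length;  // length of the chunk data
        final int crc;     // CRC as stored in the file

        Chunk(String type, int start, int length, int crc) {
            this.type = type;
            this.start = start;
            this.length = length;
            this.crc = crc;
        }
    }

    /** Returns the chunks up to and including IEND, or null if the bytes are malformed. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < SIGNATURE.length
                || !Arrays.equals(Arrays.copyOf(png, SIGNATURE.length), SIGNATURE)) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default, as in PNG
        buf.position(SIGNATURE.length);
        List<Chunk> chunks = new ArrayList<>();
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            // type (4 bytes) + data + CRC (4 bytes) must still fit
            if (length < 0 || length > buf.remaining() - 8) {
                return null;
            }
            byte[] typeBytes = new byte[4];
            buf.get(typeBytes);
            int start = buf.position();
            buf.position(start + length);
            int crc = buf.getInt();
            String type = new String(typeBytes, StandardCharsets.US_ASCII);
            chunks.add(new Chunk(type, start, length, crc));
            if (type.equals("IEND")) {
                return chunks;
            }
        }
        return null; // ran out of data before IEND
    }

    /** The PNG CRC covers the chunk type and the chunk data, but not the length field. */
    static boolean crcMatches(byte[] png, Chunk chunk) {
        CRC32 crc = new CRC32();
        crc.update(png, chunk.start - 4, chunk.length + 4);
        return (int) crc.getValue() == chunk.crc;
    }
}
```

A caller would run crcMatches only on the chunks it actually needs (IHDR, PLTE,
IDAT and so on) and skip it for text chunks, matching the second step above.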
Only these PNG image types can be converted (at least theoretically) without
recompressing the image data:
- Grayscale
- Truecolor (i.e. RGB 8-Bit/16-Bit)
- Indexed
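In PNG terms the three types above are the IHDR colour types 0, 2 and 3, while
4 and 6 carry an interleaved alpha channel. A minimal sketch of that check
follows; the class and method names are hypothetical, only the colour type
values and the IHDR byte layout come from the PNG specification.

```java
// Hypothetical sketch; names are illustrative and not from the patch.
final class PngColorTypeSketch {
    /** Reads the colour type from the 13 bytes of IHDR chunk data (byte 9). */
    static int colorTypeOf(byte[] ihdrData) {
        // Layout: width (4), height (4), bit depth (1), colour type (1), ...
        return ihdrData[9] & 0xFF;
    }

    /** True if the samples carry no interleaved alpha channel. */
    static boolean isDirectlyMappable(int colorType) {
        switch (colorType) {
            case 0: // grayscale
            case 2: // truecolor
            case 3: // indexed
                return true;
            case 4: // grayscale with alpha
            case 6: // truecolor with alpha
                return false;
            default:
                throw new IllegalArgumentException("Invalid PNG colour type: " + colorType);
        }
    }
}
```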
As soon as transparency is used it gets difficult:
- Grayscale with alpha / truecolor with alpha: The alpha channel is saved in
the image data stream, as the pixels are stored as (Gray,Alpha) or
(Red,Green,Blue,Alpha) tuples. The alpha information has to be separated out
for the SMASK image. For now you can just read and recompress such images
using the LosslessFactory.
- Indexed with alpha: Alpha and color tables are separate in the PNG, so it
should be possible to build an SMASK from the image data (which are
just the table indices) and the alpha table. I tried that with an indexed
SMASK, but Acrobat Reader does not like indexed SMASKs… One could instead
build a grayscale SMASK from the alpha table and the decompressed image index
data. This would at least save some space, as the optimized indexed image data
is still used.
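Both alpha cases can be sketched on already decompressed and unfiltered 8-bit
sample data. The helper names are hypothetical and the 8-bit restriction is an
assumption made to keep the example short; this is not the code from the patch.

```java
// Hypothetical sketch; assumes decompressed, unfiltered 8-bit samples.
final class PngAlphaSketch {
    /**
     * Pulls the last sample of every pixel out of interleaved data:
     * channels = 2 for (Gray,Alpha), channels = 4 for (Red,Green,Blue,Alpha).
     */
    static byte[] extractAlpha(byte[] pixels, int channels) {
        byte[] alpha = new byte[pixels.length / channels];
        for (int i = 0; i < alpha.length; i++) {
            alpha[i] = pixels[i * channels + channels - 1];
        }
        return alpha;
    }

    /**
     * Builds grayscale SMASK samples from palette indices and the tRNS alpha
     * table. Palette entries beyond the end of tRNS are fully opaque.
     */
    static byte[] alphaFromPalette(byte[] indices, byte[] trns) {
        byte[] alpha = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i] & 0xFF;
            alpha[i] = index < trns.length ? trns[index] : (byte) 0xFF;
        }
        return alpha;
    }
}
```

In the indexed case only the mask is newly generated; the original compressed
index data can still be embedded as the image itself.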
With the current patch only truecolor images without alpha work correctly. The
other tests, for grayscale and indexed, fail. (To run the test drivers you must
place the zipped images in the resources folder where png.png resides. These
images are "original" work done by me using Gimp, Krita and ImageOptim (on
macOS) to build the different PNG image types.)
Notes for the current patch:
- The grayscale images have the wrong gamma curve. I tried using the
ColorSpace.CS_GRAY ICC profile and the image now seems only "slightly" off
(i.e. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is given the
image is tagged with a CalGray profile, but then the colors are much further
off.
- The cHRM (chroma) chunk is read and *should* work, as I used the formulas
from the PDF spec to convert the cHRM values to the CalRGB whitepoint and
matrix. I have not yet tested this, as I have no test image with cHRM at the
moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are fine for geometric
matrices. But these methods are wrong for any other kind of matrix (e.g. color
transform matrices), as they only store/restore 6 values of the 3x3 matrix. I
deprecated PDCalRGB.setMatrix(Matrix) because of this, as it never worked and
cannot work as long as the Matrix class is for geometric use cases only. It
should also be documented on the Matrix class that it is not general purpose.
I added a PDCalRGB.setMatrix(COSArray) method so the matrix can still be set.
- The indexed image displays fine in Acrobat Reader, but the test driver fails
as PDImageXObject.getImage() returns a completely black (everything 0) image.
Strange; I suspect some error in the PDFBox image decoding.
- If an image is tagged with sRGB, the built-in Java sRGB ICC profile is
attached. Theoretically you could use a CalRGB colorspace, but using an ICC
color profile is likely faster (at least in PDFBox) and more "standard".
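For the cHRM note in the list above, here is a sketch of how the chromaticities
can be turned into a CalRGB whitepoint and a nine-value matrix. It uses the
textbook derivation (scale each primary so that the three sum to the
whitepoint) and not necessarily the exact formulas from the PDF spec that the
patch follows. All names are hypothetical. The input is assumed to be the eight
raw cHRM integers (white, red, green, blue as x,y pairs, each scaled by
100000).

```java
// Hypothetical sketch; textbook derivation, not the code from the patch.
final class ChrmToCalRgbSketch {
    /** Converts an xy chromaticity to XYZ with Y normalised to 1. */
    static double[] xyToXyz(double x, double y) {
        return new double[] {x / y, 1.0, (1.0 - x - y) / y};
    }

    /** Determinant of the 3x3 matrix whose columns are a, b and c. */
    static double det(double[] a, double[] b, double[] c) {
        return a[0] * (b[1] * c[2] - b[2] * c[1])
             - b[0] * (a[1] * c[2] - a[2] * c[1])
             + c[0] * (a[1] * b[2] - a[2] * b[1]);
    }

    /** XYZ of one cHRM pair: 0 = white, 1 = red, 2 = green, 3 = blue. */
    static double[] entry(int[] raw, int pair) {
        return xyToXyz(raw[2 * pair] / 100000.0, raw[2 * pair + 1] / 100000.0);
    }

    /** CalRGB WhitePoint as [XW, YW, ZW] with YW = 1. */
    static double[] whitePoint(int[] raw) {
        return entry(raw, 0);
    }

    /** CalRGB Matrix in the order [XA YA ZA XB YB ZB XC YC ZC]. */
    static double[] matrix(int[] raw) {
        double[] w = entry(raw, 0);
        double[] r = entry(raw, 1);
        double[] g = entry(raw, 2);
        double[] b = entry(raw, 3);
        // Cramer's rule: find scales so that sr*r + sg*g + sb*b equals w.
        double d = det(r, g, b);
        double sr = det(w, g, b) / d;
        double sg = det(r, w, b) / d;
        double sb = det(r, g, w) / d;
        return new double[] {
            sr * r[0], sr * r[1], sr * r[2],
            sg * g[0], sg * g[1], sg * g[2],
            sb * b[0], sb * b[1], sb * b[2]
        };
    }
}
```

The result is a flat array of nine numbers, which is why a six-value geometric
matrix class cannot carry it. For the sRGB primaries with a D65 whitepoint the
Y components come out at roughly 0.2126, 0.7152 and 0.0722.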
You can also look at this patch on GitHub
[https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
if you like.
It would be nice if someone could give me some hints on the colorspace
problems. I will reread the specs; maybe I have missed something.
But it would be great if someone else who knows about colorspaces could
also take a look at this.
As I have no idea how long it will take to understand why the colors are off
for grayscale and wrong for indexed, I could prepare a stripped-down version of
this patch which only contains the working parts (i.e. truecolor) and does
nothing in the cases that do not work yet. What do you think?