Emmeran Seehuber created PDFBOX-4341:
----------------------------------------

             Summary: [Patch] PNGConverter: PNG bytes to PDImageXObject 
converter
                 Key: PDFBOX-4341
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
             Project: PDFBox
          Issue Type: Improvement
          Components: Writing
    Affects Versions: 2.0.12
            Reporter: Emmeran Seehuber
         Attachments: pngconvert_testimg.zip, pngconvert_v1.patch

The attached patch implements a converter from PNG bytes to PDImageXObject. It 
tries to create a PDImageXObject from the chunks of a PNG image without 
recompressing it. This makes it possible to use programs like pngcrush and 
friends to embed optimally compressed images. It is also much faster than 
recompressing the image.

The class PNGConverter does this in three steps:
 - Parsing the PNG chunk structure from the byte array
 - Validating all relevant data chunks (i.e. checking the CRC). Chunks which 
are not needed (e.g. text chunks) are not validated.
 - Constructing a PDImageXObject from the chunks
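
The first two steps can be sketched roughly as follows. This is a minimal, 
standalone illustration and not the code from the patch: the class name 
PngChunkSketch, its methods, and the set of chunk types treated as "relevant" 
are assumptions made for the example. It relies only on the PNG chunk layout 
(length, type, data, CRC) and on the CRC covering the type and data bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.CRC32;

/** Hypothetical sketch; names are illustrative and not taken from the patch. */
final class PngChunkSketch {
    /** One PNG chunk: four-letter type, payload, and the CRC stored in the file. */
    static final class Chunk {
        final String type;
        final byte[] data;
        final int storedCrc;

        Chunk(String type, byte[] data, int storedCrc) {
            this.type = type;
            this.data = data;
            this.storedCrc = storedCrc;
        }

        /** The PNG CRC covers the four type bytes followed by the data bytes. */
        boolean crcMatches() {
            CRC32 crc = new CRC32();
            crc.update(type.getBytes(StandardCharsets.US_ASCII));
            crc.update(data);
            return (int) crc.getValue() == storedCrc;
        }
    }

    private static final byte[] SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    // Assumption: which chunk types count as "relevant" is chosen for this example.
    private static final Set<String> RELEVANT = new HashSet<>(Arrays.asList(
            "IHDR", "PLTE", "IDAT", "IEND", "tRNS", "gAMA", "cHRM", "sRGB", "iCCP"));

    /** Step 1: returns the chunk list, or null if the bytes are not a well-formed PNG. */
    static List<Chunk> parse(byte[] png) {
        if (png == null || png.length < SIGNATURE.length) {
            return null;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (png[i] != SIGNATURE[i]) {
                return null;
            }
        }
        List<Chunk> chunks = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(png, 8, png.length - 8))) {
            while (in.available() > 0) {
                int length = in.readInt();
                // After the length field: 4 type bytes + data + 4 CRC bytes must remain.
                if (length < 0 || length > in.available() - 8) {
                    return null;
                }
                byte[] type = new byte[4];
                in.readFully(type);
                byte[] data = new byte[length];
                in.readFully(data);
                int crc = in.readInt();
                chunks.add(new Chunk(new String(type, StandardCharsets.US_ASCII), data, crc));
            }
        } catch (IOException e) {
            return null; // truncated input
        }
        return chunks;
    }

    /** Step 2: checks the CRC of relevant chunks only; e.g. tEXt is skipped. */
    static boolean relevantChunksValid(List<Chunk> chunks) {
        for (Chunk c : chunks) {
            if (RELEVANT.contains(c.type) && !c.crcMatches()) {
                return false;
            }
        }
        return true;
    }
}
```

Returning null from parse() mirrors the "bail out" behaviour, so a caller can 
fall back to the ImageIO path.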

If an error occurs at any of these steps, or the converter detects that it is 
not possible to map the image, it bails out and returns null. In this case 
the image has to be embedded the "normal" way by reading it using ImageIO and 
compressing it again.

Only these PNG image types can be converted (at least theoretically) without 
recompressing the image data:
 - Grayscale
 - Truecolor (i.e. RGB 8-Bit/16-Bit)
 - Indexed
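
As a rough illustration of how this distinction could be read from the image 
itself: the PNG IHDR chunk stores the colour type in its tenth byte (0 = 
grayscale, 2 = truecolor, 3 = indexed, 4 = grayscale with alpha, 6 = truecolor 
with alpha). The class and method names below are hypothetical and not taken 
from the patch, and the sketch ignores other factors (such as a tRNS chunk) 
that may also matter.

```java
/** Hypothetical sketch; names are illustrative and not taken from the patch. */
final class PngColorTypeSketch {
    /** Reads the colour type from the 13-byte payload of an IHDR chunk. */
    static int colorType(byte[] ihdrData) {
        // Layout: width (4), height (4), bit depth (1), colour type (1), ...
        return ihdrData[9] & 0xFF;
    }

    /** True for the types whose image data has no interleaved alpha channel. */
    static boolean hasNoInterleavedAlpha(int colorType) {
        switch (colorType) {
            case 0: // grayscale
            case 2: // truecolor
            case 3: // indexed
                return true;
            case 4: // grayscale with alpha
            case 6: // truecolor with alpha
                return false;
            default:
                throw new IllegalArgumentException("Invalid PNG colour type: " + colorType);
        }
    }
}
```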

As soon as transparency is used it gets difficult:
 - Grayscale with alpha / truecolor with alpha: The alpha channel is saved in 
the image data stream, as the pixels are stored as (Gray,Alpha) or 
(Red,Green,Blue,Alpha) tuples. The alpha information has to be separated out 
for the SMASK image. At the moment you can just read such an image and 
recompress it using the LosslessFactory.
 - Indexed with alpha: Alpha and color tables are separate in the PNG, so it 
should be possible to build an SMASK from the image data (which are just the 
table indices) and the alpha table. I tried that, but Acrobat Reader does not 
like indexed SMASKs. Instead, one could build a grayscale SMASK using the 
alpha table and the decompressed image index data. This would at least save 
some space, as the optimized indexed image data is still used.
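
Both alpha cases can be sketched on plain byte arrays. This is only an 
illustration under stated assumptions, not code from the patch: the input is 
assumed to be already inflated and unfiltered, with 8 bits per sample, and the 
names PngAlphaSketch, smaskFromIndexed and splitAlpha are made up for the 
example. Palette entries beyond the end of the tRNS table are treated as fully 
opaque, as the PNG spec defines.

```java
/** Hypothetical sketch; names are illustrative and not taken from the patch. */
final class PngAlphaSketch {
    /**
     * Builds 8-bit grayscale SMASK samples for an indexed image.
     * Assumes one palette index per byte (bit depth 8), already unfiltered.
     */
    static byte[] smaskFromIndexed(byte[] indices, byte[] trns) {
        byte[] alpha = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i] & 0xFF;
            // tRNS may be shorter than the palette; missing entries are opaque.
            alpha[i] = index < trns.length ? trns[index] : (byte) 0xFF;
        }
        return alpha;
    }

    /**
     * Splits interleaved 8-bit samples into a color plane and an alpha plane.
     * channels is 2 for (Gray,Alpha) and 4 for (Red,Green,Blue,Alpha).
     * Returns {color, alpha}.
     */
    static byte[][] splitAlpha(byte[] pixels, int channels) {
        int colorChannels = channels - 1;
        int pixelCount = pixels.length / channels;
        byte[] color = new byte[pixelCount * colorChannels];
        byte[] alpha = new byte[pixelCount];
        for (int p = 0; p < pixelCount; p++) {
            System.arraycopy(pixels, p * channels, color, p * colorChannels, colorChannels);
            alpha[p] = pixels[p * channels + colorChannels];
        }
        return new byte[][] {color, alpha};
    }
}
```

In the indexed case only the SMASK needs to be compressed again; the original 
index data can stay as it is.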

With the current patch only truecolor images without alpha work correctly. The 
other tests for grayscale and indexed fail. (You must place the zipped images 
in the resources folder where png.png resides to run the test drivers. These 
images are "original" work done by me using Gimp, Krita and ImageOptim (on 
macOS) to build the different PNG image types.)

Notes for the current patch:
 - The grayscale images have the wrong gamma curve. I tried using the 
ColorSpace.CS_GRAY ICC profile and the image now seems only "slightly" off 
(i.e. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is given the 
image is tagged with a CalGray profile, but then the colors are far more off.
 - The cHRM (chromaticity) chunk is read and *should* work, as I used the 
formulas from the PDF spec to convert the cHRM values to the CalRGB whitepoint 
and matrix. I have not yet tested this, as I have no test image with cHRM at 
the moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are fine for 
geometric matrices. But these methods are wrong for any other kind of matrix 
(e.g. color transform matrices), as they only store/restore 6 values of the 
3x3 matrix. I deprecated PDCalRGB.setMatrix(Matrix) because of this, as it 
never worked and cannot work as long as the Matrix class is for geometric use 
cases only. It should also be documented on the Matrix class that it is not 
general purpose. I added a PDCalRGB.setMatrix(COSArray) method so that the 
full matrix can be set.
 - The indexed image displays fine in Acrobat Reader, but the test driver fails 
as PDImageXObject.getImage() returns a completely black (everything 0) image. 
Strange; I suspect some error in the PDFBox image decoding.
 - If an image is tagged with sRGB, the built-in Java sRGB ICC profile is 
attached. Theoretically you can use a CalRGB colorspace, but using an ICC color 
profile is likely faster (at least in PDFBox) and more "standard".

You can also look at this patch on GitHub 
[https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
 if you like.

It would be nice if someone could give me some hints on the colorspace 
problems. I will reread the specs; maybe I have missed something. But it would 
be great if someone else who knows about colorspaces could also take a look at 
this.

As I have no idea how long it will take to understand why the colors are off 
for grayscale and wrong for indexed, I could prepare a stripped-down version of 
this patch which only contains the working parts (i.e. truecolor) and simply 
does nothing in the cases that do not work. What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
