[ 
https://issues.apache.org/jira/browse/PDFBOX-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmeran Seehuber updated PDFBOX-4341:
-------------------------------------
    Attachment: image-2018-10-25-09-29-47-251.png

> [Patch] PNGConverter: PNG bytes to PDImageXObject converter
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4341
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.12
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>         Attachments: 001229.png, 001230.png, 008528.png, 014431.png, 
> 016289.png, 017012.png, 017030.png, 017063.png, 017084.png, 
> image-2018-10-25-09-29-47-251.png, pngconvert_testimg.zip, 
> pngconvert_v1.patch, pngconvert_v2.patch
>
>
> The attached patch implements a PNG bytes to PDImageXObject converter. It 
> tries to create a PDImageXObject from the chunks of a PNG image without 
> recompressing it. This makes it possible to use tools such as pngcrush to 
> embed optimally compressed images. It is also much faster than recompressing 
> the image.
> The class PNGConverter does this in three steps:
>  - Parsing the PNG chunk structure from the byte array
>  - Validating all relevant data chunks (i.e. checking the CRC). Chunks which 
> are not needed (e.g. text chunks) are not validated.
>  - Constructing a PDImageXObject from the chunks
> If an error occurs at any of these steps, or the converter detects that the 
> image cannot be mapped, it bails out and returns null. In that case the 
> image has to be embedded the "normal" way, by reading it with ImageIO and 
> compressing it again.
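The first two steps can be sketched without any PDFBox classes. The following is a minimal illustration, not the code from the patch: the class and method names are hypothetical, and it relies only on the PNG file layout (an 8-byte signature followed by chunks of length, type, data, and a CRC-32 computed over type and data). The CRC is checked on demand, so a caller can validate only the chunks it needs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch; names are illustrative and not taken from the patch.
final class PngChunkSketch {
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    static final class Chunk {
        final String type;
        final byte[] data;
        final int storedCrc;

        Chunk(String type, byte[] data, int storedCrc) {
            this.type = type;
            this.data = data;
            this.storedCrc = storedCrc;
        }

        /** Recomputes the CRC-32 over type + data and compares it to the stored value. */
        boolean crcValid() {
            CRC32 crc = new CRC32();
            crc.update(type.getBytes(StandardCharsets.US_ASCII));
            crc.update(data);
            return (int) crc.getValue() == storedCrc;
        }
    }

    /** Returns the chunks in file order, or null if the structure is malformed. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < SIGNATURE.length
                || !Arrays.equals(Arrays.copyOf(png, SIGNATURE.length), SIGNATURE)) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default, as in PNG
        buf.position(SIGNATURE.length);
        List<Chunk> chunks = new ArrayList<>();
        while (buf.remaining() >= 12) { // length + type + CRC of an empty chunk
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining() - 8) {
                return null; // truncated or corrupt chunk
            }
            byte[] type = new byte[4];
            buf.get(type);
            byte[] data = new byte[length];
            buf.get(data);
            int storedCrc = buf.getInt();
            String name = new String(type, StandardCharsets.US_ASCII);
            chunks.add(new Chunk(name, data, storedCrc));
            if (name.equals("IEND")) {
                return chunks;
            }
        }
        return null; // no IEND chunk found
    }
}
```

A caller following the bail-out behaviour described above would treat a null result, or a failed crcValid() on a chunk it needs (IHDR, PLTE, IDAT and so on), as the signal to fall back to ImageIO.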
> Only these PNG image types can be converted (at least theoretically) without 
> recompressing the image data:
>  - Grayscale
>  - Truecolor (i.e. RGB 8-Bit/16-Bit)
>  - Indexed
> As soon as transparency is used, it gets difficult:
>  - Grayscale with alpha / truecolor with alpha: The alpha channel is stored 
> in the image data stream, as the pixels are (Gray,Alpha) or 
> (Red,Green,Blue,Alpha) tuples. The alpha information has to be separated out 
> for the SMASK image. At the moment you can only read such an image and 
> recompress it using the LosslessFactory.
>  - Indexed with alpha: Alpha and color tables are separate in the PNG, so 
> it should be possible to build a grayscale SMASK from the image data (which 
> is just the table indices) and the alpha table. I tried that, but Acrobat 
> Reader does not like indexed SMASKs… One could just build a grayscale SMASK 
> using the alpha table and the decompressed image index data. This would at 
> least save some space, as the optimized indexed image data is still used.
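To make the two transparency cases concrete, here is a small sketch of the byte shuffling involved. It is an illustration under stated assumptions, not code from the patch: the helper names are hypothetical, samples are assumed to be 8 bits wide, and the input is assumed to be already decompressed with the PNG scanline filters removed. Palette entries beyond the end of the tRNS table are treated as fully opaque, as the PNG specification defines.

```java
// Hypothetical helpers; names are illustrative and not taken from the patch.
final class PngAlphaSketch {
    /**
     * Splits interleaved 8-bit samples with a trailing alpha sample into
     * {color, alpha}. Use channels = 2 for (Gray,Alpha) and 4 for
     * (Red,Green,Blue,Alpha).
     */
    static byte[][] splitAlpha(byte[] pixels, int channels) {
        int count = pixels.length / channels;
        int colorChannels = channels - 1;
        byte[] color = new byte[count * colorChannels];
        byte[] alpha = new byte[count];
        for (int p = 0; p < count; p++) {
            System.arraycopy(pixels, p * channels, color, p * colorChannels, colorChannels);
            alpha[p] = pixels[p * channels + colorChannels];
        }
        return new byte[][] {color, alpha};
    }

    /**
     * Builds 8-bit grayscale mask samples for an indexed image from its
     * palette indices (one byte per pixel) and the tRNS alpha table.
     */
    static byte[] indexedMask(byte[] indices, byte[] trns) {
        byte[] mask = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i] & 0xFF;
            // Palette entries without a tRNS entry are fully opaque.
            mask[i] = index < trns.length ? trns[index] : (byte) 0xFF;
        }
        return mask;
    }
}
```

In both cases the resulting alpha bytes would still have to be compressed and attached as a DeviceGray SMASK image. For the indexed case, the original compressed index data could stay untouched.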
> With the current patch, only truecolor images without alpha work correctly. 
> The other tests, for grayscale and indexed, fail. (To run the test drivers, 
> you must place the zipped images in the resources folder where png.png 
> resides. These images are "original" work done by me using Gimp, Krita and 
> ImageOptim (on macOS) to build the different PNG image types.)
> Notes for the current patch:
>  - The grayscale images have the wrong gamma curve. I tried using the 
> ColorSpace.CS_GRAY ICC profile, and the image now seems only "slightly" off 
> (e.g. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is present, 
> the image is tagged with a CalGray profile, but then the colors are much 
> further off.
>  - The cHRM (chromaticities) chunk is read and *should* work, as I used the 
> formulas from the PDF spec to convert the cHRM values to the CalRGB 
> whitepoint and matrix. I have not tested this yet, as I have no test image 
> with cHRM at the moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are 
> fine for geometric matrices, but these methods are wrong for any other kind 
> of matrix (e.g. color transform matrices), as they only store/restore 6 
> values of the 3x3 matrix. I deprecated PDCalRGB.setMatrix(Matrix) because of 
> this, as it never worked and cannot work as long as the Matrix class is for 
> geometric use cases only. The documentation of the Matrix class should also 
> state that it is not general purpose. I added a 
> PDCalRGB.setMatrix(COSArray) method to allow setting the matrix.
>  - The indexed image displays fine in Acrobat Reader, but the test driver 
> fails because PDImageXObject.getImage() returns a completely black 
> (everything 0) image. Strange; I suspect an error in the PDFBox image 
> decoding.
>  - If an image is tagged with sRGB, the built-in Java sRGB ICC profile is 
> attached. Theoretically you could use a CalRGB colorspace, but using an ICC 
> color profile is likely faster (at least in PDFBox) and more "standard".
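For the cHRM note, the following sketch shows one way to get from chromaticities to the nine values of a CalRGB Matrix entry. It uses the common textbook derivation (scale each primary so that the three together reproduce the white point, solved here with Cramer's rule), which is not necessarily the formula set used in the patch. The class and method names are hypothetical. The input is assumed to be the eight cHRM values in file order (white x, white y, red x, red y, green x, green y, blue x, blue y), already divided by 100000. The output is in the order of the PDF Matrix array [XA YA ZA XB YB ZB XC YC ZC], which is why all nine values have to be stored.

```java
// Hypothetical sketch; not the formulas or names used in the patch.
final class ChrmToCalRgbSketch {
    /** Converts an xy chromaticity to XYZ with Y normalized to 1. */
    static double[] xyz(double x, double y) {
        return new double[] {x / y, 1.0, (1.0 - x - y) / y};
    }

    static double det(double[][] m) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    }

    /**
     * xy = {xW, yW, xR, yR, xG, yG, xB, yB}.
     * Returns {XA, YA, ZA, XB, YB, ZB, XC, YC, ZC}.
     */
    static double[] matrix(double[] xy) {
        double[] white = xyz(xy[0], xy[1]);
        double[][] primaries = {xyz(xy[2], xy[3]), xyz(xy[4], xy[5]), xyz(xy[6], xy[7])};
        // Linear system: sum over k of scale[k] * primaries[k][row] = white[row].
        double[][] m = new double[3][3];
        for (int row = 0; row < 3; row++) {
            for (int k = 0; k < 3; k++) {
                m[row][k] = primaries[k][row];
            }
        }
        double d = det(m);
        double[] out = new double[9];
        for (int k = 0; k < 3; k++) {
            // Cramer's rule: replace column k with the white point.
            double[][] mk = new double[3][3];
            for (int row = 0; row < 3; row++) {
                mk[row] = m[row].clone();
                mk[row][k] = white[row];
            }
            double scale = det(mk) / d;
            for (int row = 0; row < 3; row++) {
                out[3 * k + row] = scale * primaries[k][row];
            }
        }
        return out;
    }
}
```

With the sRGB primaries and a D65 white point, the YA, YB and YC entries come out close to the familiar luminance weights 0.2126, 0.7152 and 0.0722, which is a convenient sanity check for a test image carrying a cHRM chunk.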
> You can also look at this patch on GitHub 
> [https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
>  if you like.
> It would be nice if someone could give me some hints on the colorspace 
> problems. I will reread the specs; maybe I have missed something. But it 
> would be great if someone else who knows about colorspaces could also take 
> a look at this.
> As I have no idea how long it will take to understand why the colors are off 
> for grayscale and wrong for indexed, I could prepare a stripped-down version 
> of this patch, which only contains the working parts (i.e. truecolor) and 
> does nothing in the cases that do not work yet. What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
