[jira] [Comment Edited] (PDFBOX-4341) [Patch] PNGConverter: PNG bytes to PDImageXObject converter

Emmeran Seehuber (JIRA) Sun, 21 Oct 2018 22:54:09 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658346#comment-16658346
 ]


Emmeran Seehuber edited comment on PDFBOX-4341 at 10/22/18 5:53 AM:
--------------------------------------------------------------------

No problem, I know there is also always that strange thing called "real life" :)
 * PDCalRGB: setMatrix(): fixed it using getValues()
 * Intent: You are right, I missed that completely; the values in the PDF are 
of course different from the integers in the PNG... To be honest, the intent is 
only relevant for PDF readers/printers that output the image on devices whose 
gamut does not cover all of sRGB (nowadays mostly laser printers; ink printers 
can usually handle sRGB without a problem). And as soon as an ICC profile is 
specified, at least the Java color management uses the intent defined in the 
ICC profile. In Java you have no other option than to patch the ICC profile 
bytes if you want a different intent than the one in the ICC profile... But 
setting the rendering intent on the image should not hurt in any way.
 * The "black" indexed image was a "stale PDImageXObject state" bug: the object 
has to reread the image/colorspace as soon as you call setColorSpace() on it. 
I.e. PDImageXObject.setColorSpace() has to clear the colorSpace and cachedImage 
fields, and I've added that. You can't just assign the colorspace parameter to 
the colorSpace field, as the PDColorSpace used for writing does not have all 
the state initialized that is needed to read an image...
 * I stripped down the patch and fixed some bugs on the way ... the usual "how 
could that work in the first place?!" stuff :)
 ** TrueColor/sRGB without transparency works
 ** When a gamma value of 1/2.2 is specified it is just ignored, because that's 
the gamma of sRGB.
 ** Indexed images work with and without transparency. I've updated the test 
images and have test files for 1, 2, 4 and 8 bit indexed images with 
transparency. This means that if a user optimizes a PNG image and it has no 
more than 256 color/transparency combinations, the optimizer will usually 
convert it to an indexed image, and that is kept, at least for the image data. 
The SMASK has to be rebuilt as a grayscale image, as Acrobat Reader does not 
like indexed SMASKs. But this will usually still be smaller and faster than 
recompressing the image.
 ** So "only" grayscale and grayscale/TrueColor with transparency are not 
handled. For grayscale I have no idea how to map the colorspace, and when 
transparency is combined in the image data we simply cannot map that in the 
PDF without completely recoding it. Also, TrueColor with a chroma chunk but 
without an ICC profile is not handled, as I have no idea how to map that color 
space.
 ** I've updated the image zip, as I had to add some additional test images for 
the indexed image case.
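
The stale-state fix described above is, in essence, cache invalidation in a 
setter. A minimal sketch of the idea, using a hypothetical stand-in class 
(CachedImageHolder and its String fields are assumptions made for 
illustration; this is not the PDImageXObject code from the patch):

```java
/** Hypothetical stand-in for an object that lazily caches read-side state. */
class CachedImageHolder {
    private String writtenColorSpace = "DeviceRGB"; // stands in for the dictionary entry
    private String colorSpace;                      // lazily resolved read-side object
    private String cachedImage;                     // lazily decoded image
    int decodeCount;                                // how often a decode really happened

    void setColorSpace(String cs) {
        writtenColorSpace = cs;
        // Drop everything derived from the old value so it is rebuilt on demand.
        colorSpace = null;
        cachedImage = null;
    }

    String getColorSpace() {
        if (colorSpace == null) {
            colorSpace = "resolved:" + writtenColorSpace;
        }
        return colorSpace;
    }

    String getImage() {
        if (cachedImage == null) {
            decodeCount++;
            cachedImage = "pixels decoded with " + getColorSpace();
        }
        return cachedImage;
    }
}
```

Without the two null assignments, getImage() would keep returning the image 
decoded with the previous colorspace after setColorSpace() was called.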

By digitalcorpora, do you mean the govdoc files? I can run a test over them, 
but I can't yet say when I will have time for it.


> [Patch] PNGConverter: PNG bytes to PDImageXObject converter
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4341
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.12
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>         Attachments: pngconvert_testimg.zip, pngconvert_v1.patch, 
> pngconvert_v2.patch
>
>
> The attached patch implements a PNG bytes to PDImageXObject converter. It 
> tries to create a PDImageXObject from the chunks of a PNG image, without 
> recompressing it. This allows using programs like pngcrush and friends to 
> embed optimally compressed images. It’s also way faster than recompressing 
> the image.
> The class PNGConverter does this in three steps:
>  - Parsing the PNG chunk structure from the byte array
>  - Validating all relevant data chunks (i.e. checking the CRC). Chunks which 
> are not needed (e.g. text chunks) are not validated.
>  - Constructing a PDImageXObject from the chunks
> If an error occurs at any of these steps, or the converter detects that it 
> cannot map the image, it bails out and returns null. In this case the image 
> has to be embedded the „normal“ way by reading it using ImageIO and 
> compressing it again.
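
The first two steps (parsing the chunk structure, then checking the CRC of the 
relevant chunks) can be sketched as follows. This is a minimal illustration 
under stated assumptions, not the PNGConverter code from the patch: the class 
and method names are hypothetical, and "relevant" is simplified to "everything 
except the three text chunk types".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical sketch of steps 1 and 2; names are not taken from the patch. */
class PngChunkSketch {
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    static final class Chunk {
        final String type;  // four ASCII letters, e.g. "IHDR"
        final byte[] png;   // the whole file; chunk data is referenced, not copied
        final int start;    // offset of the chunk data within png
        final int length;   // length of the chunk data
        final int crc;      // CRC stored after the data

        Chunk(String type, byte[] png, int start, int length, int crc) {
            this.type = type;
            this.png = png;
            this.start = start;
            this.length = length;
            this.crc = crc;
        }

        boolean crcValid() {
            CRC32 c = new CRC32();
            c.update(png, start - 4, length + 4); // CRC covers type + data
            return (int) c.getValue() == crc;
        }
    }

    /** Returns the chunk list, or null if the byte array is not a well-formed PNG. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < 8 || !Arrays.equals(Arrays.copyOf(png, 8), SIGNATURE)) {
            return null;
        }
        List<Chunk> chunks = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default, like PNG
        buf.position(8);
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining() - 8) {
                return null; // truncated or corrupt chunk
            }
            byte[] type = new byte[4];
            buf.get(type);
            int start = buf.position();
            buf.position(start + length);
            int crc = buf.getInt();
            chunks.add(new Chunk(new String(type, StandardCharsets.US_ASCII),
                    png, start, length, crc));
        }
        return buf.hasRemaining() ? null : chunks;
    }

    /** Checks the CRC of every chunk except text chunks, which are skipped. */
    static boolean relevantChunksValid(List<Chunk> chunks) {
        if (chunks == null) {
            return false;
        }
        for (Chunk c : chunks) {
            boolean isText = c.type.equals("tEXt") || c.type.equals("zTXt")
                    || c.type.equals("iTXt");
            if (!isText && !c.crcValid()) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would fall back to the ImageIO path whenever parse() returns null or 
relevantChunksValid() returns false.
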
> Only these PNG image types can be converted (at least theoretically) without 
> recompressing the image data:
>  - Grayscale
>  - Truecolor (i.e. RGB 8-Bit/16-Bit)
>  - Indexed
> As soon as transparency is used it gets difficult:
>  - Grayscale with alpha / truecolor with alpha: The alpha channel is saved in 
> the image data stream, as the pixels are stored as (Gray,Alpha) or 
> (Red,Green,Blue,Alpha) tuples. You have to separate the alpha information for 
> the SMASK image. At the moment you can only read and recompress such images 
> using the LosslessFactory.
>  - Indexed with alpha: Alpha and color tables are separate in the PNG, so 
> it should be possible to build a grayscale SMASK from the image data (which 
> are just the table indices) and the alpha table. Tried that, but Acrobat 
> Reader does not like indexed SMASKs… One could instead build a grayscale SMASK 
> using the alpha table and the decompressed image index data. This would at 
> least save some space, as the optimized indexed image data is still used.
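
Building a grayscale SMASK from the alpha table and the decompressed index 
data, as suggested for the indexed case, could look roughly like this. It is a 
sketch under assumptions, not the code from the patch: the class and method 
names are hypothetical, the input is assumed to be already inflated and 
unfiltered (one packed row per scanline, non-interlaced), and palette entries 
beyond the end of the alpha (tRNS) table are treated as fully opaque, as the 
PNG spec defines.

```java
/** Hypothetical sketch; expects inflated, unfiltered, non-interlaced index data. */
class IndexedAlphaSketch {
    /**
     * @param packedIndices palette indices, packed bitDepth bits per pixel,
     *                      each row padded to a whole byte
     * @param bitDepth      1, 2, 4 or 8
     * @param trns          alpha table; may be shorter than the palette
     * @return one 8-bit gray sample per pixel (0 = transparent, 255 = opaque)
     */
    static byte[] buildSoftMask(byte[] packedIndices, int width, int height,
                                int bitDepth, byte[] trns) {
        int rowBytes = (width * bitDepth + 7) / 8;
        int mask = (1 << bitDepth) - 1;
        byte[] alpha = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int bitPos = x * bitDepth;
                int b = packedIndices[y * rowBytes + bitPos / 8] & 0xFF;
                // PNG stores the leftmost pixel in the high-order bits.
                int shift = 8 - bitDepth - (bitPos % 8);
                int index = (b >> shift) & mask;
                alpha[y * width + x] = index < trns.length ? trns[index] : (byte) 0xFF;
            }
        }
        return alpha;
    }
}
```

The resulting byte array would still have to be compressed and wrapped in an 
8-bit DeviceGray image to serve as the SMASK; the indexed color data itself 
stays untouched.
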
> With the current patch only truecolor images without alpha work correctly. 
> The other tests for grayscale and indexed fail. (You must place the zipped 
> images in the resources folder where png.png resides to run the test drivers. 
> These images are „original“ work done by me using Gimp, Krita and ImageOptim 
> (on macOS) to build the different PNG image types.)
> Notes for the current patch:
>  - The grayscale images have the wrong gamma curve. I tried using the 
> ColorSpace.CS_GRAY ICC profile and the image now seems only „slightly“ off 
> (i.e. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is given, the 
> image is tagged with a CalGray profile, but then the colors are way more off.
>  - The cHRM (chroma) chunk is read and *should* work, as I used the formulas 
> from the PDF spec to convert the cHRM values to the CalRGB whitepoint and 
> matrix. I have not yet tested this, as I have no test image with cHRM at the 
> moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are fine for geometric 
> matrices. But these methods are wrong for any other kind of matrix (e.g. 
> color transform matrices), as they only store/restore 6 values of the 3x3 
> matrix. I deprecated PDCalRGB.setMatrix(Matrix) because of this, as it never 
> worked and cannot work as long as the Matrix class is for geometric use 
> cases only. It should also be documented on the Matrix class that it is 
> not general purpose. I added a PDCalRGB.setMatrix(COSArray) method to allow 
> setting the matrix.
>  - The indexed image displays fine in Acrobat Reader, but the test driver 
> fails as PDImageXObject.getImage() returns a completely black (everything 0) 
> image. Strange, I suspect some error in the PDFBox image decoding.
>  - If an image is tagged with sRGB, the built-in Java sRGB ICC profile is 
> attached. Theoretically you can use a CalRGB colorspace, but using an ICC 
> color profile is likely faster (at least in PDFBox) and more „standard“.
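
To illustrate why a CalRGB matrix needs all nine values, here is a sketch of 
one way to derive the whitepoint and matrix from cHRM values. It is an 
assumption-laden illustration, not the patch code: it uses the standard 
linear-algebra derivation (scale the primaries so that they sum to the 
whitepoint) instead of the formulas from the PDF spec that the patch uses, and 
the class and method names are hypothetical. The cHRM values are taken in 
chunk order (white x, white y, red x, red y, green x, green y, blue x, blue y), 
each scaled by 100000.

```java
/** Hypothetical sketch; a standard derivation, not the PDF spec formulas. */
class ChrmToCalRgbSketch {
    /** Returns [XW, YW, ZW] with YW normalized to 1. */
    static double[] whitePoint(int[] chrm) {
        double xw = chrm[0] / 100000.0;
        double yw = chrm[1] / 100000.0;
        return new double[] {xw / yw, 1.0, (1 - xw - yw) / yw};
    }

    /** Returns the nine values [XA YA ZA XB YB ZB XC YC ZC]. */
    static double[] matrix(int[] chrm) {
        double[] w = whitePoint(chrm);
        double[][] p = new double[3][]; // unscaled XYZ of R, G, B with Y = 1
        for (int i = 0; i < 3; i++) {
            double x = chrm[2 + 2 * i] / 100000.0;
            double y = chrm[3 + 2 * i] / 100000.0;
            p[i] = new double[] {x / y, 1.0, (1 - x - y) / y};
        }
        // Cramer's rule: find scales s so that s0*p0 + s1*p1 + s2*p2 = w.
        double det = det(p[0], p[1], p[2]);
        double[] s = {
            det(w, p[1], p[2]) / det,
            det(p[0], w, p[2]) / det,
            det(p[0], p[1], w) / det
        };
        double[] m = new double[9];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                m[3 * i + j] = s[i] * p[i][j];
            }
        }
        return m;
    }

    /** Determinant of the 3x3 matrix whose columns are a, b and c. */
    static double det(double[] a, double[] b, double[] c) {
        return a[0] * (b[1] * c[2] - b[2] * c[1])
             - b[0] * (a[1] * c[2] - a[2] * c[1])
             + c[0] * (a[1] * b[2] - a[2] * b[1]);
    }
}
```

All nine entries of the result are generally non-zero, so a class that only 
stores six values cannot carry such a matrix.
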
> You can also look at this patch on GitHub 
> [https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
>  if you like.
> It would be nice if someone could give me some hints on the colorspace 
> problems. I will reread the specs, maybe I have missed something. But it 
> would be great if someone else who knows about colorspaces could also take a 
> look at this.
> As I have no idea how long it will take to understand why the colors are off 
> for grayscale and wrong for indexed, I could prepare a stripped-down version 
> of this patch, which only contains the working stuff (i.e. truecolor) and 
> does nothing in the cases that don't work. What do you think?



