[ 
https://issues.apache.org/jira/browse/PDFBOX-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16663396#comment-16663396
 ] 

Emmeran Seehuber commented on PDFBOX-4341:
------------------------------------------

Sorry, I won't have time in the next few days to look into this. But this 
result does not surprise me, since the images are far from perfectly compressed:

!image-2018-10-25-09-29-47-251.png|height=200!

Using special tools like [ImageOptim|https://github.com/ImageOptim/ImageOptim], 
which uses four different PNG crunchers under the hood, you can get 
near-optimal compression. You just have to press the "Repeat" button multiple 
times, until you get a cross instead of the checkmark icon. And of course this 
takes time. I've attached [^optimized.zip] with the images optimized.

The main benefit of this patch is embedding speed, as it will (if possible) 
embed the image as is. If the image is perfectly optimized then you get the 
perfectly optimized image in the PDF, and if the image is poorly compressed you 
get exactly that.

The LosslessFactory-predictor based encoder does a good "baseline" job with 
compression. It won't be able to reach the compression possible with ImageOptim 
and similar tools, but it also won't take ages.

The thing to test on the patch here is not that it compresses the image better 
than the LosslessFactory, because it does not compress the image at all. It's 
rather about embedding speed and correctness. If the PNGConverter converts a 
PNG image, the embedded image should still be the same as the original and be 
correct.
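
A minimal sketch of such a correctness check, using only java.awt classes: 
decode the original PNG (e.g. with ImageIO), fetch the embedded image via 
PDImageXObject.getImage(), and compare both pixel by pixel. The class and 
method names (ImageCompare, samePixels) are hypothetical and not part of the 
patch.

```java
import java.awt.image.BufferedImage;

final class ImageCompare {
    /**
     * Returns true if both images have the same dimensions and every
     * pixel has the same ARGB value. Hypothetical helper, not from the patch.
     */
    static boolean samePixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB() converts to default sRGB ARGB, so different
                // internal image types can still be compared.
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

An exact comparison like this will flag even off-by-one differences such as 
FFD6D6D6 vs FFD7D7D7, so a per-channel tolerance may be needed for color 
managed images.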

I also think the idea of PDImageXObject.createFromByteArray() is to save 
loading and recompression, isn't it?

Regarding downloading govdocs: Instead of streaming them you should just 
download them first. I think you hit a server connection timeout. 
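
A minimal sketch of the download-first approach, assuming plain java.nio is 
enough: copy the remote file completely to disk, then work on the local copy, 
so a slow processing step cannot stall the connection. The class name 
DownloadFirst and its parameters are hypothetical; the actual govdocs URL is 
whatever your test already uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class DownloadFirst {
    /**
     * Copies the whole resource behind 'source' to 'target' before any
     * processing starts, and returns the local path.
     */
    static Path download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```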

I'll try to put together a benchmark to better show the benefits, but as I 
already wrote, I won't have time in the next few days.

> [Patch] PNGConverter: PNG bytes to PDImageXObject converter
> -----------------------------------------------------------
>
>                 Key: PDFBOX-4341
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4341
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.12
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>         Attachments: 001229.png, 001230.png, 008528.png, 014431.png, 
> 016289.png, 017012.png, 017030.png, 017063.png, 017084.png, 
> image-2018-10-25-09-29-47-251.png, optimized.zip, pngconvert_testimg.zip, 
> pngconvert_v1.patch, pngconvert_v2.patch
>
>
> The attached patch implements a PNG bytes to PDImageXObject converter. It 
> tries to create a PDImageXObject from the chunks of a PNG image, without 
> recompressing it. This allows using programs like pngcrush and friends to 
> embed optimally compressed images. It’s also way faster than recompressing 
> the image.
> The class PNGConverter does this in three steps:
>  - Parsing the PNG chunk structure from the byte array
>  - Validating all relevant data chunks (i.e. checking the CRC). Chunks which 
> are not needed (e.g. text chunks) are not validated.
>  - Constructing a PDImageXObject from the chunks
> If an error occurs at any of these steps, or the converter detects that it 
> cannot map the image, it will bail out and return null. In this case 
> the image has to be embedded the „normal“ way by reading it using ImageIO and 
> compressing it again.
> Only these PNG image types can be converted (at least theoretically) without 
> recompressing the image data:
>  - Grayscale
>  - Truecolor (i.e. RGB 8-Bit/16-Bit)
>  - Indexed
> As soon as transparency is used it gets difficult:
>  - Grayscale with alpha / truecolor with alpha: The alpha channel is saved in 
> the image data stream, as the pixels are stored as (Gray,Alpha) or 
> (Red,Green,Blue,Alpha) tuples. You have to separate the alpha information for 
> the SMASK image. At the moment you can just read and recompress it using the 
> LosslessFactory.
>  - Indexed with alpha: Alpha and color tables are separate in the PNG, so 
> it should be possible to build a grayscale SMASK from the image data (which 
> are just the table indices) and the alpha table. I tried that, but Acrobat 
> Reader does not like indexed SMASKs… One could just build a grayscale SMASK 
> using the alpha table and the decompressed image index data. This would at 
> least save some space, as the optimized indexed image data is still used.
> With the current patch only truecolor images without alpha work correctly. 
> The other tests for grayscale and indexed fail. (You must place the zipped 
> images in the resources folder where png.png resides to run the test drivers. 
> These images are „original“ work done by me using Gimp, Krita and ImageOptim 
> (on macOS) to build the different PNG image types.)
> Notes for the current patch:
>  - The grayscale images have the wrong gamma curve. I tried using the 
> ColorSpace.CS_GRAY ICC profile and the image seems now only „slightly“ off 
> (i.e. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is given the 
> image is tagged with a CalGray profile, but the colors are way more off then.
>  - The cHRM (chroma) chunk is read and *should* work, as I used the formulas 
> from the PDF spec to convert the cHRM values to the CalRGB whitepoint and 
> matrix. I have not yet tested this, as I have no test image with cHRM at the 
> moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are fine for geometric 
> matrices. But these methods are wrong for any other kind of matrix (i.e. color 
> transform matrices), as they only store/restore 6 values of the 3x3 matrix. I 
> deprecated PDCalRGB.setMatrix(Matrix) because of this, as it never worked 
> and cannot work as long as the Matrix class is for geometric use 
> cases only. It should also be documented on the Matrix class that it is 
> not general purpose. I added a PDCalRGB.setMatrix(COSArray) method to allow 
> setting the matrix.
>  - The indexed image displays fine in Acrobat Reader, but the test driver 
> fails as PDImageXObject.getImage() returns a completely black (everything 0) 
> image. Strange, I suspect some error in the PDFBox image decoding.
>  - If an image is tagged with sRGB, the builtin Java sRGB ICC profile is 
> attached. Theoretically you can use a CalRGB colorspace, but using an ICC 
> color profile is likely faster (at least in PDFBox) and more „standard“.
> You can also look at this patch on GitHub 
> [https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1]
>  if you like.
> It would be nice if someone could give me some hints with the colorspace 
> problems. I will try to reread the specs again, maybe I have missed 
> something. But it would be great if someone else who has an idea about 
> colorspaces could also take a look into this.
> As I have no idea how long it takes to understand why the colors are off for 
> grayscale and wrong for indexed, I could prepare a stripped down version of 
> this patch, which only contains the working stuff (i.e. truecolor), and would 
> just do nothing in the non-working cases. What do you think?
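
The first two steps from the quoted description (parsing the chunk structure 
and checking the CRC) can be sketched with the JDK alone. This is an 
illustration, not the code from the patch: the names PngChunkSketch, Chunk and 
parse are hypothetical, and unlike the patch it checks the CRC of every chunk 
instead of only the relevant ones. Returning null on malformed input mirrors 
the "bail out" behaviour described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class PngChunkSketch {
    // The fixed 8-byte PNG file signature, read as one big-endian long.
    private static final long PNG_SIGNATURE = 0x89504E470D0A1A0AL;

    static final class Chunk {
        final String type;    // four ASCII letters, e.g. "IHDR", "IDAT"
        final int dataOffset; // start of the chunk data in the byte array
        final int length;     // length of the chunk data
        final boolean crcOk;  // stored CRC matches CRC over type + data

        Chunk(String type, int dataOffset, int length, boolean crcOk) {
            this.type = type;
            this.dataOffset = dataOffset;
            this.length = length;
            this.crcOk = crcOk;
        }
    }

    /** Returns the chunks up to IEND, or null if the bytes are not parseable. */
    static List<Chunk> parse(byte[] png) {
        if (png == null || png.length < 8) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(png); // big-endian by default
        if (buf.getLong() != PNG_SIGNATURE) {
            return null;
        }
        List<Chunk> chunks = new ArrayList<>();
        // Each chunk: 4 bytes length, 4 bytes type, data, 4 bytes CRC.
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining() - 8) {
                return null; // truncated or corrupt chunk
            }
            int typeOffset = buf.position();
            String type = new String(png, typeOffset, 4, StandardCharsets.US_ASCII);
            CRC32 crc = new CRC32();
            crc.update(png, typeOffset, 4 + length); // CRC covers type and data
            buf.position(typeOffset + 4 + length);
            int storedCrc = buf.getInt();
            chunks.add(new Chunk(type, typeOffset + 4, length,
                    (int) crc.getValue() == storedCrc));
            if ("IEND".equals(type)) {
                break;
            }
        }
        return chunks;
    }
}
```

A caller would then fall back to the ImageIO path whenever parse() returns 
null or a needed chunk has crcOk == false.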



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

