[ 
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530179#comment-16530179
 ] 

Emmeran Seehuber commented on PDFBOX-4184:
------------------------------------------

[~tilman] Regarding estCompressSum() and chooseDataRowToWrite(): This is an 
"oracle" (i.e. a heuristic) that tries to pick the "best" row to write. The 
idea is to choose the row representation (i.e. each byte subtracted from the 
byte in the row above, subtracted from the byte on the left, and so on) whose 
values are mostly 0, or at least very small. This reduces the number of 
distinct values in the Huffman tree of the ZIP compression, which allows for 
better compression, and it also makes it more likely that the same value is 
repeated and can be run-length encoded (RLE). Gradients compress "perfectly" 
under such a scheme.
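The heuristic the two methods implement is essentially the classic PNG "minimum sum of absolute differences" filter selection. A minimal sketch (the method names mirror the ones discussed above, but the signatures and surrounding class are illustrative, not PDFBox's actual code):

```java
// Sketch of the "minimum sum of absolute differences" heuristic that
// estCompressSum()/chooseDataRowToWrite() roughly correspond to.
// Class and signatures here are illustrative, not PDFBox's actual API.
public class PredictorHeuristic {

    // Estimate how well a filtered row will compress: sum the absolute
    // values of the filtered bytes (interpreted as signed deltas).
    // Rows that are mostly 0 score lowest and compress best.
    static long estCompressSum(byte[] filteredRow) {
        long sum = 0;
        for (byte b : filteredRow) {
            sum += Math.abs(b);
        }
        return sum;
    }

    // Given the same row filtered with each of the five PNG predictors
    // (None, Sub, Up, Average, Paeth), pick the candidate with the
    // smallest estimated sum.
    static byte[] chooseDataRowToWrite(byte[][] candidateRows) {
        byte[] best = candidateRows[0];
        long bestSum = estCompressSum(best);
        for (int i = 1; i < candidateRows.length; i++) {
            long sum = estCompressSum(candidateRows[i]);
            if (sum < bestSum) {
                bestSum = sum;
                best = candidateRows[i];
            }
        }
        return best;
    }
}
```

This is only an estimate of the deflated size, which is exactly why it can miss the truly optimal choice.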

The default algorithm is not perfect and can miss the best possible 
combination of rows. But finding the best combination would mean trying all 
different row encodings, i.e. 5^<height of image> combinations. Tools like 
[pngcrush|https://pmt.sourceforge.io/pngcrush/] try to do this using a more or 
less brute-force search. But that takes time... so it is not suitable for a 
generic image writer.

I'm fine with adding heuristics to decide when to use which encoder. But you 
can spend ages getting this right, and you will always find cases where the 
heuristic is wrong... The overall idea of this encoder is to get better 
compression for *most* cases, especially when using zip compression level 9. 

When modifying estCompressSum() we should run a test on the govdocs corpus and 
record the sizes (e.g. in a text file), then change the method and record the 
sizes again after the change. Then we could compare whether the change really 
improves overall compression... I won't have time to look into that this 
week.
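The comparison step could be as simple as parsing the two recorded size files and summing the per-file differences. A sketch, assuming a "name<TAB>bytes" line format (my own convention, not something PDFBox defines):

```java
// Sketch of the before/after comparison described above: record the
// encoded size per govdocs file once before and once after changing
// estCompressSum(), then sum the per-file differences. The file format
// and helper names are my own, not part of PDFBox.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizeComparison {

    // Parse "name<TAB>bytes" lines as written by the recording run.
    static Map<String, Long> parseSizes(List<String> lines) {
        Map<String, Long> sizes = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            sizes.put(parts[0], Long.parseLong(parts[1]));
        }
        return sizes;
    }

    // Positive result: the modified heuristic saved that many bytes
    // over the whole corpus.
    static long savings(Map<String, Long> before, Map<String, Long> after) {
        long sum = 0;
        for (Map.Entry<String, Long> e : before.entrySet()) {
            Long b = after.get(e.getKey());
            if (b != null) {
                sum += e.getValue() - b;
            }
        }
        return sum;
    }
}
```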

If I understand the PDF spec correctly, it is not possible to write an indexed 
image - which would be very nice for small, icon-like images... Regarding the 
impact on openhtmltopdf, it's difficult to say. I have a project where I use 
openhtmltopdf to generate reports containing tons of photo images, so this 
change is going to improve the file size there. Also, if you care about file 
size, you should ensure that the compression level is always set to 9.

Maybe we should really add a heuristic like "if the image is smaller than e.g. 
50x50 pixels _and_ it is default sRGB, just encode it without a predictor 
using the old sRGB path". For such small images we could also take the 
brute-force route and encode the image with both methods, then choose the 
smaller result.
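The brute-force idea for small images can be sketched like this (the Encoder interface is a placeholder for the two encoding paths, not an existing PDFBox type):

```java
// Sketch of the brute-force approach for small images: encode with both
// strategies and keep the smaller result. The Encoder interface is an
// illustrative placeholder, not PDFBox's actual API.
import java.awt.image.BufferedImage;

public class SmallImageChooser {

    interface Encoder {
        byte[] encode(BufferedImage image);
    }

    // For images below the size threshold, run both the legacy sRGB
    // encoder and the predictor-based encoder and pick whichever
    // produces fewer bytes.
    static byte[] encodeSmallest(BufferedImage image,
                                 Encoder legacyEncoder,
                                 Encoder predictorEncoder) {
        byte[] legacy = legacyEncoder.encode(image);
        byte[] predicted = predictorEncoder.encode(image);
        return legacy.length <= predicted.length ? legacy : predicted;
    }
}
```

Encoding a tiny image twice is cheap, so the extra CPU cost is negligible compared to doing this for every image.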

Regarding testCreateLosslessFromImageCMYK(): If the image data is nearly 
identical but not exactly the same, it is likely due to rounding errors in the 
color conversion. Of course, this should not happen, but it also depends on 
how the image color is converted to sRGB. It would be awesome (not only for 
this test, but also for other stuff) if PDImageXObject had a getRawImage() (or 
similarly named) method which returned the BufferedImage in whatever 
colorspace it has, so that CMYK images would simply be returned as CMYK images 
and not converted to sRGB.

I'm also thinking about adding a method 
LosslessFactory.createFromByteArray(PDDocument,byte[]) which would try to 
sniff the image type. It could use JPEGFactory.createFromByteArray() for 
JPEGs and could try to directly reuse the IDAT chunk of PNGs. If it could not 
encode the image, e.g. because the PNG has an indexed color encoding, it would 
return null, so that the caller knows they have to load the image and encode 
it from the BufferedImage. This would speed up the PDF write time if the user 
already has the image encoded, and it would allow precompressing images with 
external tools like pngcrush and benefiting from that compression. But that's 
a different issue.
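The sniffing step of such a hypothetical factory method could start with a plain magic-byte check (the class and enum below are illustrative, not an existing PDFBox API):

```java
// Sketch of the image-type sniffing a hypothetical
// LosslessFactory.createFromByteArray(PDDocument, byte[]) could start
// with: inspect the magic bytes to tell JPEG from PNG. The enum and
// method names are illustrative, not an existing PDFBox API.
public class ImageSniffer {

    enum ImageType { JPEG, PNG, UNKNOWN }

    static ImageType sniff(byte[] data) {
        // JPEG files start with the SOI marker FF D8.
        if (data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8) {
            return ImageType.JPEG;
        }
        // PNG files start with the fixed 8-byte signature
        // 89 50 4E 47 0D 0A 1A 0A ("\x89PNG\r\n\x1a\n").
        byte[] pngSig = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        if (data.length >= pngSig.length) {
            boolean isPng = true;
            for (int i = 0; i < pngSig.length; i++) {
                if (data[i] != pngSig[i]) {
                    isPng = false;
                    break;
                }
            }
            if (isPng) {
                return ImageType.PNG;
            }
        }
        return ImageType.UNKNOWN;
    }
}
```

For the PNG case the factory would then still have to parse the IHDR chunk to reject color types it cannot pass through (e.g. indexed color) and return null for those.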

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-4184
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4184
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.9
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>             Fix For: 2.0.12, 3.0.0 PDFBox
>
>         Attachments: 16bit.png, LoadGovdocs.java, 
> lossless_predictor_based_imageencoding.patch, 
> lossless_predictor_based_imageencoding_v2.patch, 
> lossless_predictor_based_imageencoding_v3.patch, 
> lossless_predictor_based_imageencoding_v4.patch, 
> lossless_predictor_based_imageencoding_v5.patch, 
> lossless_predictor_based_imageencoding_v6.patch, 
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf, 
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support to write 16 bit per component images 
> correctly. I've integrated a test for this here: 
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-Bit TYPE_CUSTOM with DataType == USHORT images - but this 
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvements when writing lossless images, as 
> the images are currently not efficiently encoded. I.e. you could use PNG 
> encodings to get a better compression. (By adding a COSName.DECODE_PARMS with 
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is 
> something for a later patch. It would also need another API, as there is a 
> tradeoff speed vs compression ratio. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
