[
https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530179#comment-16530179
]
Emmeran Seehuber commented on PDFBOX-4184:
------------------------------------------
[~tilman] Regarding estCompressSum() and chooseDataRowToWrite(): This is an
"oracle" (i.e. a heuristic) that tries to pick the "best" row to write. The idea
is to choose the row representation (i.e. subtracting the byte in the row above,
subtracting the byte on the left, and so on) whose values are mostly 0 or at
least very small. This reduces the number of distinct values in the Huffman tree
of the ZIP compression, which allows for better compression, and it also makes
it more likely that the same value repeats and gets run-length encoded (RLE).
Gradients compress "perfectly" with such a scheme.
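The per-row cost estimate described here resembles the PNG specification's classic "minimum sum of absolute differences" heuristic. A minimal sketch of that idea (the class and method names are illustrative, not PDFBox's actual estCompressSum()):

```java
public class FilterCost {
    /**
     * Estimate how well a filtered row will compress: lower is better.
     * Rows dominated by zeros (or values near zero) produce a small
     * Huffman alphabet and long runs, which Deflate compresses well.
     */
    static long estimateCost(byte[] filteredRow) {
        long sum = 0;
        for (byte b : filteredRow) {
            // Treat the byte as signed so that both +1 and -1 deltas
            // count as "close to zero" (the PNG spec's heuristic).
            sum += Math.abs(b);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] flat = {0, 0, 0, 1, 0};          // mostly zeros -> cheap
        byte[] noisy = {90, -120, 37, -5, 66};  // high entropy -> expensive
        System.out.println(estimateCost(flat) < estimateCost(noisy)); // true
    }
}
```

A writer would compute this cost once per candidate predictor per row and keep the cheapest row, which is linear in the image height.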
The default algorithm is not perfect and misses the best possible combination
of rows. But finding the best combination would mean trying all possible row
encodings, i.e. 5^<height of image> combinations (five predictor choices per
row). Tools like [pngcrush|https://pmt.sourceforge.io/pngcrush/] try to do this
using a more or less brute-force search. But that takes time... so it is not
suitable for a generic image writer.
I'm fine with adding heuristics to decide when to use which encoder. But you
can spend ages getting this right, and you will always find cases where the
heuristic is wrong... The overall idea of this encoder is to get better
compression in *most* cases, especially with ZIP compression level 9.
When modifying estCompressSum() we should run a test on the govdocs corpus and
record the sizes (e.g. in a text file), then change the method and record the
sizes again. Then we could compare whether the change really improves the
overall compression... I won't have time to look into that this week.
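The size-recording step could be as simple as walking the corpus output directory and writing one "<name> <bytes>" line per file, so the before/after runs can be diffed. A toy sketch (paths and names are illustrative):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RecordSizes {
    /** Append "<filename> <size in bytes>" lines for every file in corpusDir. */
    static void recordSizes(Path corpusDir, Path outFile) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(outFile));
             DirectoryStream<Path> files = Files.newDirectoryStream(corpusDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    out.println(f.getFileName() + " " + Files.size(f));
                }
            }
        }
    }
}
```

Running this once before and once after the estCompressSum() change gives two text files whose per-document sizes can be compared directly.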
If I understand the PDF spec correctly, it is not possible to write an indexed
image - which would be very nice for small, icon-like images... Regarding the
impact on openhtmltopdf, it's difficult to say. I have a project where I use
openhtmltopdf to generate reports containing tons of photo images, so this
change is going to improve the file size there. Also, if you care about file
size, you should ensure that the compression level is always set to 9.
Maybe we should really add a heuristic like "if the image is smaller than e.g.
50x50 pixels _and_ it is default sRGB, just encode it without a predictor using
the old sRGB path". For such small images we could also go the brute-force way
and encode with both methods, then choose the smaller result.
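The "try both, keep the smaller" idea for tiny images could look roughly like this; the threshold, class, and method names are hypothetical, not a proposed final API:

```java
public class SmallImageHeuristic {
    /** Illustrative threshold taken from the 50x50 example above. */
    static final int SMALL_EDGE = 50;

    /** For small default-sRGB images, brute-forcing both encodings is cheap. */
    static boolean shouldTryBothEncodings(int width, int height, boolean isDefaultSRGB) {
        return isDefaultSRGB && width <= SMALL_EDGE && height <= SMALL_EDGE;
    }

    /** Encode with both paths up front, then simply keep the smaller stream. */
    static byte[] pickSmaller(byte[] predictorEncoded, byte[] plainEncoded) {
        return predictorEncoded.length <= plainEncoded.length
                ? predictorEncoded : plainEncoded;
    }
}
```

The extra CPU cost is bounded by the image size, which is exactly why the brute-force route only makes sense below a small threshold.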
Regarding testCreateLosslessFromImageCMYK(): If the image data is nearly
identical but not exactly the same, it's likely rounding errors from the color
conversion. Of course this should not happen, but it also depends on how the
image color is converted to sRGB. It would be awesome (not only for this test,
but also for other things) if PDImageXObject had a getRawImage() (or similarly
named) method that returned the BufferedImage in whatever colorspace it has, so
that CMYK images would be returned as CMYK images and not converted to sRGB.
I'm also thinking about adding a method
LosslessFactory.createFromByteArray(PDDocument, byte[]) which would try to
sniff the image type. It could use JPEGFactory.createFromByteArray() for JPEGs
and could try to reuse the IDAT chunk of PNGs directly. If it could not encode
the image because the PNG uses e.g. an indexed color encoding, it would return
null, so that the user knows he has to load the image and encode it from the
BufferedImage. This would speed up PDF writing if the user already has the
image encoded, and it would allow precompressing images with external tools
like pngcrush and benefiting from that compression. But that's a different
issue.
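Sniffing the container format only needs the first few bytes: PNG files start with a fixed 8-byte signature, JPEG files with the 0xFFD8 SOI marker. A self-contained sketch (the class and enum names are made up for illustration):

```java
import java.util.Arrays;

public class ImageSniffer {
    enum Format { PNG, JPEG, UNKNOWN }

    /** Fixed 8-byte PNG signature: \x89 P N G \r \n \x1A \n */
    private static final byte[] PNG_SIG = {
        (byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'
    };

    static Format sniff(byte[] data) {
        if (data.length >= 8 && Arrays.equals(Arrays.copyOf(data, 8), PNG_SIG)) {
            return Format.PNG;
        }
        if (data.length >= 2
                && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8) {
            return Format.JPEG; // JPEG Start-Of-Image marker
        }
        return Format.UNKNOWN;
    }
}
```

A factory method like the one proposed above could dispatch on this result and return null for anything it cannot pass through losslessly.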
> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -----------------------------------------------------------------
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
> Issue Type: Improvement
> Components: Writing
> Affects Versions: 2.0.9
> Reporter: Emmeran Seehuber
> Priority: Minor
> Fix For: 2.0.12, 3.0.0 PDFBox
>
> Attachments: 16bit.png, LoadGovdocs.java,
> lossless_predictor_based_imageencoding.patch,
> lossless_predictor_based_imageencoding_v2.patch,
> lossless_predictor_based_imageencoding_v3.patch,
> lossless_predictor_based_imageencoding_v4.patch,
> lossless_predictor_based_imageencoding_v5.patch,
> lossless_predictor_based_imageencoding_v6.patch,
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf,
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support for writing 16 bit per component images
> correctly. I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT - but this
> is what you usually get when reading a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as
> the images are currently not efficiently encoded. I.e. you could use PNG
> encodings to get better compression (by adding a COSName.DECODE_PARMS with
> COSName.PREDICTOR == 15 and encoding the images as PNG). But this is
> something for a later patch. It would also need another API, as there is a
> tradeoff between speed and compression ratio.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)