[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473018#comment-16473018 ]
Emmeran Seehuber commented on PDFBOX-4184:
------------------------------------------

The Govdocs corpus is a little bit big ... I'll let those tests run in the office on Monday, as my iMac there is faster at processing that many documents...

Regarding directly using the DeflaterOutputStream: I do this to be able to *stream* compress the image data, so that the image data is compressed row by row. This leads to less memory being used while compressing and to better CPU cache usage, as the data of one row is still in cache when it's fed to zip. The alternative is to first encode the image into one big byte buffer (which doubles the memory needed for the image) and then compress it at the end.

Of course, when constructing a DeflaterOutputStream it should use the Filter.SYSPROP_DEFLATELEVEL setting. I've refactored the code for this into its own method, Filter.getCompressionLevel(). See the updated patch.

[^lossless_predictor_based_imageencoding_v2.patch] - This is still work in progress, not to be committed yet (I need to analyze those image mismatches in the Govdocs corpus first)

> [PATCH]: Support simple lossless compression of 16 bit RGB images
> -----------------------------------------------------------------
>
>                 Key: PDFBOX-4184
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4184
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Writing
>    Affects Versions: 2.0.9
>            Reporter: Emmeran Seehuber
>            Priority: Minor
>             Fix For: 2.0.10, 3.0.0 PDFBox
>
>         Attachments: LoadGovdocs.java,
> lossless_predictor_based_imageencoding.patch,
> lossless_predictor_based_imageencoding_v2.patch,
> pdfbox_support_16bit_image_write.patch, png16-arrow-bad-no-smask.pdf,
> png16-arrow-bad.pdf, png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
>
> The attached patch adds support to write 16 bit per component images
> correctly.
> I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT - but this
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as
> the images are currently not efficiently encoded. I.e. you could use PNG
> encodings to get better compression (by adding a COSName.DECODE_PARMS with
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is
> something for a later patch. It would also need another API, as there is a
> tradeoff between speed and compression ratio.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
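
For readers of the archive, the row-by-row compression described in the comment can be sketched roughly as follows. This is not the code from the patch: the class RowStreamingDeflate, the RowSource callback and the helper names are made up for illustration, and the compression level is passed in as a plain int (in the patch it would presumably come from Filter.getCompressionLevel()). Only java.util.zip classes from the JDK are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class RowStreamingDeflate {

    /** Hypothetical callback: fills rowBuffer with the encoded bytes of row y. */
    public interface RowSource {
        void readRow(int y, byte[] rowBuffer);
    }

    /**
     * Compresses an image one row at a time. Only a single row buffer is held
     * in memory instead of a second full copy of the image.
     * Note: closing the DeflaterOutputStream also closes the target stream.
     */
    public static void compressRows(RowSource source, int height, int bytesPerRow,
                                    int level, OutputStream target) throws IOException {
        Deflater deflater = new Deflater(level);
        byte[] row = new byte[bytesPerRow];
        try (DeflaterOutputStream zip = new DeflaterOutputStream(target, deflater)) {
            for (int y = 0; y < height; y++) {
                source.readRow(y, row); // encode one row ...
                zip.write(row);         // ... and feed it to zip right away
            }
        } finally {
            // A Deflater passed in explicitly is not released by close().
            deflater.end();
        }
    }

    /** Helper for checking the result: inflates a compressed byte array. */
    public static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int width = 64;
        int height = 32;
        int bytesPerRow = width * 3 * 2; // RGB, 16 bit per component

        // Synthetic gradient standing in for real image rows.
        RowSource gradient = (y, row) -> {
            for (int i = 0; i < row.length; i++) {
                row[i] = (byte) ((y + i / 6) & 0xFF);
            }
        };

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        compressRows(gradient, height, bytesPerRow, Deflater.DEFAULT_COMPRESSION, compressed);

        System.out.println("raw bytes:        " + (height * bytesPerRow));
        System.out.println("compressed bytes: " + compressed.size());
        System.out.println("inflated bytes:   " + inflate(compressed.toByteArray()).length);
    }
}
```

The working set here is one row buffer plus the internal zip buffers, which is the memory and cache argument made in the comment. A buffer-first approach would need the whole encoded image in memory before compression starts.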