[jira] [Commented] (PDFBOX-4186) Add quality option for compressed images to pdfbox-app
[ https://issues.apache.org/jira/browse/PDFBOX-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429831#comment-16429831 ]

Martin Hausner commented on PDFBOX-4186:

Thanks! Thank you for the superfast improvement :)

> Add quality option for compressed images to pdfbox-app
>
> Key: PDFBOX-4186
> URL: https://issues.apache.org/jira/browse/PDFBOX-4186
> Project: PDFBox
> Issue Type: Improvement
> Components: Utilities
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Martin Hausner
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.10, 3.0.0 PDFBox
> Attachments: pdfbox-tool.patch
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Add a command-line *quality* option for compressed images to pdfbox-app,
> e.g. -quality 0.75
> see [^pdfbox-tool.patch]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
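For readers unfamiliar with what a quality value such as 0.75 controls, here is a minimal sketch using only the JDK's ImageIO API. It is not the attached patch and does not show how pdfbox-app wires up the option; the class and method names (`JpegQualitySketch`, `encodeJpeg`, `gradient`) are hypothetical and exist only for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

final class JpegQualitySketch {

    /** Encodes an RGB image as JPEG; quality ranges from 0.0 (smallest) to 1.0 (best). */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // Explicit mode is required before a quality value may be set.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Builds a small RGB test image containing a colour gradient. */
    static BufferedImage gradient(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = (x * 255) / Math.max(1, width - 1);
                int g = (y * 255) / Math.max(1, height - 1);
                image.setRGB(x, y, (r << 16) | (g << 8) | 0x40);
            }
        }
        return image;
    }
}
```

Calling `JpegQualitySketch.encodeJpeg(image, 0.75f)` corresponds in spirit to `-quality 0.75`; lower values generally trade image fidelity for smaller output.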
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429822#comment-16429822 ]

Emmeran Seehuber commented on PDFBOX-4184:

Oh yes, you are right. And I totally overlooked that the getRGB() used always converts into sRGB ...

I already do colorspace tagging in [https://github.com/rototor/pdfbox-graphics2d/blob/master/src/main/java/de/rototor/pdfbox/graphics2d/PdfBoxGraphics2DLosslessImageEncoder.java]

{code:java}
/*
 * Do we have a color profile we need to embed?
 */
if (bi.getColorModel().getColorSpace() instanceof ICC_ColorSpace) {
    ICC_Profile profile = ((ICC_ColorSpace) bi.getColorModel().getColorSpace()).getProfile();
    /*
     * Only tag a profile if it is not the default sRGB profile.
     */
    if (((ICC_ColorSpace) bi.getColorModel().getColorSpace()).getProfile() != ICC_Profile
            .getInstance(ColorSpace.CS_sRGB)) {
        SoftReference<PDICCBased> pdProfileRef = profileMap.get(new ProfileSoftReference(profile));
        PDICCBased pdProfile = pdProfileRef == null ? null : pdProfileRef.get();
        if (pdProfile == null) {
            pdProfile = new PDICCBased(document);
            OutputStream outputStream = pdProfile.getPDStream()
                    .createOutputStream(COSName.FLATE_DECODE);
            outputStream.write(profile.getData());
            outputStream.close();
            pdProfile.getPDStream().getCOSObject().setInt(COSName.N, profile.getNumComponents());
            profileMap.put(new ProfileSoftReference(profile), new SoftReference<PDICCBased>(pdProfile));
        }
        imageXObject.setColorSpace(pdProfile);
    }
}
{code}

which is of course stupid if the color always gets converted to sRGB. It's not only stupid, but also wrong, because it causes color shifts ... argh

So at the moment PDFBox is not usable for any "real" prepress stuff, as the sRGB colorspace is way too small. (At the moment I still use iText 2.1 for my prepress stuff, but I want to get rid of it in the long term.) sRGB as used at the moment in the LosslessFactory is fine for web / display-only PDFs. But for prepress not so much.

Hmm, I should really try to find some time to implement an "ImageEncoderFactory" and implement all the different encodings correctly (which are mostly 8-bit and 16-bit images; everything with less bit depth is likely fine with getRGB() as now. And of course not only encode RGB but also encode CMYK...). (No, I won't use any code of iText; they have tons of special hacks to e.g. reuse already encoded PNG data etc., which I think is not worth the effort and way too complex / too much code.)

I have a factory with an API like this in mind (everything with method chaining):

{code:java}
ImageEncoder myEncoder = ImageEncoderFactory.newBuilder(pdDocument)
        // Lossy / JPEG quality 0.9
        .jpeg(0.9)
        // or lossless
        .lossless()
        // Lossless compression the fast way, with a not so great compression ratio like at the moment
        .fastCompression()
        // Lossless compression the slow way, with maximum possible compression ratio (using predictors etc.)
        .slowCompression()
        // Set conversion to sRGB 8-bit. Default would be to always use the color space / ICC profile of the image.
        .toSRGB()
        // and finally
        .build();
PDImage pdImg = myEncoder.encode(img);
PDImage pdImg2 = myEncoder.encode(img2);
// ... reuse myEncoder as much as possible, but not multithreaded
{code}

What do you think?

> [PATCH]: Support simple lossless compression of 16 bit RGB images
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
> Issue Type: Improvement
> Components: Writing
> Affects Versions: 2.0.9
> Reporter: Emmeran Seehuber
> Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
> Attachments: pdfbox_support_16bit_image_write.patch,
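The difference between reading raster samples directly and going through getRGB() can be illustrated with plain JDK classes. The following is only a sketch under stated assumptions: it builds a 16-bit-per-component image in the default sRGB color space (so no ICC conversion is exercised), and the names `RasterVersusGetRgb`, `newRgb16`, `rawSample` and `packedRed` are hypothetical.

```java
import java.awt.Transparency;
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.ComponentColorModel;
import java.awt.image.DataBuffer;
import java.awt.image.WritableRaster;

final class RasterVersusGetRgb {

    /** Creates an opaque RGB image with 16 bits per component, stored as unsigned shorts. */
    static BufferedImage newRgb16(int width, int height) {
        ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        ColorModel cm = new ComponentColorModel(cs, new int[] {16, 16, 16},
                false, false, Transparency.OPAQUE, DataBuffer.TYPE_USHORT);
        WritableRaster raster = cm.createCompatibleWritableRaster(width, height);
        return new BufferedImage(cm, raster, false, null);
    }

    /** Reads the stored sample without any color conversion (range 0..65535 here). */
    static int rawSample(BufferedImage image, int x, int y, int band) {
        return image.getRaster().getSample(x, y, band);
    }

    /** Reads the red channel via getRGB(), which always yields 8-bit sRGB values. */
    static int packedRed(BufferedImage image, int x, int y) {
        return (image.getRGB(x, y) >> 16) & 0xFF;
    }
}
```

The raster path keeps the full 16-bit precision but leaves color-space handling to the caller, whereas getRGB() handles conversion and discards everything beyond 8-bit sRGB. That is the trade-off discussed in this thread.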
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429719#comment-16429719 ]

Tilman Hausherr commented on PDFBOX-4184:

I wonder if the patch code is correct - it takes the raster values directly without doing any conversions for ICC colorspaces.

> [PATCH]: Support simple lossless compression of 16 bit RGB images
>
> Key: PDFBOX-4184
> URL: https://issues.apache.org/jira/browse/PDFBOX-4184
> Project: PDFBox
> Issue Type: Improvement
> Components: Writing
> Affects Versions: 2.0.9
> Reporter: Emmeran Seehuber
> Priority: Minor
> Fix For: 2.0.10, 3.0.0 PDFBox
> Attachments: pdfbox_support_16bit_image_write.patch,
> png16-arrow-bad-no-smask.pdf, png16-arrow-bad.pdf,
> png16-arrow-good-no-mask.pdf, png16-arrow-good.pdf
>
> The attached patch adds support for writing 16 bit per component images
> correctly. I've integrated a test for this here:
> [https://github.com/rototor/pdfbox-graphics2d/commit/8bf089cb74945bd4f0f15054754f51dd5b361fe9]
> It only supports 16-bit TYPE_CUSTOM images with DataType == USHORT, but this
> is what you usually get when you read a 16 bit PNG file.
> This would also fix [https://github.com/danfickle/openhtmltopdf/issues/173].
> The patch is against 2.0.9, but should apply to 3.0.0 too.
> There is still some room for improvement when writing lossless images, as
> the images are currently not efficiently encoded. I.e. you could use PNG
> encodings to get better compression (by adding a COSName.DECODE_PARMS with
> a COSName.PREDICTOR == 15 and encoding the images as PNG). But this is
> something for a later patch. It would also need another API, as there is a
> tradeoff between speed and compression ratio.
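To make the remark about PNG predictors concrete, here is a small JDK-only sketch of one PNG row filter ("Sub") followed by Deflate. It is an illustration, not PDFBox code: a real predictor-15 stream also prefixes every row with a filter-type byte and may pick a different filter per row, which is omitted here. All names (`PngSubFilterSketch`, `subFilter`, `subUnfilter`, `deflatedSize`) are hypothetical.

```java
import java.util.zip.Deflater;

final class PngSubFilterSketch {

    /** PNG "Sub" filter: each byte minus the corresponding byte of the pixel to its left. */
    static byte[] subFilter(byte[] row, int bytesPerPixel) {
        byte[] out = new byte[row.length];
        for (int i = 0; i < row.length; i++) {
            int left = i >= bytesPerPixel ? row[i - bytesPerPixel] & 0xFF : 0;
            out[i] = (byte) ((row[i] & 0xFF) - left); // arithmetic is modulo 256
        }
        return out;
    }

    /** Inverse of subFilter, as a decoder would apply it. */
    static byte[] subUnfilter(byte[] filtered, int bytesPerPixel) {
        byte[] out = new byte[filtered.length];
        for (int i = 0; i < filtered.length; i++) {
            int left = i >= bytesPerPixel ? out[i - bytesPerPixel] & 0xFF : 0;
            out[i] = (byte) ((filtered[i] & 0xFF) + left);
        }
        return out;
    }

    /** Returns the number of bytes Deflate produces for the given data. */
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }
}
```

On smooth image data the filtered row consists mostly of small, repetitive differences, which Deflate tends to compress better than the raw samples. The extra filtering pass is where the speed versus compression ratio tradeoff mentioned above comes from.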
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429707#comment-16429707 ]

Tilman Hausherr commented on PDFBOX-4184:

I found the cause of the bug from the github issue: it is in {{createAlphaFromARGBImage}}, the line {{bos.write(pixel)}}. For 16 bit images it should be changed to {{bos.write(pixel / 256)}}. So the existing code should be changed to

{code}
else
{
    bpc = 8;
    int dataType = alphaRaster.getDataBuffer().getDataType();
    if (dataType == DataBuffer.TYPE_USHORT)
    {
        // 16 bit alpha samples: keep only the high byte
        for (int pixel : pixels)
        {
            bos.write(pixel / 256);
        }
    }
    else
    {
        // 8 bit alpha samples can be written as they are
        for (int pixel : pixels)
        {
            bos.write(pixel);
        }
    }
}
{code}

Sadly this doesn't explain why I can't produce a test that fails... I tried various alpha values and nothing weird happened.
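Why the unscaled write goes wrong for 16 bit samples can be shown without PDFBox: `OutputStream.write(int)` stores only the low-order 8 bits of its argument. The sketch below compares both variants; the class and method names (`AlphaDownscaleSketch`, `writeTruncated`, `writeHighByte`) are hypothetical and not part of PDFBox.

```java
import java.io.ByteArrayOutputStream;

final class AlphaDownscaleSketch {

    /** Writes each sample as-is; write(int) silently keeps only the low 8 bits. */
    static byte[] writeTruncated(int[] samples) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (int sample : samples) {
            bos.write(sample);
        }
        return bos.toByteArray();
    }

    /** Writes the high byte of each 16 bit sample, as in bos.write(pixel / 256). */
    static byte[] writeHighByte(int[] samples) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (int sample : samples) {
            bos.write(sample / 256);
        }
        return bos.toByteArray();
    }
}
```

For a half-transparent 16 bit alpha value of 0x8000, the truncating variant stores 0x00 (fully transparent) while the scaled variant stores 0x80. A fully opaque 0xFFFF comes out as 0xFF either way, so a test that only uses fully opaque or fully transparent pixels would not expose the difference.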
[jira] [Commented] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429702#comment-16429702 ]

Tilman Hausherr commented on PDFBOX-4184:

The two last files (no smask) show that the bug is in the smask creation. The RGB images are identical visually (but different in bit size).
[jira] [Updated] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-4184:

    Attachment: png16-arrow-good-no-mask.pdf
                png16-arrow-bad-no-smask.pdf
[jira] [Updated] (PDFBOX-4184) [PATCH]: Support simple lossless compression of 16 bit RGB images
[ https://issues.apache.org/jira/browse/PDFBOX-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-4184:

    Attachment: png16-arrow-good.pdf
                png16-arrow-bad.pdf
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429669#comment-16429669 ]

Maruan Sahyoun commented on PDFBOX-4182:

Thanks - I did a special merge implementation which works without leaving the files open, but it is for a very specific set of PDFs (merging over 1 docs in one go), so maybe we can find a way to also deal with the issues which currently prevent us from doing it. OTOH, if the resulting file is large it will still need lots of memory. We could take a look at memory-mapped files for caching.

[~pasfilip] would it be possible to share a small set of your documents to get an idea of which PDF elements they use?

> Improve memory usage of PDFMergerUtility
>
> Key: PDFBOX-4182
> URL: https://issues.apache.org/jira/browse/PDFBOX-4182
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9
> Reporter: Pas Filip
> Priority: Major
> Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java,
> Suppliers.java,
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png,
> merge-pdf-stats.xlsx, oom-2gb-heap-after-refactoring-leak-suspect-1.png,
> oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful -
> refactored-merge-utility-4gb-heap-2618-files-merged.png, successful
> -merge-utility-6gb-heap-2618-files-merged.png,
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png,
> successful-merge-utility-8gb-heap-2618-files-merged.png,
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
> I have been running some tests trying to merge a large number (2618) of small
> pdf documents, between 100kb and 130kb each, into a single large pdf (288.433kb).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage.
> (see screenshot from mat in attachment)
> (I would include the hprof in attachment so you can analyze it yourselves, but
> it's rather large.)
> Note that it seems impossible to generate a large pdf using a small memory
> footprint.
> I personally thought that using MemorySettings with temporary file only would
> allow me to generate arbitrarily large pdf files, but it doesn't seem to help.
> I've run mergeDocuments with these memory settings:
> * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L
> * 1024L)
> * MemoryUsageSetting.setupTempFileOnly()
> Refactored version completes with *4GB* heap:
> with temp file only completes 2618 documents in 1.760 min
> *VS*
> *8GB* heap:
> with temp file only completes 2618 documents in 2.0 min
> Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB
> and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects
> in a list, which are closed after the merge is completed.
> Refactoring the code to close these as they are used, instead of accumulating
> them and closing them all at the end, improves memory usage considerably (although
> the problem doesn't seem to be eliminated completely, based on the mat analysis).
> Another change I've implemented is to only create the inputstream when the
> file needs to be read, and to close it alongside the PDDocument.
> (Some inputstreams contain buffers, and depending on the size of the buffers
> and/or the stream type, accumulating all the streams is a potential
> memory hog.)
> These changes seem to be beneficial in the sense that I can
> process the same number of pdfs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase.
> (I've respected java 6 compatibility.)
> I've included in attachment the java files of the new implementation:
> * Suppliers
> * Supplier
> * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. No signature
> changes, only internal code changes. (Just rename the class to
> PDFMergerUtility if you decide to implement the changes.)
> In attachment you can also find some screenshots from visualvm showing the
> memory usage of the original version and the refactored version, as well as
> some info produced by mat after analysing the heap.
> If you know of any other means, without running into memory issues, to merge
> large sets of pdf files into a single large pdf, I'd love to hear about it!
> I'd also suggest that further improvements should be made to memory
> usage in general, as pdfbox seems to consume a lot of memory in general.
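The idea of opening each input only when it is needed and closing it before moving on can be sketched without PDFBox. This is not the attached PDFMergerUtilityUsingSupplier; it only counts bytes instead of merging documents, it uses Java 8 syntax rather than the Java 6 compatible style mentioned above, and every name in it (`LazyMergeSketch`, `StreamSupplier`, `OpenCounter`, `tracked`, `consumeOneAtATime`) is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;

final class LazyMergeSketch {

    /** Defers opening a source until it is actually consumed. */
    interface StreamSupplier {
        InputStream open() throws IOException;
    }

    /** Records how many streams are open now and the highest number seen. */
    static final class OpenCounter {
        int open;
        int maxOpen;
    }

    /** Wraps in-memory data in a supplier that reports opens and closes to the counter. */
    static StreamSupplier tracked(byte[] data, OpenCounter counter) {
        return () -> {
            counter.open++;
            counter.maxOpen = Math.max(counter.maxOpen, counter.open);
            return new ByteArrayInputStream(data) {
                @Override
                public void close() throws IOException {
                    counter.open--;
                    super.close();
                }
            };
        };
    }

    /** Processes sources one by one, closing each before the next is opened. */
    static long consumeOneAtATime(List<StreamSupplier> sources) {
        long total = 0;
        byte[] buffer = new byte[8192];
        for (StreamSupplier source : sources) {
            try (InputStream in = source.open()) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }
}
```

With this shape, at most one source stream and its buffers are live at any time, regardless of how many inputs are queued. Whether the same can be done for PDDocument objects depends on the issues mentioned in the comment above, since a merged document may still reference data from its sources.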
[jira] [Commented] (PDFBOX-4004) Elements in the structure tree are not removed or corrected when flattening
[ https://issues.apache.org/jira/browse/PDFBOX-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429656#comment-16429656 ]

Maruan Sahyoun commented on PDFBOX-4004:

[~tilman] I'll take a look after doing PDFBOX-3809

> Elements in the structure tree are not removed or corrected when flattening
>
> Key: PDFBOX-4004
> URL: https://issues.apache.org/jira/browse/PDFBOX-4004
> Project: PDFBox
> Issue Type: Bug
> Components: AcroForm
> Affects Versions: 2.0.8
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: StructureTree, flatten
> Attachments: GovFormPreFlattened.pdf
>
> When flattening, the elements in the structure tree are neither removed nor
> adjusted (to the form xobject). An example can be found at
> {{Root/StructTreeRoot/ParentTree/Nums/\[31]/K/Obj}} in the file
> GovFormPreFlattened.pdf. This links to something that does not really exist
> anymore.
[jira] [Assigned] (PDFBOX-4004) Elements in the structure tree are not removed or corrected when flattening
[ https://issues.apache.org/jira/browse/PDFBOX-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun reassigned PDFBOX-4004:

    Assignee: Maruan Sahyoun
[jira] [Updated] (PDFBOX-4004) Elements in the structure tree are not removed or corrected when flattening
[ https://issues.apache.org/jira/browse/PDFBOX-4004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-4004:

    Component/s: AcroForm