[ 
https://issues.apache.org/jira/browse/PDFBOX-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417163#comment-17417163
 ] 

Christian Appl edited comment on PDFBOX-5279 at 9/18/21, 3:21 PM:
------------------------------------------------------------------

PDF-Tools, you say... True, I think I read something about a PDF "optimizer"
or something like that some time ago.
I would have to look into that; we are actually already using PDF-Tools
for PDF/A conversion. Thanks for the lead! Possibly this ticket is obsolete
then. :)

However, I suppose they won't reduce "oversized" resources.
Sometimes you find really interesting examples of PDFs in the wild, so I will
most likely still have to give that some thought.


> PDF compression - Content compression
> -------------------------------------
>
>                 Key: PDFBOX-5279
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5279
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Rendering, Writing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Christian Appl
>            Priority: Major
>
> *TL;DR:*
> This is a follow-up ticket to PDFBOX-4952. It discusses whether content
> compression would be a good idea and which contents could be compressed to
> further reduce the file size of resulting PDF documents.
> I would like to provide a feature like that, and I may be assigned to
> implement it, but I will not repeat my mistake of developing a solution
> first and only contacting you afterwards. Previous feedback resulted
> in much easier and better solutions, hence I created this ticket.
> *Basic question:*
> Should PDFBox be able to do something like that?
> All further thoughts are somewhat pointless if it should not.
> *Limitations of object streaming and PDFBOX-4952:*
> Object streaming has a chance to reduce the file size of documents containing
> many referenceable top-level objects by bundling and compressing them in a
> common stream.
> It also has a chance to increase the file size of documents containing only
> a few such objects (due to the overhead of object stream creation).
> *Content Compressors:*
> To enhance the compression, that ticket originally contained a suggestion
> for "ContentCompressor"s, whose task would have been to recognize
> compressible structures and to further reduce their size by applying a set
> of rules defined by the "CompressParameters".
> Those compressors would have been integrated into the flow of object stream
> creation/compression.
> As a starting point, the following "ContentCompressors" were originally
> suggested:
> - "UnencodedStreamCompressor" - apply FLATE to previously uncompressed 
> streams.
> - "ImageCompressor" - apply DCT compression to images, with a configurable 
> quality and resolution.
> In the POC, both had some effect and reduced the file size drastically
> for some PDFs.
> But when I picked up the shelved code again, I was able to create a test
> case, e.g. for image compression, in which the file size exploded when
> trying to translate a BMP ImageXObject to a DCT stream.
> Therefore I have to assume that those compressors were fit to be a POC, but
> do not work reliably and as expected.
> *Use case:*
> The first use case that comes to mind is scanners. Some scanners create
> huge high-resolution images of scanned pages and bundle them into PDFs. The
> resulting large PDFs waste a lot of disk space for questionable benefit and
> cause issues in further processing (e.g. sending them by mail with a
> limited attachment size).
> Such scans could benefit from a compression like this.
> *Step back:*
> If you are still with me, let me ask some questions first:
> - If something like that is to be implemented, do these compressors make
> any sense?
> - Is it at all possible for the COSWriter to encounter uncompressed streams,
> or is this compressor entirely superfluous?
> - Is it a good idea to compress images, and do you have an idea how to solve
> this?
> - What else could be compressed?
> - Does this already exist for PDFBox, even if only as a partial solution?
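
One way to guard against the "file size exploded" case described above might
be to encode a candidate stream both ways and keep the result only if it is
actually smaller. The following is a minimal sketch under that assumption. It
uses only JDK classes (java.util.zip and javax.imageio), not PDFBox; the class
SizeGuard and its methods are hypothetical names chosen for illustration and
do not exist in PDFBox or in the POC.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper for illustration only; not part of PDFBox.
final class SizeGuard {

    /** Deflates raw bytes with the zlib wrapper, the format used by /FlateDecode. */
    static byte[] flate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** JPEG-encodes (DCT) an RGB image with a quality between 0 and 1. */
    static byte[] jpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream ios = new MemoryCacheImageOutputStream(out)) {
            writer.setOutput(ios);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return out.toByteArray();
    }

    /** Packs the pixels as 8-bit RGB samples, like an unencoded image stream. */
    static byte[] rawRgb(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        byte[] raw = new byte[w * h * 3];
        int i = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = image.getRGB(x, y);
                raw[i++] = (byte) (rgb >> 16);
                raw[i++] = (byte) (rgb >> 8);
                raw[i++] = (byte) rgb;
            }
        }
        return raw;
    }

    /** Names the smallest candidate: "None", "FlateDecode" or "DCTDecode". */
    static String smallestFilter(BufferedImage image, float quality) {
        int rawSize = rawRgb(image).length;
        int flateSize = flate(rawRgb(image)).length;
        int jpegSize = jpeg(image, quality).length;
        if (rawSize <= flateSize && rawSize <= jpegSize) {
            return "None";
        }
        return flateSize <= jpegSize ? "FlateDecode" : "DCTDecode";
    }

    public static void main(String[] args) {
        // Line-art style test image: alternating black and white columns.
        BufferedImage stripes = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 64; y++) {
            for (int x = 0; x < 64; x++) {
                stripes.setRGB(x, y, x % 2 == 0 ? 0x000000 : 0xFFFFFF);
            }
        }
        System.out.println(smallestFilter(stripes, 0.75f));
    }
}
```

For highly regular content such as the striped image in main, FLATE should
come out far smaller than DCT, while photographic scans would usually be
expected to favour DCT. The same comparison covers very short streams, for
which the zlib wrapper alone can make the FLATE output longer than the input.
Whether such a check could be hooked into the COSWriter is left open here.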


