Christian Appl created PDFBOX-5279:
--------------------------------------

             Summary: PDF compression - Content compression
                 Key: PDFBOX-5279
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5279
             Project: PDFBox
          Issue Type: Improvement
          Components: Rendering, Writing
    Affects Versions: 3.0.0 PDFBox
            Reporter: Christian Appl


*TL;DR:*
This is a follow-up ticket to PDFBOX-4952. It discusses whether content 
compression would be a good idea and which contents could be compressed to 
further reduce the file size of resulting PDF documents.

I would like to provide such a feature, and I may well be assigned to work on 
it, but I do not want to repeat my mistake of developing a solution first and 
contacting you only afterwards. Previous feedback resulted in much simpler and 
better solutions, hence this ticket.

*Basic question:*
Should PDFBox be able to do something like that?
If it should not, all further thoughts are pointless.

*Limitations of object streaming and PDFBOX-4952:*
Object streaming can reduce the file size of documents containing many 
referenceable top-level objects by bundling and compressing them in a common 
stream.

It can also increase the file size of documents containing only a few such 
objects (due to the overhead of object stream creation).

*Content Compressors:*
To enhance the compression, PDFBOX-4952 originally contained a suggestion for 
"ContentCompressor"s, whose task would have been to recognize compressible 
structures and to further reduce their size by applying a set of rules defined 
by the "CompressParameters".
Those compressors would have been integrated into the flow of object stream 
creation/compression.

As a starting point, the following "ContentCompressors" were originally 
suggested:
- "UnencodedStreamCompressor" - apply FLATE to previously uncompressed streams.
- "ImageCompressor" - apply DCT compression to images, with a configurable 
quality and resolution.
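
To make the "UnencodedStreamCompressor" idea more concrete, here is a minimal 
sketch using only java.util.zip. It is not PDFBox API: the class name 
FlateGuard and the method compressIfSmaller are hypothetical, and the sketch 
assumes that the compressor should keep the original bytes whenever FLATE does 
not actually shrink them.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper, not part of PDFBox: FLATE-compress a byte array and
// keep the result only if it is smaller than the input.
class FlateGuard {

    /** Deflates the whole input (zlib format) and returns the compressed bytes. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns the deflated bytes if they are smaller, otherwise the input itself. */
    static byte[] compressIfSmaller(byte[] raw) {
        byte[] packed = deflate(raw);
        return packed.length < raw.length ? packed : raw;
    }

    public static void main(String[] args) {
        // Repetitive content, loosely resembling an unencoded content stream.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("1 0 0 1 10 20 cm\n");
        }
        byte[] repetitive = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] tiny = {42};

        System.out.println(repetitive.length + " -> " + compressIfSmaller(repetitive).length);
        System.out.println(tiny.length + " -> " + compressIfSmaller(tiny).length);
    }
}
```

The size check matters for very short streams, where the zlib header and 
checksum alone can outweigh the payload.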

In the POC, both had an effect and drastically reduced the file size of some 
PDFs.
But when I picked up the shelved code again, I was able to create a test case 
for image compression in which the file size exploded when translating a bmp 
ImageXObject to a DCT stream.
Therefore I have to assume that those compressors were fit for a POC, but do 
not work reliably and as expected.
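
One way to guard against such a size explosion might be to encode the image 
both ways and compare the results before replacing anything. The following is 
only a sketch under assumptions: it uses javax.imageio and java.util.zip 
instead of PDFBox, the class ImageRecompressionGuard and its methods are 
hypothetical, and a FLATE-compressed dump of the raw RGB samples stands in for 
the existing lossless image stream.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper, not part of PDFBox: compare a DCT (JPEG) encoding
// with a FLATE encoding of the same image before choosing one.
class ImageRecompressionGuard {

    /** Encodes the image as JPEG with the given quality (0.0f to 1.0f). */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** FLATE-compresses the raw 8-bit RGB samples of the image, row by row. */
    static byte[] flateRgb(BufferedImage image) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            for (int y = 0; y < image.getHeight(); y++) {
                for (int x = 0; x < image.getWidth(); x++) {
                    int rgb = image.getRGB(x, y);
                    out.write((rgb >> 16) & 0xFF);
                    out.write((rgb >> 8) & 0xFF);
                    out.write(rgb & 0xFF);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** True only if the JPEG encoding is strictly smaller than the FLATE one. */
    static boolean dctIsSmaller(BufferedImage image, float quality) {
        return encodeJpeg(image, quality).length < flateRgb(image).length;
    }

    public static void main(String[] args) {
        // A flat, single-colour image: highly compressible without loss.
        BufferedImage flat = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        System.out.println("DCT smaller than FLATE: " + dctIsSmaller(flat, 0.75f));
    }
}
```

Flat or synthetic images of this kind compress extremely well with FLATE, 
while a JPEG file always carries its quantization and Huffman tables, so a 
compressor would probably need a check like this before swapping streams.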

*Use case:*
The first use case that comes to mind is scanners. Some scanners create huge 
high-resolution images of the scanned pages and bundle them into PDFs. The 
resulting files are large, waste a lot of disk space for questionable benefit, 
and cause issues in further processing (e.g. when sending them via mail with a 
limited attachment size).

Such scans could benefit from a compression like this.

*Step back:*
If you are still with me, let me ask some questions first:
- If something like that should be implemented, do these compressors make any 
sense?
- Is it at all possible for the COSWriter to encounter uncompressed streams, or 
is the "UnencodedStreamCompressor" entirely superfluous?
- Is it a good idea to compress images, and do you have an idea how to solve 
this?

- What else could be compressed?
- Does this already exist in PDFBox, even if only as a partial solution?



