Christian Appl created PDFBOX-5279:
--------------------------------------
Summary: PDF compression - Content compression
Key: PDFBOX-5279
URL: https://issues.apache.org/jira/browse/PDFBOX-5279
Project: PDFBox
Issue Type: Improvement
Components: Rendering, Writing
Affects Versions: 3.0.0 PDFBox
Reporter: Christian Appl
*TL;DR:*
This is a follow-up ticket for PDFBOX-4952. It asks whether content compression
would be a good idea, and which contents could be compressed to further reduce
the file size of the resulting PDF documents.
I would like to provide such a feature, and I may be assigned to work on it,
but I do not want to repeat my mistake of developing a solution first and only
contacting you afterwards. Previous feedback led to much simpler and better
solutions, hence this ticket.
*Basic question:*
Should PDFBox be able to do something like this?
All further thoughts are moot if it should not.
*Limitations of object streaming and PDFBOX-4952:*
Object streaming can reduce the file size of documents that contain many
referenceable top-level objects by bundling and compressing them in a common
stream.
It can also increase the file size of documents that contain only a few such
objects, because creating the object stream adds overhead.
*Content Compressors:*
To enhance the compression, PDFBOX-4952 originally contained a suggestion for
"ContentCompressor"s, whose task would have been to recognize compressible
structures and further reduce their size by applying a set of rules defined by
the "CompressParameters".
Those compressors would have been integrated into the flow of object stream
creation/compression.
As a starting point, the following "ContentCompressor"s were originally
suggested:
- "UnencodedStreamCompressor" - apply FLATE to previously uncompressed streams.
- "ImageCompressor" - apply DCT compression to images, with a configurable
quality and resolution.
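To make the first suggestion more concrete, here is a minimal sketch of the
idea behind an "UnencodedStreamCompressor", written against the JDK only. The
class and method names (FlateIfSmaller, compressIfSmaller) are hypothetical and
are not PDFBox API; the sketch also assumes the raw stream bytes are already
available as a byte array. It applies FLATE (zlib format) and keeps the result
only if it is actually smaller than the input.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class FlateIfSmaller {

    /** Outcome of a compression attempt: the bytes to store and whether FLATE was applied. */
    static final class Result {
        final byte[] data;
        final boolean flateApplied;

        Result(byte[] data, boolean flateApplied) {
            this.data = data;
            this.flateApplied = flateApplied;
        }
    }

    /** Compresses the input with zlib/deflate at the highest compression level. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib/deflate data; used here only to verify the round trip. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported FLATE data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid FLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Applies FLATE, but falls back to the original bytes if that would not save space. */
    static Result compressIfSmaller(byte[] raw) {
        byte[] deflated = deflate(raw);
        if (deflated.length < raw.length) {
            return new Result(deflated, true);
        }
        return new Result(raw, false);
    }
}
```

The fallback matters for very small or high-entropy streams, where the zlib
header and checksum alone can make the "compressed" form larger than the input.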
In the POC, both compressors had an effect and drastically reduced the file
size of some PDFs.
But when I picked up the shelved code again, I was able to create a test case
for image compression in which the file size exploded when translating a BMP
ImageXObject to a DCT stream.
I therefore have to assume that those compressors were fit for a POC, but do
not work reliably or as expected.
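One possible safeguard against such a regression, sketched here with the JDK's
ImageIO only, is to accept a DCT (JPEG) re-encoding solely when it is smaller
than the stream it would replace. This is an assumption about how an
"ImageCompressor" could protect itself, not a description of the POC, and it
does not claim to explain the test case above. The names DctSizeGuard and
recompressIfSmaller are hypothetical, and the sketch assumes the size of the
existing encoded stream is known to the caller.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.Optional;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class DctSizeGuard {

    /** Encodes an RGB image as JPEG at the given quality (0.0f = smallest, 1.0f = best). */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpeg");
        if (!writers.hasNext()) {
            throw new IllegalStateException("no JPEG writer available");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /**
     * Returns the JPEG bytes only if they are smaller than the existing encoded
     * stream; otherwise returns empty so the caller keeps the original stream.
     */
    static Optional<byte[]> recompressIfSmaller(BufferedImage image,
                                                int currentEncodedLength,
                                                float quality) {
        byte[] jpeg = encodeJpeg(image, quality);
        if (jpeg.length < currentEncodedLength) {
            return Optional.of(jpeg);
        }
        return Optional.empty();
    }
}
```

With a guard like this, an image whose existing encoding is already compact
(for example a small, flat-colored bitmap stored with FLATE) would simply be
left untouched instead of growing.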
*Use case:*
The first use case that comes to mind is scanners. Some scanners create huge
high-resolution images of the scanned pages and bundle them into PDFs. The
resulting files are large and waste a lot of disk space for questionable
benefit, and they cause problems in further processing (e.g. sending them by
mail with a limited attachment size).
Such scans could benefit from this kind of compression.
*Step back:*
If you are still with me, let me ask some questions first:
- If something like this is to be implemented, do these compressors make any
sense?
- Is it at all possible for the COSWriter to encounter uncompressed streams, or
is the "UnencodedStreamCompressor" entirely superfluous?
- Is it a good idea to compress images, and do you have an idea how to approach
this?
- What else could be compressed?
- Does this already exist in PDFBox, even as a partial solution?