[
https://issues.apache.org/jira/browse/PDFBOX-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416957#comment-17416957
]
Christian Appl commented on PDFBOX-5279:
----------------------------------------
Agreed, this is not about lossless compression at all. All of those
compressors would prioritize a small file size.
To be upfront about this - since you are talking about "some tool that" - am I
right to assume that you are not convinced PDFBox is the right place to solve
this? :)
As far as I can see, even Adobe products don't offer the means to do something
like this, so I can see why PDFBox would have no reason to provide it.
In that case - no harm in closing the ticket - the library will be really
helpful no matter how I go about it.
Any opinions and tips on how, why and what are welcome anyway!
*Why?*
The question arose from a certain disappointment with the effectiveness of
object compression - it did not reduce the file size by a really "impressive"
amount. The next obvious question was: What else could be compressed?
In many PDFs, images are the obvious candidate, and JPEG compression was chosen
because it produces the most noticeable effect. (This worked very well for some
example PDFs - accepting the resulting artifacts and quality loss.)
But you are of course right about that - if this were to be implemented, the
compression format would have to be selectable and customizable.
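As a rough illustration of what "customizable" could mean for JPEG, here is a
minimal sketch that uses only the JDK's ImageIO and no PDFBox API. The class
and method names (JpegQualitySketch, encodeJpeg) are made up for this example
and are not part of the POC.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper, not a PDFBox class.
final class JpegQualitySketch {

    /** Encodes an RGB image as JPEG; quality runs from 0.0 (smallest) to 1.0 (best). */
    static byte[] encodeJpeg(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // Explicit mode is required before a custom quality may be set.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(buffer)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return buffer.toByteArray();
    }
}
```

A caller would probably want to compare the result with the size of the
existing image data and keep whichever is smaller, since re-encoding is not
guaranteed to shrink every image.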
The nice thing about object streaming is that it does not really change the
contents of the document. The original state could always be regenerated
without losses.
However, all compressors discussed in this ticket should be deactivated by
default and should require the user to actively add or select them - since
most of them would probably apply a lossy compression.
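To make the "deactivated by default" idea concrete, here is a minimal sketch
of an opt-in pipeline. Everything in it (OptInCompressorsSketch, add, apply)
is a hypothetical name for illustration only; it is neither an existing
PDFBox API nor the design from the POC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical opt-in pipeline: nothing is registered by default.
final class OptInCompressorsSketch {
    private final List<UnaryOperator<byte[]>> compressors = new ArrayList<>();

    /** A (possibly lossy) step only runs after the caller adds it explicitly. */
    OptInCompressorsSketch add(UnaryOperator<byte[]> compressor) {
        compressors.add(compressor);
        return this;
    }

    /** Applies the registered steps in order; with none registered the input is returned as-is. */
    byte[] apply(byte[] content) {
        byte[] result = content;
        for (UnaryOperator<byte[]> compressor : compressors) {
            result = compressor.apply(result);
        }
        return result;
    }
}
```

With an empty pipeline the content passes through untouched, so a user who
never selects a compressor cannot lose quality by accident.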
*Concerning OCR:*
OCR should/could/must be applied before reducing content quality.
*Ideas for further lossless size reductions:*
- Find content that is not compressed and apply a lossless compression (like
FLATE).
- Remove orphaned elements.
I had a really close look at the COSWriter during the development of the other
contributions and I'm almost convinced that implementing those would be a
major waste of time. I'm not sure whether PDFBox would ever produce orphaned
objects or uncompressed streams...
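For reference, the first idea in the list above (applying FLATE to
uncompressed content) boils down to something like the following sketch. It
uses only java.util.zip and works on plain byte arrays. How such bytes would
be obtained from or written back to a COSStream is deliberately left out, and
FlateSketch is a made-up name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not a PDFBox class.
final class FlateSketch {

    /** Compresses raw bytes into zlib-wrapped deflate data. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bytes, which is what makes this step lossless. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid FLATE data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Invalid FLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The round trip through inflate() returns exactly the input bytes, so unlike
the image case no quality is traded for size here.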
A further idea would be to reduce identical elements to a single referenceable
resource. However, determining whether two images have identical content could
be really costly, especially for large documents, and I'm not sure I like the
idea of multiple pages referencing the same image resource (for example).
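One common way to keep that comparison affordable is to hash every stream
once and only treat streams with equal digests as candidates. The sketch
below assumes the stream bytes are already available as byte arrays.
DuplicateStreamSketch and groupByDigest are hypothetical names, and only
byte-identical data is detected (the same picture encoded differently would
not match).

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a PDFBox class.
final class DuplicateStreamSketch {

    /** Maps a SHA-256 digest to the indices of all inputs that share it. */
    static Map<String, List<Integer>> groupByDigest(List<byte[]> streams) {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < streams.size(); i++) {
            // digest(byte[]) processes the input and resets the instance for the next one.
            String key = Base64.getEncoder().encodeToString(sha256.digest(streams.get(i)));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
        }
        return groups;
    }
}
```

Every group with more than one index is a candidate for a shared resource.
The cost is one pass over each stream instead of a pairwise comparison, but
it still means reading all image data of a large document.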
*Claim:*
Which again leaves me with the initial claim: If your priority is to reduce
the file size, a loss in quality must be accepted.
> PDF compression - Content compression
> -------------------------------------
>
> Key: PDFBOX-5279
> URL: https://issues.apache.org/jira/browse/PDFBOX-5279
> Project: PDFBox
> Issue Type: Improvement
> Components: Rendering, Writing
> Affects Versions: 3.0.0 PDFBox
> Reporter: Christian Appl
> Priority: Major
>
> *TL;DR:*
> This is sort of a follow-up ticket for PDFBOX-4952 and discusses whether this
> would be a good idea and which contents could be compressed to further reduce
> the file size of resulting PDF documents.
> I would like to provide a feature like that and possibly I will get the
> assignment to do something like it - but I will not repeat my mistake of
> developing a solution first and only then contacting you. Previous feedback
> resulted in much easier and better solutions, hence I created this ticket.
> *Basic question:*
> Should PDFBox be able to do something like that?
> All further thoughts are somewhat pointless if it should not.
> *Limitations of object streaming and PDFBOX-4952:*
> Object streaming has a chance to reduce the file size of documents containing
> many referenceable top-level objects by bundling and compressing them in a
> common stream.
> It can also increase the file size for documents containing only a few such
> objects (due to the overhead of object stream creation).
> *Content Compressors:*
> To enhance the compression, that ticket originally contained a suggestion for
> "ContentCompressor"s, whose task would have been to recognize compressible
> structures and to further reduce their size by applying a set of rules
> defined by the "CompressParameters".
> Those compressors would have been integrated into the flow of object stream
> creation/compression.
> As a starting point the following "ContentCompressors" were originally
> suggested:
> - "UnencodedStreamCompressor" - apply FLATE to previously uncompressed
> streams.
> - "ImageCompressor" - apply DCT compression to images, with a configurable
> quality and resolution.
> In the POC both had sort of an effect and reduced the file size drastically
> for some PDFs.
> But when I picked up the shelved code again, I was able to create a test case
> for e.g. image compression where the file size exploded when trying to
> translate a BMP ImageXObject to a DCT stream.
> Therefore I have to assume that those compressors were fit to be a POC, but do
> not work reliably and as expected.
> *Use case:*
> The first use case that comes to mind is scanners. Some scanners create huge
> high-res images for scanned pages and bundle them into PDFs, which results in
> large PDFs that waste a lot of disk space for questionable benefit and cause
> issues when processing them further (e.g. sending them via mail with a
> limited attachment size).
> Such scans could benefit from a compression like this.
> *Step back:*
> If you are still with me, let me ask some questions first:
> - If something like that shall be implemented - do these compressors make any
> sense?
> - Is it at all possible for the COSWriter to encounter uncompressed streams,
> or is this compressor entirely superfluous?
> - Is it a good idea to compress images and do you have an idea how to solve
> this?
> - What else could be compressed?
> - Does this already exist for PDFBox - even if it was only a partial solution?