[ 
https://issues.apache.org/jira/browse/PDFBOX-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191086#comment-17191086
 ] 

Christian Appl commented on PDFBOX-4952:
----------------------------------------

I am glad to hear that you are interested!

- Hmmm - I had a brief look at PDFBOX-45. I don't know whether my compression 
would support or conflict with incremental saves. As only generation 0 objects 
may be added to object streams, there should be no conflict. I agree that I 
will have to test and check that once you are finished.

- I will do my very best not to break anything, and I have already had my fair 
share of bugs and mishaps while testing this feature on our end... This must 
not cause issues, and my changes must not influence the operation of the 
default document saving at all. I tried to minimize changes to COSWriter, but I 
am afraid that one cannot claim that there are only a few minor changes to 
that class...

- The first questions that I should ask at this point: Should this touch 
COSWriter at all? Should an entirely isolated COSCompressionWriter perhaps be 
created instead, to minimize possible problems?

- What do you mean by "<..>"? Do you mean generic types?

- The formatting will be checked.

- The DCT compression is not directly required for this to work and could be 
deferred to a later commit.

Compressing images is the easiest and most promising route to reduce document 
size, as images are often rather large resources, and for many documents this 
will directly result in large drops in file size.

Compressing uncompressed streams is a minor optimization and does _sometimes_ 
result in worthwhile changes to document size, provided that such streams are 
present in the document.
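
To illustrate the "sometimes" above, here is a minimal sketch using only the JDK's java.util.zip.Deflater, not the code from the pull request. The class and method names (FlateSketch, deflate, worthEncoding) are hypothetical. It assumes the encoded form should only be kept when it is actually smaller than the raw stream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper, not part of PDFBox or the pull request.
final class FlateSketch {

    /** Deflates the bytes with zlib framing, the format FlateDecode expects. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Assumption: encoding is only worthwhile when the result is smaller. */
    static boolean worthEncoding(byte[] raw) {
        return deflate(raw).length < raw.length;
    }

    public static void main(String[] args) {
        // A repetitive content stream compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n");
        }
        byte[] repetitive = sb.toString().getBytes(StandardCharsets.US_ASCII);
        // A tiny stream grows, because the zlib header and checksum dominate.
        byte[] tiny = "q Q".getBytes(StandardCharsets.US_ASCII);

        System.out.println("repetitive worth encoding: " + worthEncoding(repetitive));
        System.out.println("tiny worth encoding: " + worthEncoding(tiny));
    }
}
```

A very short stream can come out larger than it went in, which is why a size check before replacing the stream seems prudent.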

Object streams, however, are highly dependent on the number of objects used in 
a document. For high object counts the file size difference is definitely worth 
it. But it must be said: for very basic and minimalistic documents containing 
only a few compressible COSObjects, object streaming can be a loss (due to the 
overhead). Most documents will profit from object streaming, but the effect of 
this alone might not be as relevant as one (or I, initially) might expect.
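
As a rough sketch of which objects can be bundled at all (see the generation 0 remark above): the PDF specification excludes stream objects, objects with a non-zero generation number and the document's encryption dictionary from object streams. The list is not exhaustive. The IndirectObject type below is a hypothetical, simplified stand-in and not a PDFBox class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the selection logic from the pull request.
final class ObjectStreamEligibility {

    /** Simplified stand-in for an indirect object; not a PDFBox type. */
    static final class IndirectObject {
        final long number;
        final int generation;
        final boolean isStream;
        final boolean isEncryptionDictionary;

        IndirectObject(long number, int generation,
                       boolean isStream, boolean isEncryptionDictionary) {
            this.number = number;
            this.generation = generation;
            this.isStream = isStream;
            this.isEncryptionDictionary = isEncryptionDictionary;
        }
    }

    /** Applies the three exclusions named above; other exclusions are ignored here. */
    static boolean mayGoIntoObjectStream(IndirectObject o) {
        return o.generation == 0 && !o.isStream && !o.isEncryptionDictionary;
    }

    /** Returns only the objects that could be placed in an object stream. */
    static List<IndirectObject> selectCompressible(List<IndirectObject> all) {
        List<IndirectObject> result = new ArrayList<>();
        for (IndirectObject o : all) {
            if (mayGoIntoObjectStream(o)) {
                result.add(o);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<IndirectObject> objects = Arrays.asList(
                new IndirectObject(1, 0, false, false),  // plain dictionary: eligible
                new IndirectObject(2, 0, true, false),   // stream: excluded
                new IndirectObject(3, 1, false, false),  // generation 1: excluded
                new IndirectObject(4, 0, false, true));  // encryption dictionary: excluded
        System.out.println("eligible objects: " + selectCompressible(objects).size());
    }
}
```

For a very small document only a handful of objects pass this filter, so the object stream and the cross-reference stream it requires can cost more than they save.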

> PDF compression - object stream creation
> ----------------------------------------
>
>                 Key: PDFBOX-4952
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4952
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 2.0.21
>            Reporter: Christian Appl
>            Priority: Major
>
> I implemented a basic starting point to realize a PDF compression based on 
> PDFBox 2.0.22-SNAPSHOT
> I want to use this ticket, to ask if you would be interested in such a 
> feature and whether you would be interested to merge it into PDFBox.
> This is sort of a POC, only implementing some very basic functionality, that 
> surely must and could be extended further and it does only implement some 
> very basic and simplistic Unit Tests.
>  However it is able to reduce the size of resulting documents, and creates 
> objectstreams as defined in the PDF reference manual.
> *What it currently does:*
>  It provides the bundling and compression of objects to objectstreams and 
> further applies simple content compression to a small selection of contents.
> To realize content compression, it provides a simple interface and abstract 
> class for "ContentCompressor"s which search a document for specific content, 
> that could be compressed and do compress that contents.
> Currently two content compressors exist:
>  _ImageCompressor_
>  Searches for simple images, that could be compressed using DCT.
> _UnencodedStreamCompressor_
>  Searches the document for yet unencoded streams and applies a Flate 
> compression where necessary.
> Both compressors can be parameterized using a centralized 
> "CompressParameters" instance which is passed to a new "saveCompressed" 
> method of PDDocument.
> The compression is based on, modifies and is realized by a set of extensions 
> for the "COSWriter" class. Basically it organizes objects, that are passed to 
> the COSWriter in objectStreams and applies content optimization where 
> necessary and possible.
> Currently this does support encryption, but does not support linearization of 
> the compressed documents.
> *Caveat:*
>  If this feature is interesting to you, then I would not expect you to simply 
> merge this fork into 2.0.22. I am expecting that you would like to have some 
> details and concepts changed and am ready to implement changes that would be 
> required for this to work to your liking.
> *POC:*
>  4 resulting documents can be found in "target/test-output/compression" when 
> "COSDocumentCompressionTest" is run.
> *The Pull request can be found on Github at:*
>  [https://github.com/apache/pdfbox/pull/86]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
