[jira] [Comment Edited] (PDFBOX-4952) PDF compression - object stream creation

Christian Appl (Jira) Tue, 17 Aug 2021 01:57:13 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400261#comment-17400261
 ]


Christian Appl edited comment on PDFBOX-4952 at 8/17/21, 8:56 AM:
------------------------------------------------------------------

*TL;DR:* 
 My question in what follows: should "PDDocument.saveIncremental()" also 
support object compression? I may have a solution for that.

*First of all:*
Sorry for my absence from this ticket, and thank you for considering making 
this the default behaviour for PDFBox 3!
 I only now found the opportunity to have a look at the current trunk (RC 
PDFBox 3).

As previously stated, I wanted to look into incrementally saving (signing) 
the document next, and into whether this works correctly with encryption.
 Encryption seems to work just fine, but none of the PDDocument 
"saveIncremental" methods supports CompressParameters (as of now). I 
experimented with that feature and destroyed some signed documents on the way, 
which results in a number of questions and perhaps some solutions:

*What about signatures?*
 The first issue one encounters: when object compression is activated for 
incremental saving and the signature itself is compressed, the signature will 
be invalid in the resulting document.
 Normally the signature's dictionary is passed to 
"COSWriter.visitFromDictionary()", which then initializes "signatureOffset", 
"signatureLength" etc. via:

_COSWriter.visitFromDictionary()_
 !image-2021-08-17-10-07-33-682.png!
 But if the signature is compressed, it is part of the object stream and never 
reaches said method. Those fields are left uninitialized and therefore 
"doWriteSignature()" is never called, because of:

_COSWriter.visitFromDocument()_
 !image-2021-08-17-10-10-21-418.png!
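
Since the screenshots are not reproduced here, the mechanism can be pictured with a small toy model. This is only a sketch and not PDFBox code: the class "SignatureOffsetSketch", its methods and the exact guard condition are assumptions made for illustration. Only the field names "signatureOffset" and "signatureLength" are taken from the description above.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model, NOT PDFBox code: why a signature inside an object stream is never signed. */
class SignatureOffsetSketch {
    // Stand-ins for the writer fields named above; 0 means "never initialized".
    long signatureOffset = 0;
    long signatureLength = 0;
    boolean signatureWritten = false;
    private long position = 0;

    /** Hypothetical per-object visit, only reached by objects written directly to the file. */
    void visitTopLevel(String serialized, boolean isSignature) {
        if (isSignature) {
            signatureOffset = position;
            signatureLength = serialized.length();
        }
        position += serialized.length();
    }

    /** Members of an object stream are written as one blob; no per-object visit happens. */
    void writeObjectStream(List<String> members) {
        for (String member : members) {
            position += member.length();
        }
    }

    /** Assumed guard: the signing step only runs if the offsets were recorded. */
    void finish() {
        if (signatureOffset > 0 && signatureLength > 0) {
            signatureWritten = true;
        }
    }

    /** Writes a header and one signature dictionary, either top level or compressed. */
    static boolean run(boolean compressSignature) {
        SignatureOffsetSketch writer = new SignatureOffsetSketch();
        String signature = "<< /Type /Sig /Contents <0000> >>";
        writer.visitTopLevel("%PDF-1.7\n", false);
        if (compressSignature) {
            writer.writeObjectStream(Arrays.asList(signature));
        } else {
            writer.visitTopLevel(signature, true);
        }
        writer.finish();
        return writer.signatureWritten;
    }

    public static void main(String[] args) {
        System.out.println("top-level signature written: " + run(false));
        System.out.println("compressed signature written: " + run(true));
    }
}
```

In this model the compressed case leaves both fields at 0, so the signing step is skipped, which matches the behaviour described above.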

*This however can be solved:*
 I had a look at Adobe DC and checked how it handles this situation.

According to PDF 32000-1, a signature dictionary is not listed among the 
structures that must never be compressed, so it is compressible and could 
be included in an object stream. However, the reference manual also states that 
it is up to the implementing writer to decide which objects to compress, in 
which order to compress them etc.
 I found that Adobe DC does not compress signature dictionaries (even when 
object streams are used otherwise). I agree that this would be the 
easiest solution here, and excluding the signature from compression is as easy 
as adding the following condition:

_COSWriterCompressionPool.addObjectToPool()_
 _!image-2021-08-17-10-21-00-352.png!_

This ensures that signature dictionaries are always handled as "topLevel" 
objects and are therefore never compressed.
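
As a rough illustration of such a condition, here is a minimal sketch that does not use the PDFBox classes. The class "CompressionPoolSketch", the map-based dictionaries and the check on /Type are assumptions for this example. As far as I know, /Type is optional in a signature dictionary, so a real check may need further criteria.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model, NOT the PDFBox pool: keeps signature dictionaries out of object streams. */
class CompressionPoolSketch {
    final List<Map<String, Object>> topLevelObjects = new ArrayList<>();
    final List<Map<String, Object>> objectStreamObjects = new ArrayList<>();

    /** Assumption: a signature is recognized by /Type /Sig or /Type /DocTimeStamp. */
    static boolean isSignatureDictionary(Map<String, Object> dictionary) {
        Object type = dictionary.get("Type");
        return "Sig".equals(type) || "DocTimeStamp".equals(type);
    }

    /** Signature dictionaries stay top level; everything else may go into an object stream. */
    void addObjectToPool(Map<String, Object> dictionary) {
        if (isSignatureDictionary(dictionary)) {
            topLevelObjects.add(dictionary);
        } else {
            objectStreamObjects.add(dictionary);
        }
    }

    /** Builds a pool with one signature dictionary and one ordinary dictionary. */
    static CompressionPoolSketch demo() {
        Map<String, Object> signature = new HashMap<>();
        signature.put("Type", "Sig");
        Map<String, Object> font = new HashMap<>();
        font.put("Type", "Font");

        CompressionPoolSketch pool = new CompressionPoolSketch();
        pool.addObjectToPool(signature);
        pool.addObjectToPool(font);
        return pool;
    }

    public static void main(String[] args) {
        CompressionPoolSketch pool = demo();
        System.out.println("top level: " + pool.topLevelObjects.size());
        System.out.println("in object stream: " + pool.objectStreamObjects.size());
    }
}
```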

So PDDocument could implement a matching method like:
 !image-2021-08-17-10-56-48-431.png!
 No further changes would be necessary.

*Result:*
 With this change the signature dictionary is passed to 
"visitFromDictionary()" again (as it used to be), and the resulting documents 
contain valid signatures that could be located in the document and validated 
(e.g. using Adobe DC).

*Questions:*
 Even though this can be implemented, and even though I would recommend changes 
like this, I would like to ask some questions first:
 - Should saveIncremental provide the option to compress documents at all?
 - Am I missing further issues here?
 - Is it acceptable for signature dictionaries to remain uncompressed?
 - Is the following statement true? Incremental saving keeps the previous 
state of the document unchanged at the start of the file. Therefore objects 
that were uncompressed in a previous state should remain uncompressed and, vice 
versa, existing object streams should remain intact, even if a further 
signature is added to the document.
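
To picture the last question: if an incremental save only appends to the file, the earlier revision must survive byte for byte as a prefix of the new file. The following sketch checks exactly that property on plain byte arrays. It is only an illustration under that assumption; the class "IncrementalAppendSketch" and its methods are hypothetical and not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy model, NOT PDFBox code: an incremental update as "previous bytes + appended section". */
class IncrementalAppendSketch {

    /** Appends the increment after the untouched bytes of the previous revision. */
    static byte[] appendIncrement(byte[] previousRevision, byte[] increment) {
        byte[] result = Arrays.copyOf(previousRevision, previousRevision.length + increment.length);
        System.arraycopy(increment, 0, result, previousRevision.length, increment.length);
        return result;
    }

    /** True if the updated file still starts with the exact bytes of the previous revision. */
    static boolean previousRevisionIntact(byte[] previousRevision, byte[] updated) {
        if (updated.length < previousRevision.length) {
            return false;
        }
        return Arrays.equals(previousRevision, Arrays.copyOf(updated, previousRevision.length));
    }

    public static void main(String[] args) {
        byte[] previous = "%PDF-1.7 ...first revision... %%EOF\n".getBytes(StandardCharsets.US_ASCII);
        byte[] increment = "...second signature... %%EOF\n".getBytes(StandardCharsets.US_ASCII);

        byte[] updated = appendIncrement(previous, increment);
        System.out.println("previous revision intact: " + previousRevisionIntact(previous, updated));
    }
}
```

Recompressing objects of the earlier revision would change those prefix bytes, so a check like this would fail and any signature covering them would presumably break.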

*Further question (out of self interest):*
 Would it be possible, or is it intended, to support this feature (PDF 
compression) in PDFBox 2?
 If so, that would save me some headaches (until PDFBox 3 is released), as 
we would like to integrate that functionality in our next release. :)


> PDF compression - object stream creation
> ----------------------------------------
>
>                 Key: PDFBOX-4952
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4952
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 2.0.21
>            Reporter: Christian Appl
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>         Attachments: 102_Spot_to_CMYK_X1a.pdf, 
> 102_Spot_to_CMYK_X1a_unc_BAD-3.0.0.pdf, 
> 102_Spot_to_CMYK_X1a_unc_GOOD-2.0.22.pdf, image-2020-09-07-09-47-30-172.png, 
> image-2020-09-07-10-05-15-631.png, image-2021-08-17-10-07-33-682.png, 
> image-2021-08-17-10-10-21-418.png, image-2021-08-17-10-21-00-352.png, 
> image-2021-08-17-10-24-44-999.png, image-2021-08-17-10-56-48-431.png
>
>
> I implemented a basic starting point for PDF compression, based on 
> PDFBox 2.0.22-SNAPSHOT.
> I want to use this ticket to ask whether you would be interested in such a 
> feature and whether you would be interested in merging it into PDFBox.
> This is sort of a POC that implements only some very basic functionality, 
> which surely must and could be extended further, and it comes with only some 
> very basic and simplistic unit tests.
>  However, it is able to reduce the size of resulting documents and creates 
> object streams as defined in the PDF reference manual.
> *What it currently does:*
>  It provides the bundling and compression of objects to objectstreams -and 
> further applies simple content compression to a small selection of contents-.
> -To realize content compression, it provides a simple interface and abstract 
> class for "ContentCompressor"s which search a document for specific content, 
> that could be compressed and do compress that contents.-
> -Currently two content compressors exist:-
>  -_ImageCompressor_-
>  -Searches for simple images, that could be compressed using DCT.-
> -_UnencodedStreamCompressor_-
>  -Searches the document for yet unencoded streams and applies a Flate 
> compression where necessary.-
> -Both compressors can be parameterized using a centralized 
> "CompressParameters" instance which is passed to a new "saveCompressed" 
> method of PDDocument.-
> The compression is based on, modifies and is realized by a set of extensions 
> for the "COSWriter" class. Basically it organizes objects, that are passed to 
> the COSWriter in objectStreams -and applies content optimization where 
> necessary and possible-.
> Currently this does support encryption, but does not support linearization of 
> the compressed documents.
> *Caveat:*
>  If this feature is interesting to you, then I would not expect you to simply 
> merge this fork into 2.0.22. I am expecting that you would like to have some 
> details and concepts changed and am ready to implement changes that would be 
> required for this to work to your liking.
> *POC:*
>  4 resulting documents can be found in "target/test-output/compression" when 
> "COSDocumentCompressionTest" is run.
> *The Pull request can be found on Github at:*
>  [https://github.com/apache/pdfbox/pull/86]


