[ 
https://issues.apache.org/jira/browse/PDFBOX-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Appl updated PDFBOX-4952:
-----------------------------------
    Description: 
I implemented a basic starting point for PDF compression, based on 
PDFBox 2.0.22-SNAPSHOT.

I want to use this ticket to ask whether you would be interested in such a 
feature and in merging it into PDFBox.

This is a sort of POC: it implements only very basic functionality, which can 
and should be extended further, and it comes with only very basic, simplistic 
unit tests.
 However, it is able to reduce the size of the resulting documents, and it 
creates object streams as defined in the PDF reference manual.

*What it currently does:*
 It bundles and compresses objects into object streams and additionally 
applies simple content compression to a small selection of content types.

To realize content compression, it provides a simple interface and an abstract 
class for "ContentCompressor"s, which search a document for specific content 
that could be compressed and then compress it.
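
As a rough illustration of the interface-plus-abstract-class idea (not the code 
from the pull request), the following sketch separates "find matching content" 
from "compress it". All type and method names here (SketchStream, 
SketchContentCompressor, AbstractSketchCompressor, canCompress, compress, 
compressAll) are hypothetical and are not PDFBox API.

```java
import java.util.List;

// Hypothetical stand-in for a PDF stream object; this is not a PDFBox class.
final class SketchStream {
    final String subtype;
    byte[] data;
    boolean encoded;

    SketchStream(String subtype, byte[] data, boolean encoded) {
        this.subtype = subtype;
        this.data = data;
        this.encoded = encoded;
    }
}

// Hypothetical contract: decide whether a stream qualifies, then compress it.
interface SketchContentCompressor {
    boolean canCompress(SketchStream stream);

    void compress(SketchStream stream);
}

// Shared search loop; concrete compressors only supply the two decisions above.
abstract class AbstractSketchCompressor implements SketchContentCompressor {
    // Visits every stream, compresses the matching ones, and returns how many matched.
    int compressAll(List<SketchStream> document) {
        int compressed = 0;
        for (SketchStream stream : document) {
            if (canCompress(stream)) {
                compress(stream);
                compressed++;
            }
        }
        return compressed;
    }
}
```

With this split, a new kind of content compression would only need a subclass 
that implements the two decision methods.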

Currently two content compressors exist:
 _ImageCompressor_
 Searches for simple images that could be compressed using DCT.

_UnencodedStreamCompressor_
 Searches the document for streams that are not yet encoded and applies Flate 
compression where necessary.
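
To make the Flate step concrete, here is a minimal sketch using only the JDK's 
java.util.zip classes. It is not taken from the pull request; the class name 
FlateSketch and its methods are made up for illustration. The JDK's default 
Deflater emits zlib-wrapped data, which is the format the PDF FlateDecode 
filter expects.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
final class FlateSketch {
    // Compresses raw stream bytes into zlib/deflate format.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); used here only to check the round trip.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A writer that stores the deflated bytes would also have to set the stream's 
/Filter entry to /FlateDecode and update /Length to the compressed size.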

Both compressors can be parameterized using a central "CompressParameters" 
instance, which is passed to a new "saveCompressed" method of PDDocument.

The compression is implemented as a set of extensions to the "COSWriter" 
class, which it also modifies. Basically, it organizes the objects passed to 
the COSWriter into object streams and applies content optimization where 
necessary and possible.
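
For readers unfamiliar with object streams, the sketch below shows the layout 
the PDF specification describes: the decoded stream starts with N pairs of 
"object number, offset" integers, followed by the objects themselves without 
"obj"/"endobj" keywords, with offsets counted from the first object. This is 
only an illustration of that layout, not the COSWriter extension from the pull 
request; ObjectStreamSketch and its members are hypothetical names, and the 
objects are assumed to be already serialized as single-byte text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical illustration of the object stream layout; not PDFBox code.
final class ObjectStreamSketch {
    final int count;   // value for the /N entry
    final int first;   // value for the /First entry
    final byte[] body; // decoded (not yet Flate-compressed) stream content

    private ObjectStreamSketch(int count, int first, byte[] body) {
        this.count = count;
        this.first = first;
        this.body = body;
    }

    // objects maps object number -> serialized object; iteration order is kept.
    static ObjectStreamSketch build(Map<Integer, String> objects) {
        StringBuilder index = new StringBuilder();
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (Map.Entry<Integer, String> entry : objects.entrySet()) {
            // The offset is relative to the first object, i.e. to /First.
            index.append(entry.getKey()).append(' ').append(data.size()).append(' ');
            byte[] serialized = entry.getValue().getBytes(StandardCharsets.ISO_8859_1);
            data.write(serialized, 0, serialized.length);
            data.write('\n');
        }
        byte[] indexBytes = index.toString().getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        body.write(indexBytes, 0, indexBytes.length);
        body.write(data.toByteArray(), 0, data.size());
        return new ObjectStreamSketch(objects.size(), indexBytes.length, body.toByteArray());
    }

    // Stream dictionary for the uncompressed body.
    String dictionary() {
        return "<< /Type /ObjStm /N " + count + " /First " + first
                + " /Length " + body.length + " >>";
    }
}
```

Only non-stream objects with generation number 0 may be placed in an object 
stream, and the file then needs a cross-reference stream to locate them. A real 
writer would usually also Flate-compress the body, which is where the size 
reduction comes from.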

Currently this supports encryption, but it does not support linearization of 
the compressed documents.

*Caveat:*
 If this feature is interesting to you, I would not expect you to simply merge 
this fork into 2.0.22. I expect that you would like to have some details and 
concepts changed, and I am ready to implement the changes required for this to 
work to your liking.

The main caveat currently is: this implementation conflicts with the changes 
discussed in PDFBOX-4723, and I may have to adapt parts of this solution if 
PDFBOX-4723 is implemented as intended.
 (This ticket is one of the reasons I am invested in that issue.)
 For example: it currently would revert the overridden "equals" and "hashCode" 
methods in COSStream and should not be merged as is if those methods are to 
remain in PDFBox 2.0.22.

The pull request can be found on GitHub at:
https://github.com/apache/pdfbox/pull/86

  was:
I implemented a basic starting point to realize a PDF compression based on 
PDFBox 2.0.22-SNAPSHOT

I want to use this ticket, to ask if you would be interested in such a feature 
and whether you would be interested to merge it into PDFBox.

This is sort of a POC, only implementing some very basic functionality, that 
surely must and could be extended further and it does only implement some very 
basic and simplistic Unit Tests.
However it is able to reduce the size of resulting documents, and creates 
objectstreams as defined in the PDF reference manual.

*What it currently does:*
It provides the bundling and compression of objects to objectstreams and 
further applies simple content compression to a small selection of contents.

To realize content compression, it provides a simple interface and abstract 
class for "ContentCompressor"s which search a document for specific content, 
that could be compressed and do compress that contents.


Currently two content compressors exist:
_ImageCompressor_
Searches for simple images, that could be compressed using DCT.

_UnencodedStreamCompressor_
Searches the document for yet unencoded streams and applies a Flate compression 
where necessary.

Both compressors can be parameterized using a centralized "CompressParameters" 
instance which is passed to a new "saveCompressed" method of PDDocument.

The compression is based on, modifies and is realized by a set of extensions 
for the "COSWriter" class. Basically it organizes objects, that are passed to 
the COSWriter in objectStreams and applies content optimization where necessary 
and possible.

Currently this does support encryption, but does not support linearization of 
the compressed documents.

*Caveat:*
If this feature is interesting to you, then I would not expect you to simply 
merge this fork into 2.0.22. I am expecting that you would like to have some 
details and concepts changed and am ready to implement changes that would be 
required for this to work to your liking.

The main caveat currently is: This implementation is conflicting with the 
changes discussed in PDFBOX-4723 and possibly I will have to adapt parts of 
this solution, if PDFBOX-4723 is implemented as intended.
(This ticket here is one of the reasons I am invested in that issue)
For example: It currently would revert the overridden "equals" and "hashcode" 
methods in COSStream and should not be merged as is, if those methods shall 
remain in PDFBOX 2.0.22.

(I will append a URL to the related fork, when it is ready.)


> PDF compression - object stream creation
> ----------------------------------------
>
>                 Key: PDFBOX-4952
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4952
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 2.0.22
>            Reporter: Christian Appl
>            Priority: Major
>             Fix For: 2.0.22
>
>


