[ https://issues.apache.org/jira/browse/PDFBOX-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612144#comment-17612144 ]

Stefan Ziegler edited comment on PDFBOX-5522 at 10/2/22 7:00 PM:
-----------------------------------------------------------------

We process a lot of PDF files created by external programs. There are several 
reasons for using our own COSWriter, for example:

1) PDF files are not always completely correct, and every now and then we have 
to intervene to correct them. We cannot change the external application.

2) Another point is recompression of COSStreams. We often process PDFs that 
are uncompressed or not optimally compressed. We can then reprocess all 
streams, determine the best compression and rewrite them.

3) A third point is extraction, e.g. we can extract all images, including 
hidden ones that we can't access through the page objects.

A very simple COSWriter that compresses streams that are not yet compressed 
would be this one:
{code:java}
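import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdfwriter.COSWriter;
import org.apache.pdfbox.pdmodel.common.PDStream;

public class CompressingCosWriter extends COSWriter {
    public CompressingCosWriter(File file) throws FileNotFoundException {
        super(new BufferedOutputStream(new FileOutputStream(file)));
    }

    public CompressingCosWriter(OutputStream outputStream) {
        super(outputStream);
    }

    @Override
    public Object visitFromStream(COSStream obj) throws IOException {
        PDStream stream = new PDStream(obj);
        List<COSName> filters = stream.getFilters();
        // Only touch streams that have no filter yet; very short streams are skipped.
        if ((filters == null || filters.isEmpty()) && stream.getLength() > 20) {
            obj.removeItem(COSName.DECODE_PARMS);
            // Read all bytes before opening the output stream, which rewrites the content.
            byte[] bytes;
            try (InputStream in = stream.createInputStream()) {
                bytes = IOUtils.toByteArray(in);
            }
            // I/O failures propagate as IOException, which this method already declares.
            try (OutputStream out = stream.createOutputStream(COSName.FLATE_DECODE)) {
                out.write(bytes);
            }
        }
        return super.visitFromStream(obj);
    }
}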
{code}
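
The "determine the best compression" step from point 2 can be sketched 
without PDFBox at all. The following is only an illustration under 
assumptions: FlateChooser, deflate, inflate and smallest are hypothetical 
names, and it uses java.util.zip directly instead of the PDFBox filter 
classes. It tries several deflate levels and keeps the raw bytes when no 
level makes the data smaller, which is also the reason a writer might skip 
very short streams.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper, not part of PDFBox: picks the smallest encoding of a raw stream body. */
final class FlateChooser {

    /** Deflates data in zlib format at the given compression level. */
    static byte[] deflate(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Inverse of deflate; used here only to verify the round trip. */
    static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
    }

    /**
     * Returns the smallest deflated candidate, or the very same array
     * instance that was passed in if no level shrinks the data.
     */
    static byte[] smallest(byte[] raw) {
        int[] levels = {Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION};
        byte[] best = raw;
        for (int level : levels) {
            byte[] candidate = deflate(raw, level);
            if (candidate.length < best.length) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Repetitive sample data standing in for an uncompressed content stream.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] best = smallest(raw);
        System.out.println("raw=" + raw.length + " bytes, best=" + best.length + " bytes");
    }
}
```

In a real writer the caller would check whether the returned array is the 
original instance (best == raw) and only then leave the stream unfiltered; 
otherwise it would store the deflated bytes with a FlateDecode filter.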
 



> Add public void save(COSWriter writer) to PDDocument
> ----------------------------------------------------
>
>                 Key: PDFBOX-5522
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5522
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: PDModel
>    Affects Versions: 2.0.28, 3.0.0 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Minor
>         Attachments: PDDocument.java.patch
>
>
> Please add the following method to PDDocument:
>  
> {code:java}
> public void save(COSWriter writer)
> {code}
>  
> Why?
> This would let us use a custom COSWriter when saving the PDF file. Inside 
> the custom COSWriter we can add checks and convert data structures where 
> required.
>  
> Patch is attached.
>  


