[ 
https://issues.apache.org/jira/browse/PDFBOX-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837034#comment-16837034
 ] 

Jonathan commented on PDFBOX-4540:
----------------------------------

I am happy to give you more information about what we have done:

The company I work for develops an electronic file management system which is 
used by some major German governmental institutions. A certain group of 
customers requested a feature where they wanted to display a large number of 
distinct PDF files as one large file. The straightforward solution was to 
download each file and then append them on the client side. This led to serious 
performance issues for our clients, so we aimed to implement a server-side 
solution.

Upon client request we now parse each individual file on the server side. To 
reduce our memory consumption here, we implemented changes to PDFParser and 
COSStream so that large streams are not loaded into memory. We then run a basic 
algorithm to append these documents to each other and linearize the result. The 
algorithm we used was inspired by QPDF. We subclassed COSWriter to allow us to 
write dummy versions of our objects in order to calculate the length and offset 
information we need for the hint tables and the linearization dictionary.
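
A minimal sketch of the dry-run idea, for illustration only: it is not the 
actual COSWriter subclass and uses no PDFBox classes. The names OffsetDryRun, 
CountingSink and measureOffsets are hypothetical, and objects are simplified to 
plain strings. Each object is written once to a sink that discards the bytes and 
only counts them, which yields the start offset of every object before anything 
real is written.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: measure object offsets in a dry run before the real write. */
final class OffsetDryRun
{
    /** Discards every byte and only tracks how many bytes were written. */
    static final class CountingSink extends OutputStream
    {
        private long position;

        @Override
        public void write(int b)
        {
            position++;
        }

        @Override
        public void write(byte[] b, int off, int len)
        {
            position += len;
        }

        long getPosition()
        {
            return position;
        }
    }

    /** Returns the start offset of each object, numbered from 1 in list order. */
    static List<Long> measureOffsets(List<String> bodies) throws IOException
    {
        CountingSink sink = new CountingSink();
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i < bodies.size(); i++)
        {
            // Record where this object would start, then "write" it.
            offsets.add(sink.getPosition());
            writeObject(sink, i + 1, bodies.get(i));
        }
        return offsets;
    }

    /** The same routine can later be pointed at a real output stream. */
    static void writeObject(OutputStream out, int number, String body) throws IOException
    {
        String text = number + " 0 obj\n" + body + "\nendobj\n";
        out.write(text.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Because the dry run and the real write share writeObject, the measured offsets 
match the final output as long as the object bodies do not change in between.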

Our COSWriter subclass also does not write to a file but to a Java object 
structure which contains all the PDF information except the aforementioned 
large streams, which are still referenced from outside. Upon client request we 
then dynamically rebuild chunks of the resulting PDF and serve them.
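
One way such chunked serving could look is sketched below. This is an 
assumption-laden illustration, not the structure described above: the class 
SegmentedDocument and its methods are hypothetical. Small parts are kept as byte 
arrays, large streams are represented by a loader that is only called when a 
requested byte range actually overlaps them.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: a document kept as segments, served by byte range. */
final class SegmentedDocument
{
    private static final class Segment
    {
        final long start;
        final int length;
        final Supplier<byte[]> source;

        Segment(long start, int length, Supplier<byte[]> source)
        {
            this.start = start;
            this.length = length;
            this.source = source;
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    private long totalLength;

    /** Adds a small part that is held in memory. */
    void addInMemory(byte[] data)
    {
        add(data.length, () -> data);
    }

    /** Adds a large part of known length that is loaded only on demand. */
    void addExternal(int length, Supplier<byte[]> loader)
    {
        add(length, loader);
    }

    private void add(int length, Supplier<byte[]> source)
    {
        segments.add(new Segment(totalLength, length, source));
        totalLength += length;
    }

    long length()
    {
        return totalLength;
    }

    /** Rebuilds the bytes in [from, to), touching only overlapping segments. */
    byte[] readRange(long from, long to)
    {
        if (from < 0 || to > totalLength || from > to)
        {
            throw new IllegalArgumentException("invalid range " + from + ".." + to);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Segment s : segments)
        {
            long segEnd = s.start + s.length;
            if (segEnd <= from || s.start >= to)
            {
                continue; // no overlap, external data is never loaded
            }
            // Simplification: the whole segment is loaded, then sliced.
            byte[] data = s.source.get();
            int begin = (int) (Math.max(from, s.start) - s.start);
            int end = (int) (Math.min(to, segEnd) - s.start);
            out.write(data, begin, end - begin);
        }
        return out.toByteArray();
    }
}
```

A real implementation would presumably read only the needed slice of an external 
stream instead of loading the whole segment; the sketch keeps that part simple.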

I am going to open further issues: one with a fix for problems with 
direct/indirect objects, and one for those lazy streams.

> COSWriter sometimes retrieves wrong ObjectKey
> ---------------------------------------------
>
>                 Key: PDFBOX-4540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: patch, pull-request-available
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: sample.pdf
>
>
> If a COSBase is directly embedded in a COSObject, it should not be assigned a 
> new object number by the writer. We suggest the following implementation for 
> `COSWriter.getObjectKey(COSBase)`: 
> {code:java}
> /**
>  * This will get the object key for the object.
>  *
>  * @param obj The object to get the key for.
>  *
>  * @return The object key for the object.
>  */
> protected COSObjectKey getObjectKey( COSBase obj )
> {
>     COSBase actual = obj;
>     if( actual instanceof COSObject )
>     {
>         actual = ((COSObject)obj).getObject();
>     }
>     COSObjectKey key = objectKeys.get(obj);
>     if( key == null && actual != null )
>     {
>         key = objectKeys.get(actual);
>     } 
>     if (key == null)
>     {
>         setNumber(getNumber()+1);
>         key = new COSObjectKey(getNumber(),0);
>         objectKeys.put(obj, key);
>         if( actual != null )
>         {
>             objectKeys.put(actual, key);
>         }
>     }
>     return key;
> }
> {code}



