[
https://issues.apache.org/jira/browse/PDFBOX-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837034#comment-16837034
]
Jonathan commented on PDFBOX-4540:
----------------------------------
I am happy to give you more information about what we have done:
The company I work for develops an electronic file management system which is
used by some major German governmental institutions. A certain group of
customers requested a feature where they wanted to display a large number of
distinct PDF files as one large file. The straightforward solution was to
download each file and then append them on the client side. This led to serious
performance issues for our clients, so we aimed to implement a server-side
solution.
Upon client request we now parse each individual file on the server side. To
reduce our memory consumption here, we implemented changes to PDFParser and
COSStream so that large streams are not loaded into memory. We then run a basic
algorithm to append these documents to each other and linearize the result. The
algorithm we used was inspired by QPDF. We subclassed COSWriter to allow us to
write dummy versions of our objects in order to calculate the length and offset
information we need for the hint tables and linearization dictionary.
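As an illustration of the dummy-write step, here is a minimal stand-alone
sketch that does not use PDFBox. The names DryRunOffsets, CountingSink and
measure are hypothetical and not taken from our COSWriter subclass; the sketch
only shows how offsets and lengths can be collected by writing into a sink that
counts bytes instead of storing them.

```java
import java.nio.charset.StandardCharsets;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: measures object offsets without producing any output. */
class DryRunOffsets
{
    /** Discards everything written to it and only remembers the byte count. */
    static final class CountingSink extends OutputStream
    {
        private long count;

        @Override
        public void write(int b)
        {
            count++;
        }

        @Override
        public void write(byte[] buf, int off, int len)
        {
            count += len;
        }

        long position()
        {
            return count;
        }
    }

    /**
     * Returns the offset at which each serialized object would start.
     * The last entry is the total length of all objects.
     */
    static List<Long> measure(List<String> serializedObjects)
    {
        List<Long> offsets = new ArrayList<>();
        CountingSink sink = new CountingSink();
        for (String object : serializedObjects)
        {
            offsets.add(sink.position());
            byte[] bytes = object.getBytes(StandardCharsets.ISO_8859_1);
            sink.write(bytes, 0, bytes.length);
        }
        offsets.add(sink.position());
        return offsets;
    }

    public static void main(String[] args)
    {
        List<String> objects = Arrays.asList(
                "1 0 obj\n<< /Type /Catalog >>\nendobj\n",
                "2 0 obj\n<< /Type /Pages >>\nendobj\n");
        System.out.println(measure(objects));
    }
}
```

A real linearizer needs more than one such pass, because the hint tables and
the linearization dictionary themselves shift the offsets they describe.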
Our COSWriter subclass also does not write to a file but to a Java object
structure which contains all the PDF information except the aforementioned
large streams, which are still referenced from outside. Upon client request we
then dynamically rebuild chunks of the resulting PDF and serve them.
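The chunk rebuilding can be pictured with the following simplified sketch. It
is not our production code: ChunkedDocument, Segment and RangeReader are
hypothetical names, and the externally referenced stream is modelled as a
callback that is only asked for the byte range a chunk actually needs.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a document made of in-memory and externally stored pieces. */
class ChunkedDocument
{
    /** Reads count bytes starting at offset from wherever the data lives. */
    interface RangeReader
    {
        byte[] read(long offset, int count);
    }

    /** One piece of the output with a known length and a way to read parts of it. */
    static final class Segment
    {
        final long length;
        final RangeReader reader;

        Segment(long length, RangeReader reader)
        {
            this.length = length;
            this.reader = reader;
        }

        /** Convenience factory for small pieces that are kept in memory. */
        static Segment inMemory(byte[] data)
        {
            return new Segment(data.length,
                    (offset, count) -> Arrays.copyOfRange(data, (int) offset, (int) offset + count));
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    void add(Segment segment)
    {
        segments.add(segment);
    }

    long length()
    {
        long total = 0;
        for (Segment segment : segments)
        {
            total += segment.length;
        }
        return total;
    }

    /** Rebuilds the bytes [start, start + count) and touches only overlapping segments. */
    byte[] chunk(long start, int count)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long end = start + count;
        long segmentStart = 0;
        for (Segment segment : segments)
        {
            long segmentEnd = segmentStart + segment.length;
            long from = Math.max(start, segmentStart);
            long to = Math.min(end, segmentEnd);
            if (from < to)
            {
                byte[] part = segment.reader.read(from - segmentStart, (int) (to - from));
                out.write(part, 0, part.length);
            }
            segmentStart = segmentEnd;
        }
        return out.toByteArray();
    }
}
```

In this model a request for a chunk that lies entirely inside the in-memory
parts never touches the external storage.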
I am going to open separate issues: one with a fix for problems with
direct/indirect objects and one for those lazy streams.
> COSWriter sometimes retrieves wrong ObjectKey
> ---------------------------------------------
>
> Key: PDFBOX-4540
> URL: https://issues.apache.org/jira/browse/PDFBOX-4540
> Project: PDFBox
> Issue Type: Bug
> Components: Writing
> Affects Versions: 2.0.14
> Reporter: Jonathan
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: patch, pull-request-available
> Fix For: 2.0.16, 3.0.0 PDFBox
>
> Attachments: sample.pdf
>
>
> If a COSBase is directly embedded in a COSObject, it should not be assigned a
> new object number by the writer. We suggest the following implementation for
> `COSWriter.getObjectKey(COSBase)`:
> {code:java}
> /**
>  * This will get the object key for the object.
>  *
>  * @param obj The object to get the key for.
>  *
>  * @return The object key for the object.
>  */
> protected COSObjectKey getObjectKey( COSBase obj )
> {
>     // Unwrap a COSObject so that the wrapper and its content share one key.
>     COSBase actual = obj;
>     if( actual instanceof COSObject )
>     {
>         actual = ((COSObject)obj).getObject();
>     }
>     // Look up the key by the given object first, then by the unwrapped object.
>     COSObjectKey key = objectKeys.get(obj);
>     if( key == null && actual != null )
>     {
>         key = objectKeys.get(actual);
>     }
>     if (key == null)
>     {
>         // Neither is known yet: assign the next object number to both.
>         setNumber(getNumber()+1);
>         key = new COSObjectKey(getNumber(),0);
>         objectKeys.put(obj, key);
>         if( actual != null )
>         {
>             objectKeys.put(actual, key);
>         }
>     }
>     return key;
> }
> {code}
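The effect of the proposed lookup order can be tried out without PDFBox in the
following simplified model. KeyAssignmentSketch, Ref and keyFor are
hypothetical stand-ins (Ref plays the role of COSObject, plain integers the
role of COSObjectKey), and the identity-based map is an assumption of this
sketch rather than a statement about COSWriter's internals.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical model of the key lookup: a wrapper and its content share one key. */
class KeyAssignmentSketch
{
    /** Stand-in for an indirect object that wraps another object. */
    static final class Ref
    {
        final Object target;

        Ref(Object target)
        {
            this.target = target;
        }
    }

    // Assumption of this sketch: objects are compared by identity.
    private final Map<Object, Integer> keys = new IdentityHashMap<>();
    private int lastNumber = 0;

    int keyFor(Object obj)
    {
        // Unwrap a reference so that both forms can be looked up.
        Object actual = (obj instanceof Ref) ? ((Ref) obj).target : obj;

        Integer key = keys.get(obj);
        if (key == null && actual != null)
        {
            key = keys.get(actual);
        }
        if (key == null)
        {
            // First sighting: assign the next number to both forms.
            key = ++lastNumber;
            keys.put(obj, key);
            if (actual != null)
            {
                keys.put(actual, key);
            }
        }
        return key;
    }

    public static void main(String[] args)
    {
        KeyAssignmentSketch writer = new KeyAssignmentSketch();
        Object dictionary = new Object();
        Ref reference = new Ref(dictionary);
        // Both calls yield the same number, in either order.
        System.out.println(writer.keyFor(dictionary));
        System.out.println(writer.keyFor(reference));
    }
}
```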