[ https://issues.apache.org/jira/browse/PDFBOX-4540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837034#comment-16837034 ]
Jonathan commented on PDFBOX-4540:
----------------------------------

I am happy to give you more information about what we have done:

The company I work for develops an electronic file management system which is used by some major German governmental institutions. A certain group of customers requested a feature to display a large number of distinct PDF files as one large file. The straightforward solution was to download each file and then append them on the client side. This led to serious performance issues for our clients, so we aimed to implement a server-side solution.

Upon client request we now parse each individual file on the server side. To reduce our memory consumption here, we implemented changes to PDFParser and COSStream so that large streams are not loaded into memory. We then run a basic algorithm to append these documents to each other and linearize the result. The algorithm we used was inspired by QPDF.

We subclassed COSWriter to allow us to write dummy versions of our objects in order to calculate the length and offset information we need for the hint tables and the linearization dictionary. Our COSWriter subclass also does not write to a file but to a Java object structure which contains all the PDF information except the aforementioned large streams, which are still referenced from outside. Upon client request we then dynamically rebuild chunks of the resulting PDF and serve them.

I am going to open some further issues: one with a fix for problems with direct/indirect objects, and one for the lazy streams.
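The dummy-write step described above can be illustrated with a minimal sketch. It is not the implementation from the comment and does not use any PDFBox API: `OffsetDryRun`, `CountingSink` and `measureOffsets` are hypothetical names, and the objects are assumed to be available as already-serialized strings. The idea shown is only that writing to a sink which counts bytes and discards them yields each object's start offset without producing a file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: measure object offsets with a dry-run write. Not PDFBox code. */
final class OffsetDryRun
{
    /** Discards everything written to it but remembers how many bytes went through. */
    static final class CountingSink extends OutputStream
    {
        private long position;

        @Override
        public void write(int b)
        {
            position++;
        }

        @Override
        public void write(byte[] buf, int off, int len)
        {
            position += len;
        }

        long getPosition()
        {
            return position;
        }
    }

    /**
     * Writes each serialized object (assumed input: object number -> serialized text)
     * to the counting sink and records the byte offset at which it starts.
     */
    static Map<Integer, Long> measureOffsets(Map<Integer, String> serializedObjects)
            throws IOException
    {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        CountingSink sink = new CountingSink();
        for (Map.Entry<Integer, String> entry : serializedObjects.entrySet())
        {
            // The offset of an object is the sink position before it is written.
            offsets.put(entry.getKey(), sink.getPosition());
            sink.write(entry.getValue().getBytes(StandardCharsets.ISO_8859_1));
        }
        return offsets;
    }
}
```

In a real linearization pass the measured lengths and offsets would feed the hint tables and the linearization dictionary; that part is left out here.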
> COSWriter sometimes retrieves wrong ObjectKey
> ---------------------------------------------
>
>                 Key: PDFBOX-4540
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4540
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Writing
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: patch, pull-request-available
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: sample.pdf
>
>
> If a COSBase is directly embedded in a COSObject, it should not be assigned a
> new object number by the writer. We suggest the following implementation for
> `COSWriter.getObjectKey(COSBase)`:
> {code:java}
>     /**
>      * This will get the object key for the object.
>      *
>      * @param obj The object to get the key for.
>      *
>      * @return The object key for the object.
>      */
>     protected COSObjectKey getObjectKey( COSBase obj )
>     {
>         COSBase actual = obj;
>         if( actual instanceof COSObject )
>         {
>             actual = ((COSObject)obj).getObject();
>         }
>         COSObjectKey key = null;
>         key = objectKeys.get(obj);
>         if( key == null && actual != null )
>         {
>             key = objectKeys.get(actual);
>         }
>         if (key == null)
>         {
>             setNumber(getNumber()+1);
>             key = new COSObjectKey(getNumber(),0);
>             objectKeys.put(obj, key);
>             if( actual != null )
>             {
>                 objectKeys.put(actual, key);
>             }
>         }
>         return key;
>     }
> {code}
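The intended effect of the suggested lookup can be shown with a small self-contained sketch. It deliberately uses hypothetical stand-ins rather than the PDFBox classes: `Ref` plays the role of a COSObject wrapping a value, plain integers stand in for COSObjectKey, and `KeySharingSketch` is an invented name. The point illustrated is that a wrapper and the value it wraps resolve to the same number, so the wrapped value is not assigned a fresh one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins only; these are NOT the PDFBox classes. */
final class KeySharingSketch
{
    /** Stand-in for an indirect reference that wraps a value (the role of COSObject). */
    static final class Ref
    {
        final Object target;

        Ref(Object target)
        {
            this.target = target;
        }
    }

    private final Map<Object, Integer> keys = new HashMap<>();
    private int lastNumber;

    /** Returns the number for obj, reusing the one registered for its wrapped value if any. */
    int keyFor(Object obj)
    {
        Object actual = (obj instanceof Ref) ? ((Ref) obj).target : obj;
        Integer key = keys.get(obj);
        if (key == null && actual != null)
        {
            // Fall back to the wrapped value before allocating a new number.
            key = keys.get(actual);
        }
        if (key == null)
        {
            key = ++lastNumber;
            keys.put(obj, key);
            if (actual != null)
            {
                // Register the wrapped value too, so later lookups share the number.
                keys.put(actual, key);
            }
        }
        return key;
    }
}
```

Asking for the key of a `Ref` first and then for the key of its target returns the same number in this sketch, while an unrelated object receives the next one.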