> On 8 Jan 2015, at 13:54, Leonard Rosenthol <lrose...@adobe.com> wrote:
>
> On 1/8/15, 9:17 PM, "John Hewson" <j...@jahewson.com> wrote:
>
>> I’d argue the opposite - in Java one expects objects to be shared unless
>> there is an explicit call to clone(). e.g. can you think of an example
>> from the Java standard library where explicit copying occurs? I can’t.
>> There’s just no way to fight this in Java, it’s a fact of life. It’s not
>> like C++ where there are copy semantics.
>
> We’ve been talking about this around the (virtual) office today and we do agree
> with you that philosophically Java assumes objects are always by
> reference. And while that could well be a useful feature it also (in this
> example, and others) doesn’t allow you to properly model the underlying
> system.

Yes, there’s an awkward mismatch between Java objects and COS objects, especially when one thinks about indirect objects in COS vs. what it means to be a reference in Java.

>
>>> I already asked this sooner, but I'm happy to repeat my question about
>>> concurrent editing.
>>> How do we ensure that the whole stuff is "foolproof", so that people
>>> who don't have a clue about the internals can use it without breaking
>>> their pdfs by accident?
>>
>> You can’t. Objects in Java are passed by reference, and there’s nothing
>> we can do about it. Today you can use the PDFBox API to take a
>> COSDictionary from one document and insert it directly into another
>> document and *it’s fine*, it works.
>
> For all the features *currently present* in PDFBox, it works. However,
> there are features present in other PDF libraries that you may wish to
> implement - for example, incremental updates - that will not be possible
> with this model because you’ve lost the connection to the original
> document.

We have some support for incremental update in PDFBox already, but I don’t see any reason why that should be limited by sharing objects. A hash map of COS objects in COSDocument is sufficient to track any update state specific to an individual COS object in a given document and has the added benefit of keeping document state out of COS object classes.

Alternatively, should we wish to store document state inside COS objects, then we would have all the information necessary to generate a meaningful error should an incremental update be attempted on a COS object which belongs to another document. In this case the solution is for the user to clone() the relevant COS object - this feels natural.

> Another good example of why you need to maintain the connection to the
> document is on-demand decryption.
> Since the key to decrypt the String or
> Stream is a combination of the file key, the encryption algorithm
> specification, AND the object number and revision (all coming from the
> original document), it’s impossible to decrypt that object data on demand
> if you’ve “connected” it with another document. Assuming that streams
> aren’t decrypted until they are accessed - this could be an issue today.

PDFBox doesn’t store the object number and revision in its COS object classes, so that’s not a problem for us. These numbers are instead stored in a hash map inside COSDocument. That means that each COS object is independent of a specific COSDocument, with the exception of the backing stream for a COSStream. I realise that this might be unusual.

Currently we don’t do on-demand decryption, but if we did, then the backing stream which is passed to COSStream could handle this. Each COSStream gets its own InputStream to read from the source file; these are created by our PDF parser. When doing an encrypted read we could have the parser create an EncryptedInputStream whose constructor takes the key as an argument, as the parser has all the information required to determine the key. This process would be opaque to COSStream, which would perform reads in the usual manner.

>
>> In other words, the current API allows users to shoot themselves in the
>> foot because it corrupts COS objects from closed documents. All I’m
>> proposing is to fix that by not clearing the memory of the COS objects when
>> closing their parent document.
>
> Is there any way you could reparent it at this point? It might require a
> back-walk up the object tree, which could be slow, but it should be
> doable. And at least for simple objects, would then make it work as
> expected.

No, because the data has been erased. Calling close() on a COSDocument loops through a hash map of every COS object from that document and clears its contents.
We’re in the process of figuring out exactly why that is and whether it is necessary for objects other than COSStream. Streams are the exception, of course, as they need their backing stream to be open still.

What I’m proposing is a fairly unexciting change to COSDocument’s close() method, but it’s yielded a useful discussion - assuming that we’re now all on the same page :)

-- John

> Leonard
>
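
For illustration only, here is a minimal sketch of the side-table idea from the discussion above: per-document state (object number, generation, a "modified" flag for incremental saves) lives in a map owned by the document, so the same object can be referenced from two documents without carrying either document's state. Every name in it (SketchObject, ObjectState, SketchDocument) is hypothetical and none of it is PDFBox API; keying the table by object identity is likewise an assumption about how such a table might be built, not a description of COSDocument.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for a COS object. It holds content only, no document state.
final class SketchObject {
    final String value;

    SketchObject(String value) {
        this.value = value;
    }
}

// Hypothetical per-document bookkeeping for one object.
final class ObjectState {
    final long number;      // object number within the owning document
    final int generation;   // generation (revision) within the owning document
    boolean modified;       // would drive an incremental save

    ObjectState(long number, int generation) {
        this.number = number;
        this.generation = generation;
    }
}

// Hypothetical stand-in for a document that owns the side table.
final class SketchDocument {
    // Keyed by identity (assumption): two equal-looking objects get separate entries.
    private final Map<SketchObject, ObjectState> states = new IdentityHashMap<>();
    private long nextNumber = 1;

    // Registers the object with this document if needed and returns its state here.
    ObjectState register(SketchObject object) {
        return states.computeIfAbsent(object, o -> new ObjectState(nextNumber++, 0));
    }

    // Flags an object for incremental update; objects unknown to this document are rejected.
    void markModified(SketchObject object) {
        ObjectState state = states.get(object);
        if (state == null) {
            throw new IllegalStateException("object is not registered with this document");
        }
        state.modified = true;
    }

    boolean isModified(SketchObject object) {
        ObjectState state = states.get(object);
        return state != null && state.modified;
    }
}

final class SideTableDemo {
    public static void main(String[] args) {
        SketchDocument first = new SketchDocument();
        SketchDocument second = new SketchDocument();
        SketchObject shared = new SketchObject("shared dictionary");

        first.register(shared);
        second.register(shared);     // same object, separate state per document
        first.markModified(shared);

        System.out.println(first.isModified(shared));   // true
        System.out.println(second.isModified(shared));  // false
    }
}
```

In this sketch the shared object never learns which document it belongs to, and an attempt to flag an unregistered object fails with an explicit error rather than silently corrupting state.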