Hi Aaron,

Thanks for that feedback, it has provoked some thoughts.
The impetus was, as you probably know, just to have a way of chaining my fine-grained REST methods to make several changes to a datastream without generating several versions.

On Sat, 2009-10-31 at 23:25 +0100, Aaron Birkland wrote:
> > I have an idea for a transactional system for Fedora, that I would like
> > to hear your opinion on.
>
> I had some time to look it over (comments inline). First, some
> background: Transactions have been a topic discussed for many years.
> At a fedora architecture summit in early 2007, support for transactions
> was identified as a desirable feature - but that topic is not without
> controversy. There is a related JIRA item you should be aware of:
>
> https://fedora-commons.org/jira/browse/FCREPO-435
>
> The discussion attached with that issue gives some history. It is
> interesting to note that this discussion first started with the idea of
> a "super method": a single method containing all data/operations. The
> issue has since evolved, but I do not believe there is any consensus for
> an approach. Your idea essentially takes the approach of exposing
> Fedora as a transactional resource.

I was not aware of the history, only that the issue had been discussed for some years. The idea of a super method is new to me, and my gut feeling is against it. It really feels like running blind, i.e. performing changes without being able to see what you are doing. Like sailing in a submarine.

> In my own view, if Fedora offered transactions over current operations,
> I would try to avoid using them for multiple operations unless they are
> required by the application at hand, or if the solution would be
> needlessly complex otherwise. Transactions would be a useful tool when
> needed, but I would assume that using transactions would come at a cost.

Transactions would not be very costly, but not free either. There will have to be at least three operations (start, do, commit) rather than one.
Transactions would also involve extra caching, which might be memory-demanding or I/O-intensive, so performance would be worse.

> > 1. First, the Fedora system attempt to get a write-lock on the object.
> > The object is being written as part of this transaction, and does not
> > allow other processes to edit it.
>
> There would be a risk of deadlocks here if processes try to obtain locks
> for multiple objects in sequence. The implementation would need to pick
> a strategy for avoiding or detecting deadlocks

Indeed. A synchronized lockObjectsAsPartOfTransaction method, taking a list of PIDs, would enable a transaction to lock a series of objects. If any of the objects are already locked, the operation will fail. Still, such a method could be executed several times in series, rather than once, and thus two processes could deadlock. I do not have a solution for that at hand.

> > 2. The object is parsed into memory, and stored as part of the
> > transaction.
> > 3. The change is executed on the in-memory object, preserving
> > information about which new datastream is created.
>
> I don't think all these details are essential to your proposal. Rather
> than in memory, you may find that it is best to serialize/store objects
> in temporary files, a blob store, etc. It is not even strictly
> necessary to separately track changes - a comparison with the original
> object might suffice. Large managed datastream uploads would almost
> certainly require storing some content in temporary files (Fedora does
> that now).

You are right, I mixed implementation details with design details. Some sort of store is needed for the objects to be changed in the transaction, so that changes can be made without affecting the repository.

> > 4. In case the change involves something deleted in the same
> > transaction: This cause an error, and the change is not carried out.
>
> Would this abort the transaction, or just the current operation?

Good catch. It would only abort the current operation.
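
To make the points above a little more concrete, here is a minimal Python sketch (Python only for brevity; none of this is existing Fedora code). It assumes a hypothetical LockManager whose lock_all plays the role of the synchronized lockObjectsAsPartOfTransaction method, and a hypothetical Transaction that works on copies of the objects and rejects a change to a datastream deleted earlier in the same transaction, without aborting the transaction itself. The repository is modelled as a plain dict from PID to datastreams, which is purely an assumption for illustration.

```python
import threading


class LockManager:
    """Hypothetical stand-in for object locking; not an existing Fedora API."""

    def __init__(self):
        self._guard = threading.Lock()   # makes lock_all atomic ("synchronized")
        self._owners = {}                # pid -> id of the transaction holding it

    def lock_all(self, tx_id, pids):
        """Lock every PID or none of them; fail instead of waiting."""
        with self._guard:
            busy = [p for p in pids if self._owners.get(p, tx_id) != tx_id]
            if busy:
                return False             # nothing was locked, nothing to release
            for p in pids:
                self._owners[p] = tx_id
            return True

    def release_all(self, tx_id):
        with self._guard:
            for p in [p for p, o in self._owners.items() if o == tx_id]:
                del self._owners[p]


class OperationError(Exception):
    """A single operation failed; the transaction itself stays open."""


class Transaction:
    def __init__(self, tx_id, repo, locks):
        self.tx_id, self.repo, self.locks = tx_id, repo, locks
        self.staged = {}                 # pid -> working copy of its datastreams
        self.deleted = set()             # (pid, datastream id) deleted in this tx

    def begin(self, pids):
        if not self.locks.lock_all(self.tx_id, pids):
            return False
        # Work on copies so the repository is untouched until commit.
        self.staged = {p: dict(self.repo[p]) for p in pids}
        return True

    def delete_datastream(self, pid, ds_id):
        self.staged[pid].pop(ds_id, None)
        self.deleted.add((pid, ds_id))

    def modify_datastream(self, pid, ds_id, content):
        if (pid, ds_id) in self.deleted:
            raise OperationError(f"{pid}/{ds_id} was deleted in this transaction")
        self.staged[pid][ds_id] = content

    def commit(self):
        # Crash recovery during this step is deliberately left out here.
        self.repo.update(self.staged)
        self.locks.release_all(self.tx_id)
```

In this sketch a failed lock_all returns immediately instead of waiting, so two transactions that each call it several times end up with one of them failing rather than both waiting on each other. What the failing transaction should do next (roll back, retry later) is left open.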

> > When the change is committed:
> > First, the system upgrades all write locks on modified objects to read
> > locks. The locked objects are parsed into memory, and used to service
> > read requests while the transaction is written to storage.
>
> I don't know if this would be strictly necessary. This proposal has an
> inherent visibility risk. If a commit fails, than all objects in the
> transaction will (eventually) cease being visible. However, there was
> a (brief) period of time in which they were visible. If that remains true,
> then having objects become visible as they are processed during a commit
> may not be any worse.

No, all objects will not "(eventually) cease being visible." In that block, I was specifically thinking of objects that existed in the repository before the transaction was started. These should remain visible throughout the transaction. All the changes, to any objects involved in the transaction, should be made visible at the same time. Objects created as part of a transaction should not be visible until the transaction is completed. They should not be briefly visible; they should be nonexistent until all the changes from the transaction have been committed.

> > (This is the risky step. If, and only if, the fedora system goes down
> > while committing a transaction, will the repository be left in an
> > inconsistent state.)
>
> There are ways around that (such as write-ahead logs, etc). If a client
> has to live with this uncertainty (i.e SOME objects may not have
> changed, but you have no way of knowing that without checking), the
> transactional system is of little or no value. Thus, the repository
> MUST be able to recover from a crash during write. Ideally, it would be
> able to finish the transaction, but aborting the transaction and undoing
> all changes might be acceptable as well (I believe you propose that
> later on in the message).

I would not say that the transactional system would be of little or no value if the client is not guaranteed a consistent repository after the repository goes down. But yes, this must definitely be handled, because it would otherwise be a major flaw of the design. So, if the fedora system goes down
1. Before a transaction is committed: The transaction is lost.
2. During the transaction commit:
   Easy solution: The transaction is lost.
   Hard solution: The transaction is committed.

> > 2. The server goes down during a transaction, before commit is called:
> > The transaction is stored only in memory (or similar non-persistent
> > storage) so all recollection of the transaction is only relevant while
> > the server is running. Reboot removes all transactions.
>
> Again, I believe in-memory objects is a non-essential implementation
> detail. Perhaps temporary files are cleaned up, or in-progress data is
> removed from some form of storage.

Yes.

> > 3. The server goes down during a commit: The repo is left inconsistent.
> > Easiest way to mitigate: Write the old versions of the objects to some
> > more permanent store before changing the versions in the repository.
> > When the server starts up, restore any objects from this store, so that
> > the finished repo is consistent again.
>
> You could go in either direction: clean up after a failed commit, or to
> finish the commit. Completing the commit would be possible if the
> in-transaction objects are stored as files, or a WAL is maintained. I
> would tend to prefer finishing the commit. For example, changes to an
> object can never appear to "go away" if the commit needs to be rolled
> back.

I do not understand the last sentence. But aside from that, yes, finishing the commit would be better, and perhaps not undoable.

> > Gotchas in this approach:
> > 1. The triple store will not be transactional. It will reflect the
> > current contents of the repo.
> > We can, as the last part of a transaction,
> > ingest all the rdf statements from the changed objects into the triple
> > store, so that it gets them all at the same time. Still, there will be a
> > tiny desynchronisation between fedora and triple store.
>
> In theory, the triple store could be a transactional resource, but
> that's probably not the best approach. It would be safest to assume
> that most infrastructure surrounding Fedora is non-transactional, and
> that it will not see the change-set in an atomic manner.

In fact, the triple store can see the change-set in an atomic manner, i.e. all the changes from the transaction are flushed at the same time, in one block. What I meant was that the triple store will follow the repository, not the temporary objects from a transaction.

> Consider an application listening for messages via JMS. Would it be
> possible/practical/desirable, to summarize all objects affected by a
> transaction in a single JMS message? What if we assume that a
> transaction involving multiple objects will result in multiple JMS
> messages, sent serially?

This aspect would require some thought. My initial thought is that there should be one JMS message per object, as that is the current design. Some changes could be bundled together in one JMS message, while others would possibly require several, but that will have to be decided. Having a JMS message just for the finished transaction could probably also be useful.

I will write up this proposal on the wiki, and give it some more detail.

Regards

> -Aaron
_______________________________________________
Fedora-commons-developers mailing list
Fedora-commons-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers