Hmmm... doesn't that just push the problem farther downstream? An ingest involves several changes to storage, as Steve pointed out, and any one of them may fail, which should invalidate the whole batch of changes. And if the swap fails on one or more of the changes? We're right back where we started.
I'm wondering if the built-in messaging framework could be used to provide a record of transactions, a record that could then be replayed or undone by another class whose sole job is to undo changes in a transaction. The only change we'd need to make to the current method DefaultDOManager.doCommit() is to send a message after every step that alters the state of bytes in storage. Another class, or even an external service, could receive those messages and log them; and another class (or service) could unroll the transactions if an error message appears or doCommit() catches an exception. Once the final state change has been recorded and doCommit() is done, the transaction messages could be deleted (or not).

These messages are different from logging messages, insofar as they're simply records of attempted state changes and their exit statuses, designed to be acted upon, not necessarily read in a log. Also, if any of the rollback actions fail (such as a purge), the failing action could be identified in the logs as a rollback (cleanup) action, as opposed to a standalone purge.

-- Scott

On 11/16/11, Michael Della Bitta wrote:
> My feeling is that a lot could be gained by writing changes to a temp file
> and only swapping it into place if successful.
>
> Michael
>
> On Nov 16, 2011 3:26 PM, "Stephen Bayliss"
> <stephen.bayl...@acuityunlimited.net> wrote:
>
> > This is following on from a conversation we had at yesterday's committer
> > meeting where Eddie mentioned he had a scenario with some ingests failing
> > under heavy load, leading to potential inconsistencies, e.g. between the
> > registry and the object and datastream store.
> >
> > I think there are two separate types of problem here:
> >
> > a) being able to debug to isolate the cause of a failure
> > b) being able to fix the cause of the failure itself
> >
> > I have had cause recently to look again at DefaultDOManager.doCommit(...) -
> > which is essentially where a new/modified/deleted digital object is
> > committed to storage.
> >
> > Some observations here are that (aside from there being a lot of code -
> > 300-odd lines in a single method), the structure and error reporting make
> > it difficult to distinguish genuine causes of concern from situations like
> > "this looks a bit odd but is probably ok".
> >
> > An example:
> >
> > If a commit fails (indicated by an exception being thrown), a tidy-up is
> > executed by re-invoking doCommit(...) with "remove" set to true.
> > Essentially this is a purge. If anything fails on the purge, a warning
> > message is logged (including in some cases the text "but that might be
> > ok"). It needs to be a warning because, if the ingest failed part-way
> > through, some of the things the clean-up expects to find won't be there.
> >
> > I think there's a big difference between cleaning up after a failed
> > operation and doing a purge. If a purge fails to remove something that was
> > expected then that should be logged as an error; but as this code is used
> > for both clean-up and purging it's not possible to distinguish between the
> > two.
> >
> > Just one example - but it highlights (a) - being able to debug. The
> > logging is not useful in terms of indicating genuine error conditions.
> >
> > I think we could do some beneficial refactoring of the existing code,
> > which hopefully would not risk changing existing functionality, to better
> > distinguish genuine error conditions.
> >
> > It would be useful if the various storage components (datastreams, foxml,
> > resource index, registry) were wrapped within some basic
> > transactioning/rollback capabilities - so that any cleanup code knows what
> > it should be cleaning up (rather than attempting to clean up everything);
> > and then anything that couldn't be cleaned up can be logged as an error.
> > Similarly any code that tries to persist a modification to one of the
> > storage components but fails should be logged as an error rather than a
> > warning.
> >
> > There is potential for making things better in this code in general - I
> > can see there could be other situations leading to an inconsistency. For
> > example, if managed content datastreams are successfully updated (or new
> > versions added), but the resource index update then fails, the foxml won't
> > be updated and nor will the registry, but the managed content updates will
> > have already been persisted; so we have potential for an inconsistency.
> > Again I'm not sure that the current exception handling and logging is of
> > that much help in identifying problematic situations.
> >
> > I think we should focus on doing what we can to improve (a) in the first
> > instance, if possible without disturbing too much of the logic already
> > embodied in this piece of code. And to try and do this without negatively
> > impacting performance given the focus of 3.6.
> >
> > Steve
> >
> > ------------------------------------------------------------------------------
> > All the data continuously generated in your IT infrastructure
> > contains a definitive record of customers, application performance,
> > security threats, fraudulent activity, and more. Splunk takes this
> > data and makes sense of it. IT sense. And common sense.
> > http://p.sf.net/sfu/splunk-novd2d
> > _______________________________________________
> > Fedora-commons-developers mailing list
> > Fedora-commons-developers@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers

--
Scott Prater
Library, Instructional, and Research Applications (LIRA)
Division of Information Technology (DoIT)
University of Wisconsin - Madison
pra...@wisc.edu
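
For illustration only, the record-then-undo idea proposed at the top of this message (record each step that alters storage, then unroll only those steps on failure) might be sketched roughly as below. Everything here is an assumption made for the example: TxJournal, record(), rollback() and clear() are hypothetical names, not part of DefaultDOManager or of the messaging framework, and an in-memory stack stands in for whatever would actually receive and log the messages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical journal of storage steps; not an existing Fedora class. */
class TxJournal {
    /** One recorded step: a label plus the action that reverses it. */
    private static final class Step {
        final String name;
        final Runnable undo;
        Step(String name, Runnable undo) {
            this.name = name;
            this.undo = undo;
        }
    }

    // Most recent step sits on top, so rollback runs in reverse order.
    private final Deque<Step> steps = new ArrayDeque<>();

    /** Call after each step that has successfully changed bytes in storage. */
    void record(String name, Runnable undo) {
        steps.push(new Step(name, undo));
    }

    /** The commit finished cleanly, so the records may be discarded. */
    void clear() {
        steps.clear();
    }

    /**
     * Undo the recorded steps, most recent first. Only steps that actually
     * happened are touched, so a failure here is a genuine error rather
     * than a "might be ok" warning. Returns the names of the steps whose
     * undo action failed.
     */
    List<String> rollback() {
        List<String> failed = new ArrayList<>();
        while (!steps.isEmpty()) {
            Step s = steps.pop();
            try {
                s.undo.run();
            } catch (RuntimeException e) {
                // Reportable as a failed rollback action, not a failed purge.
                failed.add(s.name);
            }
        }
        return failed;
    }
}
```

In this sketch a commit routine would call record() after each successful write, rollback() from its exception handler, and clear() once the final state change is done. Because the journal lists only what was really written, anything rollback() returns can be logged as an error, which is the distinction between clean-up and a standalone purge discussed in the thread.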