To clarify, I think anything we do in 3.6 around rollback/transactions
should be focussed only on ensuring or improving on Fedora's
"preservation guarantee". This should not be about introducing
transactional behaviour in the sense that it could be exposed at the
API level.

Michael's last post made me think about whether we should restrict any
transaction/rollback behaviour at this stage to the "canonical store"
only, ie to ensure that persisted FOXML and managed content are always
in sync, and just log errors for any other failures.

For reference, the stores/services involved are:

  Objects     - through the ILowlevelStorage interface
  Datastreams - through the ILowlevelStorage interface
  FieldSearch - through the FieldSearch module
  RI          - through the Resource Index module
  Registry    - SQL updates via the connection pool for the db
  ReaderCache - through DOReaderCache

And Adam raises a good point, I think, about dealing with instances
and services. I think it is a question for the future... We'd need to
consider at what level to introduce such behaviour with respect to the
above "stores".

Steve

> -----Original Message-----
> From: aj...@virginia.edu [mailto:aj...@virginia.edu]
> Sent: 17 November 2011 21:54
> To: fedora-commons-developers@lists.sourceforge.net Developers
> Subject: Re: [fcrepo-dev] Fedora failing under heavy load -
> debugging the issues
>
> I agree that JTA might be the best direction, but as Steve points
> out, that's a good ways away at best. Particularly, I wouldn't like
> to spend too much thought on that question until the work around
> High Level Storage is a bit further along. I agree completely with
> Steve's suggestions for logging changes, and particularly in light
> of this point:
>
> > we're unable from log analysis to easily trawl log files and
> > distinguish between eg invalid requests, invalid FOXML etc and a
> > failure in persisting modifications to an object.
>
> Steve--
>
> The way I read your suggestions below, you're introducing something
> of a two-phase commit pattern for persistence, yes? One question
> that arises in that context: what are the subsystems that would need
> to participate in that polling? In the list "(object, datastream,
> resource index, registry, fieldSearch)" we have both instances and
> services...
> can we make that a list of services only?
>
> To the introduction of rollback handling:
>
> > But I do wonder here whether this is straying outside of the 3.6
> > remit (and it has the risk of degrading performance potentially).
>
> I think it is. Introducing rollback handling creates, to my mind,
> the burden of a particular explicit set of guarantees to clients of
> the repository. That seems like a worthy goal to me, but it's not
> one to be undertaken lightly, and the point about performance is a
> very strong one. I would suggest that we rear back from this
> boundary, unless we have immediately and very powerfully compelling
> use cases. In a different scope, introducing this level of
> transactionality is definitely not in the spirit of a "feature-free"
> final 3-series release.
>
> ---
> A. Soroka
> Online Library Environment
> the University of Virginia Library
>
> On Nov 17, 2011, at 5:44 AM, Stephen Bayliss wrote:
>
> > I wonder if JTA isn't the way to go on this.
> >
> > But I think it would be useful to limit the scope to the objectives
> > and goals of 3.6, and recognise that full transaction handling is
> > something for the future; we may not want to start down a track
> > now that sets us on a course that may need to change in the
> > future.
> >
> > Around the 3.6 theme of scalability: being able to debug and
> > understand what's going on when Fedora fails under heavy
> > ingest/modification load with a number of threads would seem like a
> > useful objective.
> >
> > The current "commit" code handles exceptions/errors thus, as far as
> > I can see (at a high level):
> >
> > 1) On ingest, if there is a failure:
> >    - an exception is thrown and caught
> >    - tidy up by doing a purge, reporting any purge failures as
> >      warnings
> >    - re-throw as ServerException
> >    - reported as WARN in the logs by REST API exception handling
> >
> > 2) On a modify:
> >    - an exception is thrown and caught
> >    - no attempt at tidy-up
> >    - re-throw ServerException
> >    - reported as WARN in the logs by REST API exception handling
> >
> > So first stage improvements could be:
> >
> > a) For each "store" (object, datastream, resource index, registry,
> >    fieldSearch) record (just a boolean flag) whether or not changes
> >    were successfully made.
> > b) Revise handling of ingest tidy-up code: don't use the current
> >    purge code, but instead explicitly tidy up whichever of the
> >    stores were updated. Log error messages if tidy-up did not
> >    succeed.
> > c) Change error logging in the purge code to ERROR (they are WARN
> >    at the moment because of the re-use of the purge code for ingest
> >    tidy-up).
> > d) Ensure an explicit ERROR is logged for the ingest failure when
> >    one of the stores failed to be updated (currently I think the
> >    exception just bubbles up and is reported as a WARN by eg the
> >    REST API exception handling).
> >
> > This would at least give us greater insight into what's happening
> > when there is a failure under heavy ingest.
> >
> > The next stage could be:
> >
> > a) Revise other error logging in doCommit() so that an explicit
> >    ERROR is logged if an update to a store fails. The current
> >    exception handling for the REST API catches the exceptions and
> >    logs these in general as WARN; so we're unable from log analysis
> >    to easily trawl log files and distinguish between eg invalid
> >    requests, invalid FOXML etc and a failure in persisting
> >    modifications to an object.
> >    There is a danger here though of confusing errors from invalid
> >    requests (eg deleting a datastream that doesn't exist) with
> >    genuine errors from the doCommit(). We'd need a review of
> >    DefaultManagement code to see when detecting an error in the
> >    request is reliant on bubbling up exceptions such as
> >    ObjectNotInLowLevelStorage from the commit, so we could
> >    distinguish between deleting a datastream that does not exist vs
> >    deleting a datastream failing because the expected file isn't
> >    present - though from a quick look error handling for requests
> >    on non-existent content isn't reliant on these.
> > b) Revise other error logging elsewhere to ensure we are
> >    distinguishing between genuine repository errors (which should
> >    be logged as ERROR) and eg requests for non-existent objects and
> >    datastreams. (As an example - try deleting the file from the
> >    object store for a managed content datastream; and then try and
> >    delete the datastream in Fedora - you'll get a 200 response with
> >    a WARN in the log, with a stacktrace showing that doCommit
> >    failed.)
> >
> > And then:
> >
> > a) Implement basic transaction/rollback handling - this should be
> >    at the level of recording what changes have been made to each
> >    individual store for a single request so they can be backed out
> >    in case of failure. This would reduce instances of an
> >    inconsistent repository when modifications fail. But I do wonder
> >    here whether this is straying outside of the 3.6 remit (and it
> >    has the risk of degrading performance potentially).
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Scott Prater [mailto:pra...@wisc.edu]
> >> Sent: 17 November 2011 05:54
> >> To: fedora-commons-developers@lists.sourceforge.net
> >> Subject: Re: [fcrepo-dev] Fedora failing under heavy load -
> >> debugging the issues
> >>
> >> Hmmm... doesn't that just push the problem farther downstream?
> >> An ingest involves several changes to storage, as Steve pointed
> >> out, and any one of them may fail, which should invalidate the
> >> whole batch of changes. And if the swap fails on one or more of
> >> the changes? We're right back where we started.
> >>
> >> I'm wondering if the built-in messaging framework could be used to
> >> provide a record of transactions, a record that could be then
> >> replayed or undone by another class whose sole job is to undo
> >> changes in a transaction. The only change we'd need to make to the
> >> current method DefaultDOManager.doCommit() is to send a message
> >> after every step that alters the state of bytes in storage.
> >> Another class, or even external service, could receive those
> >> messages and log them; and another class (or service) could unroll
> >> the transactions in the event an error message appears, or
> >> doCommit() catches an exception.
> >>
> >> Once the final state change has been recorded, and doCommit() is
> >> done, the transaction messages could be deleted (or not).
> >>
> >> These messages are different from logging messages, insofar as
> >> they're simply the records of attempts at state changes and their
> >> exit statuses, designed to be acted upon, not necessarily read in
> >> a log. Also, if any of the rollback actions fail (such as a
> >> purge), the failing action could be identified in the logs as a
> >> rollback (cleanup) action, as opposed to a standalone purge.
> >>
> >> -- Scott
> >>
> >> On 11/16/11, Michael Della Bitta wrote:
> >>>
> >>> My feeling is that a lot could be gained by writing changes to a
> >>> temp file and only swapping it into place if successful.
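
One way to read the temp-file suggestion, sketched in Java. This is an illustration under stated assumptions, not anything taken from the Fedora codebase: the class name TempFileSwap and the method writeThenSwap are hypothetical. It covers a single file only, so it does not by itself address keeping several stores consistent. It relies on java.nio.file.Files.move with ATOMIC_MOVE; whether an existing target is replaced by an atomic move is implementation-specific.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not Fedora code): write to a temp file, then swap it into place. */
final class TempFileSwap {

    private TempFileSwap() {}

    static void writeThenSwap(Path target, byte[] content) throws IOException {
        // Create the temp file next to the target so the final move stays on
        // one file system, which an atomic rename generally requires.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "commit-", ".tmp");
        try {
            Files.write(temp, content);
            try {
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback: still replaces the target, but without the atomicity guarantee.
                Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // Does nothing after a successful move; removes the leftover temp file on failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the write fails, the target is left untouched and the temp file is removed, so a reader never sees a half-written file.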
> >>>
> >>> Michael
> >>>
> >>> On Nov 16, 2011 3:26 PM, "Stephen Bayliss"
> >>> <stephen.bayl...@acuityunlimited.net> wrote:
> >>>
> >>>> This is following on from a conversation we had at yesterday's
> >>>> committer meeting where Eddie mentioned he had a scenario with
> >>>> some ingests failing under heavy load leading to potential
> >>>> inconsistencies eg between the registry and the object and
> >>>> datastream store.
> >>>>
> >>>> I think there are two separate types of problem here:
> >>>>
> >>>> a) being able to debug to isolate the cause of a failure
> >>>> b) being able to fix the cause of the failure itself
> >>>>
> >>>> I have had cause recently to look again at
> >>>> DefaultDOManager.doCommit(...) - which is essentially where a
> >>>> new/modified/deleted digital object is committed to storage.
> >>>>
> >>>> Some observations here are that (aside from there being a lot of
> >>>> code - 300-odd lines in a single method), the structure and
> >>>> error reporting make it difficult to distinguish genuine causes
> >>>> of concern from situations like "this looks a bit odd but is
> >>>> probably ok".
> >>>>
> >>>> An example:
> >>>>
> >>>> If a commit fails (indicated by an exception being thrown), a
> >>>> tidy-up is executed by re-invoking doCommit(...) with "remove"
> >>>> set to true. Essentially this is a purge.
> >>>> If anything fails on the purge, a warning message is logged
> >>>> (including in some cases the text "but that might be ok"). It
> >>>> needs to be a warning because, if the ingest failed part-way
> >>>> through, some things won't be there as expected to clean up.
> >>>>
> >>>> I think there's a big difference between cleaning up after a
> >>>> failed operation and doing a purge. If a purge fails to remove
> >>>> something that was expected then that should be logged as an
> >>>> error; but as this code is used for both clean-up and purging
> >>>> it's not possible to distinguish between the two.
> >>>>
> >>>> Just one example - but it highlights (a) - being able to debug.
> >>>> The logging is not useful in terms of indicating genuine error
> >>>> conditions.
> >>>>
> >>>> I think we could do some beneficial refactoring of the existing
> >>>> code, which hopefully would not risk changing existing
> >>>> functionality, to better distinguish genuine error conditions.
> >>>>
> >>>> It would be useful if the various storage components
> >>>> (datastreams, foxml, resource index, registry) were wrapped
> >>>> within some basic transactioning/rollback capabilities - so that
> >>>> any cleanup code knows what it should be cleaning up (rather
> >>>> than attempting to clean up everything); and then anything that
> >>>> couldn't be cleaned up can be logged as an error. Similarly any
> >>>> code that tries to persist a modification to one of the storage
> >>>> components but fails should be logged as an error rather than a
> >>>> warning.
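
A minimal sketch of that bookkeeping, assuming each store update can supply its own undo step. Nothing here is taken from Fedora: CommitTracker, Undo, recordSuccess and cleanUp are hypothetical names, and SEVERE from java.util.logging stands in for ERROR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch (not Fedora code): remembers which stores a commit changed. */
final class CommitTracker {

    /** One undo step for a store that was successfully updated. */
    interface Undo {
        void run() throws Exception;
    }

    private static final class Entry {
        final String store;
        final Undo undo;

        Entry(String store, Undo undo) {
            this.store = store;
            this.undo = undo;
        }
    }

    private static final Logger LOG = Logger.getLogger(CommitTracker.class.getName());

    private final Deque<Entry> applied = new ArrayDeque<>();

    /** Call only after the update to the named store has succeeded. */
    void recordSuccess(String store, Undo undo) {
        applied.push(new Entry(store, undo));
    }

    /**
     * Undoes the recorded changes, most recent first, and nothing else.
     * Returns the names of stores whose clean-up failed; each failure is
     * logged at SEVERE, since the change is known to have been made.
     */
    List<String> cleanUp() {
        List<String> failed = new ArrayList<>();
        while (!applied.isEmpty()) {
            Entry entry = applied.pop();
            try {
                entry.undo.run();
            } catch (Exception ex) {
                failed.add(entry.store);
                LOG.log(Level.SEVERE, "Clean-up failed for store: " + entry.store, ex);
            }
        }
        return failed;
    }
}
```

A commit routine would call recordSuccess after each store update and call cleanUp from its exception handler. Because only stores that were actually changed are touched, a clean-up failure can be reported as an error rather than as "might be ok".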
> >>>>
> >>>> There is potential for making things better in this code in
> >>>> general - I can see there could be other situations leading to
> >>>> an inconsistency. For example if managed content datastreams are
> >>>> successfully updated (or new versions added), but for instance
> >>>> the resource index update fails, then the foxml won't be updated
> >>>> and nor will the registry, but the managed content updates will
> >>>> have already been persisted; so we have potential for an
> >>>> inconsistency. Again I'm not sure that the current exception
> >>>> handling and logging is of that much help in identifying
> >>>> problematic situations.
> >>>>
> >>>> I think we should focus on doing what we can to improve (a) in
> >>>> the first instance, if possible without disturbing too much of
> >>>> the logic already embodied in this piece of code. And to try and
> >>>> do this without negatively impacting performance given the focus
> >>>> of 3.6.
> >>>>
> >>>> Steve
> >>>>
> >>>> ------------------------------------------------------------------
> >>>> All the data continuously generated in your IT infrastructure
> >>>> contains a definitive record of customers, application
> >>>> performance, security threats, fraudulent activity, and more.
> >>>> Splunk takes this data and makes sense of it. IT sense. And
> >>>> common sense.
> >>>> http://p.sf.net/sfu/splunk-novd2d
> >>>> _______________________________________________
> >>>> Fedora-commons-developers mailing list
> >>>> Fedora-commons-developers@lists.sourceforge.net
> >>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
> >>
> >> --
> >> Scott Prater
> >> Library, Instructional, and Research Applications (LIRA)
> >> Division of Information Technology (DoIT)
> >> University of Wisconsin - Madison
> >> pra...@wisc.edu