Re: [fcrepo-dev] Fedora failing under heavy load - debugging the issues

Stephen Bayliss Fri, 18 Nov 2011 00:43:20 -0800

To clarify, I think anything we do in 3.6 around rollback/transactions should
be focussed only on ensuring/improving Fedora's "preservation guarantee"
- this should not be about introducing transactional behaviour in the sense
that it could be exposed at the API level.


Michael's last post made me wonder whether we should restrict any
transaction/rollback behaviour at this stage to the "canonical store" only -
ie to ensure that persisted FOXML and managed content are always in sync -
and just log errors for any other failures.

For reference, the store/services involved are:
Objects - through ILowlevelStorage interface
Datastreams - through ILowlevelStorage interface
FieldSearch - through the FieldSearch module
RI - through the Resource Index module
Registry - SQL updates via the connection pool for the db
ReaderCache - through DOReaderCache
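
A minimal sketch of what "canonical store only" rollback might look like.
None of these names exist in Fedora: Action, StoreUpdate and
CanonicalCommitSketch are hypothetical, the store labels are plain strings,
and running the canonical updates first is an assumption of the sketch, not
a description of what doCommit() does today.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical action against a store; declared so lambdas may throw. */
interface Action {
    void run() throws Exception;
}

/** Hypothetical description of one store update within a single commit. */
final class StoreUpdate {
    final String store;       // e.g. "Objects", "RI" (labels only)
    final boolean canonical;  // true for FOXML and managed content
    final Action apply;
    final Action undo;

    StoreUpdate(String store, boolean canonical, Action apply, Action undo) {
        this.store = store;
        this.canonical = canonical;
        this.apply = apply;
        this.undo = undo;
    }
}

final class CanonicalCommitSketch {
    /** Stand-in for ERROR-level log output. */
    final List<String> errors = new ArrayList<>();

    /** Returns false if the canonical updates had to be backed out. */
    boolean commit(List<StoreUpdate> updates) {
        List<StoreUpdate> done = new ArrayList<>();
        // Assumption: canonical updates run first so that they can be
        // backed out before any secondary store has been touched.
        for (StoreUpdate u : updates) {
            if (!u.canonical) continue;
            try {
                u.apply.run();
                done.add(u);
            } catch (Exception e) {
                errors.add("canonical update failed: " + u.store);
                for (int i = done.size() - 1; i >= 0; i--) {
                    backOut(done.get(i));
                }
                return false;
            }
        }
        // Secondary stores: no rollback, just record the failure.
        for (StoreUpdate u : updates) {
            if (u.canonical) continue;
            try {
                u.apply.run();
            } catch (Exception e) {
                errors.add("secondary update failed: " + u.store);
            }
        }
        return true;
    }

    private void backOut(StoreUpdate u) {
        try {
            u.undo.run();
        } catch (Exception e) {
            errors.add("rollback failed: " + u.store);
        }
    }
}
```

With this shape a failed FieldSearch or RI update leaves the FOXML and
managed content in place and only produces an error entry, while a failed
canonical update undoes the canonical updates already made for that request.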

And Adam raises a good point, I think, about dealing with instances and
services.  I think it is a question for the future...  We'd need to consider
at what level to introduce such behaviour with respect to the above "stores".

Steve

> -----Original Message-----
> From: aj...@virginia.edu [mailto:aj...@virginia.edu] 
> Sent: 17 November 2011 21:54
> To: fedora-commons-developers@lists.sourceforge.net Developers
> Subject: Re: [fcrepo-dev] Fedora failing under heavy load - 
> debugging the issues
> 
> 
> I agree that JTA might be the best direction, but as Steve 
> points out, that's a good ways away at best. Particularly, I 
> wouldn't like to spend too much thought on that question 
> until the work around High Level Storage is a bit further 
> along. I agree completely with Steve's suggestions for 
> logging changes, and particularly in light of this point:
> 
> > we're unable from log analysis to easily trawl log files and 
> > distinguish between eg invalid requests, invalid FOXML etc and a 
> > failure in persisting modifications to an object.
> 
> 
> Steve--
> 
> The way I read your suggestions below, you're introducing 
> something of a two-phase commit pattern for persistence, yes? 
> One question that arises in that context: what are the 
> subsystems that would need to participate in that polling? In 
> the list "(object, datastream, resource index, registry, 
> fieldSearch)" we have both instances and services... can we 
> make that a list of services only?
> 
> On the introduction of rollback handling:
> 
> > But I do wonder here whether this is straying outside of the 3.6
> > remit (and it has the risk of degrading performance potentially).
> 
> 
> I think it is. Introducing rollback handling creates, to my 
> mind, the burden of a particular explicit set of guarantees 
> to clients of the repository. That seems like a worthy goal 
> to me, but it's not one to be undertaken lightly, and the 
> point about performance is a very strong one. I would suggest 
> that we rear back from this boundary, unless we have 
> immediately and very powerfully compelling use cases. In a 
> different scope, introducing this level of transactionality 
> is definitely not in the spirit of a "feature-free" final 
> 3-series release.
> 
> ---
> A. Soroka
> Online Library Environment
> the University of Virginia Library
> 
> 
> 
> 
> On Nov 17, 2011, at 5:44 AM, Stephen Bayliss wrote:
> 
> > I wonder if JTA isn't the way to go on this.
> > 
> > But I think it would be useful to limit the scope to the objectives 
> > and goals of 3.6; and recognise that full transaction handling is 
> > something for the future; and we may not want to start down a track 
> > now that sets us on a course we may need to change in the future.
> > 
> > Around the 3.6 theme of scalability: being able to debug and 
> > understand what's going on when Fedora fails under heavy 
> > ingest/modification load with a number of threads would seem like a 
> > useful objective.
> > 
> > The current "commit" code handles exceptions/errors thus, as far as
> > I can see (at a high level):
> > 
> > 1) On ingest: if there is a failure
> > - an exception is thrown and caught
> > - tidy up by doing a purge, reporting any purge failures as warnings
> > - re-throw as ServerException
> > - reported as WARN in the logs by REST API exception handling
> > 
> > 2) On a modify
> > - an exception is thrown and caught
> > - no attempt at tidy-up
> > - re-throw ServerException
> > - reported as WARN in the logs by REST API exception handling
> > 
> > So first stage improvements could be:
> > 
> > a) For each "store" (object, datastream, resource index, registry,
> > fieldSearch) record (just a boolean flag) whether or not changes
> > were successfully made.
> > b) Revise handling of Ingest tidy-up code: Don't use the current
> > purge code, but instead explicitly tidy-up whichever of the stores
> > were updated.  Log error messages if tidy-up did not succeed.
> > c) Change error logging in the purge code to ERROR (they are WARN at
> > the moment because of the re-use of the purge code for ingest
> > tidy-up)
> > d) Ensure an explicit ERROR is logged for the ingest failure when
> > one of the stores failed to be updated (currently I think the
> > exception just bubbles up and is reported as a WARN by eg the REST
> > API exception handling).
> > 
> > This would at least give us greater insight into what's happening
> > when there is a failure under heavy ingest.
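
Points (a), (b) and (d) above could be sketched roughly as follows. This is
only an illustration under assumptions: Store, StoreOps and IngestSketch are
hypothetical names, the enum order simply follows the list in (a) rather
than the real order of operations in doCommit(), and the log is a plain list
standing in for the logger.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** The "stores" from point (a); names are illustrative only. */
enum Store { OBJECT, DATASTREAM, RESOURCE_INDEX, REGISTRY, FIELD_SEARCH }

/** Hypothetical per-store operations; not a Fedora interface. */
interface StoreOps {
    void update(Store s) throws Exception;
    void tidyUp(Store s) throws Exception;
}

final class IngestSketch {
    /** Stand-in for log output. */
    final List<String> log = new ArrayList<>();

    boolean ingest(StoreOps ops) {
        // (a) one flag per store, set only once its update has succeeded
        Set<Store> updated = EnumSet.noneOf(Store.class);
        try {
            for (Store s : Store.values()) {
                ops.update(s);
                updated.add(s);
            }
            return true;
        } catch (Exception e) {
            // (d) explicit ERROR for the ingest failure itself
            log.add("ERROR ingest failed after updating " + updated
                    + ": " + e.getMessage());
            // (b) tidy up only the stores that were actually updated
            for (Store s : updated) {
                try {
                    ops.tidyUp(s);
                } catch (Exception t) {
                    log.add("ERROR tidy-up of " + s + " failed: "
                            + t.getMessage());
                }
            }
            return false;
        }
    }
}
```

Because the tidy-up only visits stores whose flag is set, a tidy-up failure
here is unexpected and can be reported as an error rather than as a "might
be ok" warning.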
> > 
> > The next stage could be:
> > 
> > a) Revise other error logging in doCommit() so that an explicit
> > ERROR is logged if an update to a store fails.  The current
> > exception handling for the REST API catches the exceptions and logs
> > these in general as WARN; so we're unable from log analysis to
> > easily trawl log files and distinguish between eg invalid requests,
> > invalid FOXML etc and a failure in persisting modifications to an
> > object.  There is a danger here though of confusing errors from
> > invalid requests (eg deleting a datastream that doesn't exist) with
> > genuine errors from the doCommit().  We'd need a review of
> > DefaultManagement code to see when detecting an error in the
> > request is reliant on bubbling up exceptions such as
> > ObjectNotInLowLevelStorage from the commit, so we could distinguish
> > between deleting a datastream that does not exist vs deleting a
> > datastream failing because the expected file isn't present - though
> > from a quick look error handling for requests on non-existent
> > content isn't reliant on these.
> > b) Revise other error logging elsewhere to ensure we are
> > distinguishing between genuine repository errors (which should be
> > logged as ERROR) and eg requests for non-existent objects and
> > datastreams (As an example - try deleting the file from the object
> > store for a managed content datastream; and then try and delete the
> > datastream in Fedora - you'll get a 200 response with a WARN in the
> > log, with a stacktrace showing that doCommit failed).
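
One way to picture the WARN/ERROR split described in (a) and (b) is to
classify by exception type. The sketch below is an assumption throughout:
Severity, BadRequestException, StoreFailureException and SeverityPolicy are
hypothetical names and do not correspond to Fedora's ServerException
hierarchy.

```java
/** Hypothetical log levels, standing in for the logging framework's. */
enum Severity { WARN, ERROR }

/** Hypothetical: the client asked for something invalid or non-existent. */
class BadRequestException extends Exception {
    BadRequestException(String message) {
        super(message);
    }
}

/** Hypothetical: a store could not persist or remove what it should have. */
class StoreFailureException extends Exception {
    final String store;

    StoreFailureException(String store, String message, Throwable cause) {
        super(message, cause);
        this.store = store;
    }
}

final class SeverityPolicy {
    /** Only known client mistakes are WARN; anything else is ERROR. */
    static Severity severityOf(Throwable t) {
        return (t instanceof BadRequestException)
                ? Severity.WARN
                : Severity.ERROR;
    }

    /** Builds a log line that names the failing store when it is known. */
    static String format(Throwable t) {
        String where = (t instanceof StoreFailureException)
                ? " [store=" + ((StoreFailureException) t).store + "]"
                : "";
        return severityOf(t) + where + " " + t.getMessage();
    }
}
```

Treating unknown exceptions as ERROR is a deliberate choice in this sketch:
a genuine repository failure is then never hidden among the WARNs, at the
cost of some noise until request errors are reliably detected up front.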
> > 
> > And then:
> > 
> > a) Implement basic transaction/rollback handling - this should be at
> > the level of recording what changes have been made to each
> > individual store for a single request so they can be backed-out in
> > case of failure.  This would reduce instances of an inconsistent
> > repository when modifications fail.  But I do wonder here whether
> > this is straying outside of the 3.6 remit (and it has the risk of
> > degrading performance potentially).
> > 
> > Steve
> > 
> > 
> > 
> > 
> > 
> >> -----Original Message-----
> >> From: Scott Prater [mailto:pra...@wisc.edu]
> >> Sent: 17 November 2011 05:54
> >> To: fedora-commons-developers@lists.sourceforge.net
> >> Subject: Re: [fcrepo-dev] Fedora failing under heavy load - 
> >> debugging the issues
> >> 
> >> 
> >> Hmmm... doesn't that just push the problem farther
> >> downstream?  An ingest involves several changes to storage, 
> >> as Steve pointed out, and any one of them may fail, which 
> >> should invalidate the whole batch of changes.  And if the 
> >> swap fails on one or more of the changes?  We're right back 
> >> where we started.
> >> 
> >> I'm wondering if the built-in messaging framework could be
> >> used to provide a record of transactions, a record that could 
> >> be then replayed or undone by another class whose sole job is 
> >> to undo changes in a transaction.  The only change we'd need 
> >> to make to the current method DefaultDOManager.doCommit() is 
> >> to send a message after every step that alters the state of 
> >> bytes in storage.  Another class, or even external service, 
> >> could receive those messages and log them;  and another class 
> >> (or service) could unroll the transactions in the event an 
> >> error message appears, or doCommit() catches an exception.
> >> 
> >> Once the final state change has been recorded, and doCommit()
> >> is done, the transaction messages could be deleted (or not).
> >> 
> >> These messages are different from logging messages, insofar
> >> as they're simply the records of attempts at state changes 
> >> and their exit statuses, designed to be acted upon, not 
> >> necessarily read in a log.  Also, if any of the rollback 
> >> actions fail (such as a purge), the failing action could be 
> >> identified in the logs as a rollback (cleanup) action, as 
> >> opposed to a standalone purge.
> >> 
> >> -- Scott
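
The record-and-unroll idea above might look something like this in
miniature. It is a sketch under assumptions: TxJournal is a hypothetical
in-memory stand-in for whatever would receive the messages, and it does not
use the messaging framework at all.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** In-memory stand-in for a journal fed by messages; hypothetical. */
final class TxJournal {
    /** How to reverse one recorded step. */
    interface Undo {
        void run() throws Exception;
    }

    private static final class Entry {
        final String step;
        final Undo undo;

        Entry(String step, Undo undo) {
            this.step = step;
            this.undo = undo;
        }
    }

    private final Deque<Entry> entries = new ArrayDeque<>();

    /** Stand-in for log output. */
    final List<String> log = new ArrayList<>();

    /** Called after each step that alters bytes in storage. */
    void record(String step, Undo undo) {
        entries.push(new Entry(step, undo));
    }

    /** Called once the commit has finished; the records can be dropped. */
    void complete() {
        entries.clear();
    }

    /** Undoes recorded steps, newest first; returns how many undos failed. */
    int unroll() {
        int failures = 0;
        while (!entries.isEmpty()) {
            Entry e = entries.pop();
            try {
                e.undo.run();
                log.add("rollback ok: " + e.step);
            } catch (Exception ex) {
                failures++;
                // Identified as a rollback action, not a standalone purge.
                log.add("ERROR rollback (cleanup) failed: " + e.step);
            }
        }
        return failures;
    }
}
```

A caller would record() after each successful storage step, call
complete() at the end of a successful commit, and call unroll() from the
exception handler.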
> >> 
> >> 
> >> 
> >> On 11/16/11, Michael Della Bitta wrote:
> >>> 
> >>> My feeling is that a lot could be gained by writing changes to a
> >>> temp file and only swapping it into place if successful.
> >>> 
> >>> 
> >>> 
> >>> Michael
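
For a single file, the temp-file-and-swap approach can be sketched with
java.nio.file. AtomicWrite is a hypothetical helper, not part of Fedora, and
it assumes the temp file and the target live in the same directory so that
the move can be a rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: write to a temp file, then swap it into place. */
final class AtomicWrite {
    static void write(Path target, byte[] content) throws IOException {
        // Same directory as the target, so the move stays on one file system.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "commit-", ".tmp");
        try {
            Files.write(tmp, content);
            // With ATOMIC_MOVE, whether an existing target is replaced is
            // implementation specific; on POSIX file systems it is.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Removes the temp file if the write or move failed;
            // does nothing after a successful move.
            Files.deleteIfExists(tmp);
        }
    }
}
```

This protects one file from being left half-written. It does not by itself
keep several stores in step with each other, which is the concern raised in
the reply above.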
> >>> 
> >>> 
> >>> On Nov 16, 2011 3:26 PM, "Stephen Bayliss"
> >>> <stephen.bayl...@acuityunlimited.net> wrote:
> >>>
> >>>> This is following on from a conversation we had at yesterday's
> >>>> committer meeting where Eddie mentioned he had a scenario with
> >>>> some ingests failing under heavy load leading to potential
> >>>> inconsistencies eg between the registry and the object and
> >>>> datastream store.
> >>>>
> >>>> I think there are two separate types of problem here:
> >>>>
> >>>> a) being able to debug to isolate the cause of a failure
> >>>> b) being able to fix the cause of the failure itself
> >>>>
> >>>> I have had cause recently to look again at
> >>>> DefaultDOManager.doCommit(...) - which is essentially where a
> >>>> new/modified/deleted digital object is committed to storage.
> >>>>
> >>>> Some observations here are that (aside from there being a lot of
> >>>> code - 300-odd lines in a single method), the structure and error
> >>>> reporting makes it difficult to determine genuine causes of
> >>>> concern from situations like "this looks a bit odd but is
> >>>> probably ok".
> >>>>
> >>>> An example:
> >>>>
> >>>> If a commit fails (indicated by an exception being thrown), a
> >>>> tidy-up is executed by re-invoking doCommit(...) with "remove"
> >>>> set to true.  Essentially this is a purge.  If anything fails on
> >>>> the purge, a warning message is logged (including in some cases
> >>>> the text "but that might be ok").  It needs to be a warning as if
> >>>> the ingest failed part-way through then some things won't be
> >>>> there as expected to clean up.
> >>>>
> >>>> I think there's a big difference in cleaning up after a failed
> >>>> operation and doing a purge.  If a purge fails to remove
> >>>> something that was expected then that should be logged as an
> >>>> error; but as this code is used for both clean-up and purging
> >>>> it's not possible to distinguish between the two.
> >>>>
> >>>> Just one example - but it highlights (a) - being able to debug.
> >>>> The logging is not useful in terms of indicating genuine error
> >>>> conditions.
> >>>>
> >>>> I think we could do some beneficial refactoring of the existing
> >>>> code which hopefully would not risk changing existing
> >>>> functionality to better distinguish genuine error conditions.
> >>>>
> >>>> It would be useful if the various storage components
> >>>> (datastreams, foxml, resource index, registry) were wrapped
> >>>> within some basic transactioning/rollback capabilities - so that
> >>>> any cleanup code knows what it should be cleaning up (rather than
> >>>> attempting to clean up everything); and then anything that
> >>>> couldn't be cleaned up can be logged as an error.  Similarly any
> >>>> code that tries to persist a modification to one of the storage
> >>>> components but fails should be logged as an error rather than a
> >>>> warning.
> >>>>
> >>>> There is potential for making things better in this code in
> >>>> general - I can see there could be other situations leading to an
> >>>> inconsistency.  For example if managed content datastreams are
> >>>> successfully updated (or new versions added), but for instance
> >>>> the resource index update fails, then the foxml won't be updated
> >>>> and nor will the registry but the managed content updates will
> >>>> have already been persisted; so we have potential for an
> >>>> inconsistency.  Again I'm not sure that the current exception
> >>>> handling and logging is of that much help in identifying
> >>>> problematic situations.
> >>>>
> >>>> I think we should focus on doing what we can to improve (a) in
> >>>> the first instance, if possible without disturbing too much of
> >>>> the logic already embodied in this piece of code.  And to try and
> >>>> do this without negatively impacting performance given the focus
> >>>> of 3.6.
> >>>>
> >>>> Steve
> >>>>
> >> 
> >> --
> >> Scott Prater
> >> Library, Instructional, and Research Applications (LIRA)
> >> Division of Information Technology (DoIT)
> >> University of Wisconsin - Madison
> >> pra...@wisc.edu
> >> 
> >> 
> > 
> > 
> > 
> 
> 
> 


------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure 
contains a definitive record of customers, application performance, 
security threats, fraudulent activity, and more. Splunk takes this 
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Fedora-commons-developers mailing list
Fedora-commons-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
