Re: [fcrepo-dev] Fedora failing under heavy load - debugging the issues

Stephen Bayliss Thu, 17 Nov 2011 02:43:06 -0800

I wonder if JTA isn't the way to go on this.

But I think it would be useful to limit the scope to the objectives and
goals of 3.6, and recognise that full transaction handling is something for
the future; we may not want to start down a track now that commits us to a
direction we may need to change later.


Around the 3.6 theme of scalability: being able to debug and understand
what's going on when Fedora fails under heavy ingest/modification load with
a number of threads would seem like a useful objective.

The current "commit" code handles exceptions/errors thus, as far as I can
see (at a high level):

1) On ingest: if there is a failure
- an exception is thrown and caught
- tidy up by doing a purge, reporting any purge failures as warnings
- re-throw as ServerException
- reported as WARN in the logs by REST API exception handling

2) On a modify
- an exception is thrown and caught
- no attempt at tidy-up
- re-throw ServerException
- reported as WARN in the logs by REST API exception handling

So first stage improvements could be:

a) For each "store" (object, datastream, resource index, registry,
fieldSearch) record (just a boolean flag) whether or not changes were
successfully made.
b) Revise handling of the ingest tidy-up code: don't use the current purge
code, but instead explicitly tidy up whichever of the stores were updated.
Log error messages if tidy-up did not succeed.
c) Change error logging in the purge code to ERROR (the messages are WARN at
the moment because of the re-use of the purge code for ingest tidy-up)
d) Ensure an explicit ERROR is logged for the ingest failure when one of the
stores failed to be updated (currently I think the exception just bubbles up
and is reported as a WARN by eg the REST API exception handling).

This would at least give us greater insight into what's happening when there
is a failure under heavy ingest.
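
The first-stage steps (a) to (d) could be sketched roughly as below.  This is
only an illustration: every name in it (IngestTidyUpSketch, Store, ingest, LOG)
is hypothetical and not taken from the Fedora code base, LOG is a stand-in for
a real logger, and the order in which the stores are updated is an assumption.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Fedora. */
final class IngestTidyUpSketch {

    /** The five stores named in (a); the update order here is assumed. */
    enum Store { OBJECT, DATASTREAM, RESOURCE_INDEX, REGISTRY, FIELD_SEARCH }

    /** Stand-in for a real logger: collects "LEVEL message" lines. */
    static final List<String> LOG = new ArrayList<>();

    /**
     * Updates each store in turn. On failure, tidies up only the stores that
     * were actually updated, logs at ERROR, and returns false.
     */
    static boolean ingest(String pid, Consumer<Store> update, Consumer<Store> tidyUp) {
        Set<Store> updated = EnumSet.noneOf(Store.class);
        try {
            for (Store store : Store.values()) {
                update.accept(store);   // may throw
                updated.add(store);     // flag is set only after success
            }
            return true;
        } catch (RuntimeException e) {
            // (d) explicit ERROR for the ingest failure itself
            LOG.add("ERROR ingest of " + pid + " failed: " + e.getMessage());
            for (Store store : updated) {
                try {
                    tidyUp.accept(store);   // (b) explicit tidy-up, not a purge
                } catch (RuntimeException tidyFailure) {
                    LOG.add("ERROR tidy-up of " + store + " failed for " + pid);
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        List<Store> tidied = new ArrayList<>();
        boolean ok = ingest("demo:1",
                store -> {
                    if (store == Store.RESOURCE_INDEX) {
                        throw new IllegalStateException("resource index unavailable");
                    }
                },
                tidied::add);
        System.out.println("ok=" + ok + " tidied=" + tidied + " log=" + LOG);
    }
}
```

Because the flag for a store is only set after its update succeeds, the tidy-up
never touches a store that was not changed, so any tidy-up failure can be
treated as a genuine error rather than a "might be ok" warning.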

The next stage could be:

a) Revise other error logging in doCommit() so that an explicit ERROR is
logged if an update to a store fails.  The current exception handling for
the REST API catches the exceptions and in general logs them as WARN, so
we can't easily trawl the log files and distinguish between eg invalid
requests, invalid FOXML etc and a failure in persisting modifications to an
object.  There is a danger here, though, of confusing errors from invalid
requests (eg deleting a datastream that doesn't exist) with genuine errors
from doCommit().  We'd need to review the DefaultManagement code to see
where detecting an error in the request relies on exceptions such as
ObjectNotInLowLevelStorage bubbling up from the commit, so that we can
distinguish deleting a datastream that does not exist from a delete that
fails because the expected file isn't present.  From a quick look, though,
error handling for requests on non-existent content doesn't rely on these.
b) Revise other error logging elsewhere to ensure we are distinguishing
between genuine repository errors (which should be logged as ERROR) and eg
requests for non-existent objects and datastreams (As an example - try
deleting the file from the object store for a managed content datastream;
and then try and delete the datastream in Fedora - you'll get a 200 response
with a WARN in the log, with a stacktrace showing that doCommit failed).
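
One way to picture the WARN/ERROR split described above is to decide the log
level from the type of exception rather than from where it was caught.  The
sketch below is illustrative only: LogSeveritySketch, Severity,
InvalidRequestException and StoreFailureException are all hypothetical names,
not existing Fedora classes.

```java
/** Hypothetical sketch; none of these names exist in Fedora. */
final class LogSeveritySketch {

    enum Severity { WARN, ERROR }

    /** Hypothetical: the request itself was wrong, eg it named an unknown datastream. */
    static class InvalidRequestException extends Exception {
        InvalidRequestException(String message) {
            super(message);
        }
    }

    /** Hypothetical: a store could not be read or updated while committing. */
    static class StoreFailureException extends Exception {
        StoreFailureException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** ERROR if a store failure appears anywhere in the cause chain, else WARN. */
    static Severity severityOf(Throwable thrown) {
        for (Throwable t = thrown; t != null; t = t.getCause()) {
            if (t instanceof StoreFailureException) {
                return Severity.ERROR;
            }
        }
        return Severity.WARN;
    }

    public static void main(String[] args) {
        Throwable badRequest = new InvalidRequestException("no such datastream: DS9");
        Throwable lostFile = new Exception("delete failed",
                new StoreFailureException("file missing from object store", null));
        System.out.println(severityOf(badRequest) + " " + severityOf(lostFile));
    }
}
```

Walking the cause chain means a store failure is still reported as an ERROR
even after it has been wrapped and re-thrown on its way up to the REST API
exception handling.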

And then:

a) Implement basic transaction/rollback handling - this should be at the
level of recording what changes have been made to each individual store for
a single request so they can be backed-out in case of failure.  This would
reduce instances of an inconsistent repository when modifications fail.  But
I do wonder here whether this is straying outside of the 3.6 remit (and it
risks degrading performance).
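
As a minimal sketch of what "recording what changes have been made" might look
like, each successful change could register an undo action, and a failure would
run those actions newest-first.  RollbackSketch and ChangeLog are hypothetical
names, and the two in-memory lists merely stand in for real stores.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch; none of these names exist in Fedora. */
final class RollbackSketch {

    /** Records one undo action per completed change for a single request. */
    static final class ChangeLog {
        private final Deque<Runnable> undoActions = new ArrayDeque<>();

        /** Call after a change has succeeded, passing the action that reverses it. */
        void record(Runnable undo) {
            undoActions.push(undo);
        }

        /** Runs the undo actions newest-first; returns how many of them failed. */
        int rollback() {
            int failures = 0;
            while (!undoActions.isEmpty()) {
                try {
                    undoActions.pop().run();
                } catch (RuntimeException e) {
                    failures++;   // a real implementation would log this as ERROR
                }
            }
            return failures;
        }
    }

    public static void main(String[] args) {
        // In-memory lists standing in for two of the stores.
        List<String> datastreamStore = new ArrayList<>();
        List<String> registry = new ArrayList<>();
        ChangeLog changes = new ChangeLog();
        try {
            datastreamStore.add("demo:1+DS1");
            changes.record(() -> datastreamStore.remove("demo:1+DS1"));
            registry.add("demo:1");
            changes.record(() -> registry.remove("demo:1"));
            throw new IllegalStateException("resource index update failed");
        } catch (IllegalStateException e) {
            int failures = changes.rollback();
            System.out.println("rollback failures=" + failures
                    + " datastreamStore=" + datastreamStore + " registry=" + registry);
        }
    }
}
```

The performance concern applies here too: the bookkeeping itself is cheap, but
each undo action needs enough information to reverse its change, which for a
modification may mean keeping the previous version around until the request
completes.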

Steve





> -----Original Message-----
> From: Scott Prater [mailto:pra...@wisc.edu] 
> Sent: 17 November 2011 05:54
> To: fedora-commons-developers@lists.sourceforge.net
> Subject: Re: [fcrepo-dev] Fedora failing under heavy load - 
> debugging the issues
> 
> 
> Hmmm... doesn't that just push the problem farther 
> downstream?  An ingest involves several changes to storage, 
> as Steve pointed out, and any one of them may fail, which 
> should invalidate the whole batch of changes.  And if the 
> swap fails on one or more of the changes?  We're right back 
> where we started.
> 
> I'm wondering if the built-in messaging framework could be 
> used to provide a record of transactions, a record that could 
> be then replayed or undone by another class whose sole job is 
> to undo changes in a transaction.  The only change we'd need 
> to make to the current method DefaultDOManager.doCommit() is 
> to send a message after every step that alters the state of 
> bytes in storage.  Another class, or even external service, 
> could receive those messages and log them;  and another class 
> (or service) could unroll the transactions in the event an 
> error message appears, or doCommit() catches an exception.
> 
> Once the final state change has been recorded, and doCommit() 
> is done, the transaction messages could be deleted (or not).
> 
> These messages are different from logging messages, insofar 
> as they're simply the records of attempts at state changes 
> and their exit statuses, designed to be acted upon, not 
> necessarily read in a log.  Also, if any of the rollback 
> actions fail (such as a purge), the failing action could be 
> identified in the logs as a rollback (cleanup) action, as 
> opposed to a standalone purge.
> 
> -- Scott
> 
> 
> 
> On 11/16/11, Michael Della Bitta wrote:
> > 
> > My feeling is that a lot could be gained by writing changes to a temp
> > file and only swapping it into place if successful.
> > 
> > 
> > 
> > Michael
> > 
> > 
> > On Nov 16, 2011 3:26 PM, "Stephen Bayliss"
> > <stephen.bayl...@acuityunlimited.net> wrote:
> > 
> > >
> > > This is following on from a conversation we had at yesterday's
> > > committer meeting where Eddie mentioned he had a scenario with some
> > > ingests failing under heavy load leading to potential inconsistencies
> > > eg between the registry and the object and datastream store.
> > >
> > > I think there are two separate types of problem here:
> > >
> > > a) being able to debug to isolate the cause of a failure
> > > b) being able to fix the cause of the failure itself
> > >
> > > I have had cause recently to look again at
> > > DefaultDOManager.doCommit(...) - which is essentially where a
> > > new/modified/deleted digital object is committed to storage.
> > >
> > > Some observations here are that (aside from there being a lot of
> > > code - 300-odd lines in a single method), the structure and error
> > > reporting make it difficult to distinguish genuine causes of concern
> > > from situations like "this looks a bit odd but is probably ok".
> > >
> > > An example:
> > >
> > > If a commit fails (indicated by an exception being thrown), a tidy-up
> > > is executed by re-invoking doCommit(...) with "remove" set to true.
> > > Essentially this is a purge.  If anything fails on the purge, a
> > > warning message is logged (including in some cases the text "but that
> > > might be ok").  It needs to be a warning because, if the ingest failed
> > > part-way through, some things won't be there as expected to clean up.
> > >
> > > I think there's a big difference between cleaning up after a failed
> > > operation and doing a purge.  If a purge fails to remove something
> > > that was expected then that should be logged as an error; but as this
> > > code is used for both clean-up and purging it's not possible to
> > > distinguish between the two.
> > >
> > > Just one example - but it highlights (a) - being able to debug.  The
> > > logging is not useful in terms of indicating genuine error conditions.
> > >
> > > I think we could do some beneficial refactoring of the existing code
> > > to better distinguish genuine error conditions, hopefully without
> > > risking changes to existing functionality.
> > >
> > > It would be useful if the various storage components (datastreams,
> > > foxml, resource index, registry) were wrapped within some basic
> > > transactioning/rollback capabilities - so that any cleanup code knows
> > > what it should be cleaning up (rather than attempting to clean up
> > > everything); and then anything that couldn't be cleaned up can be
> > > logged as an error.  Similarly any code that tries to persist a
> > > modification to one of the storage components but fails should be
> > > logged as an error rather than a warning.
> > >
> > > There is potential for making things better in this code in general -
> > > I can see there could be other situations leading to an
> > > inconsistency.  For example if managed content datastreams are
> > > successfully updated (or new versions added), but for instance the
> > > resource index update fails, then the foxml won't be updated and nor
> > > will the registry but the managed content updates will have already
> > > been persisted; so we have potential for an inconsistency.  Again I'm
> > > not sure that the current exception handling and logging is of that
> > > much help in identifying problematic situations.
> > >
> > > I think we should focus on doing what we can to improve (a) in the
> > > first instance, if possible without disturbing too much of the logic
> > > already embodied in this piece of code.  And to try and do this
> > > without negatively impacting performance given the focus of 3.6.
> > >
> > > Steve
> > >
> 
> --
> -- 
> Scott Prater
> Library, Instructional, and Research Applications (LIRA) 
> Division of Information Technology (DoIT) University of 
> Wisconsin - Madison pra...@wisc.edu


------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure 
contains a definitive record of customers, application performance, 
security threats, fraudulent activity, and more. Splunk takes this 
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
_______________________________________________
Fedora-commons-developers mailing list
Fedora-commons-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
