This is following on from a conversation we had at yesterday's committer
meeting where Eddie mentioned he had a scenario with some ingests failing
under heavy load, leading to potential inconsistencies, e.g. between the registry
and the object and datastream store.
 
I think there are two separate types of problem here:
 
a) being able to debug to isolate the cause of a failure
b) being able to fix the cause of the failure itself
 
I have had cause recently to look again at DefaultDOManager.doCommit(...) -
which is essentially where a new/modified/deleted digital object is
committed to storage.
 
Some observations here are that (aside from there being a lot of code -
300-odd lines in a single method) the structure and error reporting make
it difficult to distinguish genuine causes of concern from situations like
"this looks a bit odd but is probably ok".
 
An example:
 
If a commit fails (indicated by an exception being thrown), a tidy-up is
executed by re-invoking doCommit(...) with "remove" set to true.
Essentially this is a purge.  If anything fails during the purge, a warning
message is logged (including, in some cases, the text "but that might be ok").
It needs to be a warning because, if the ingest failed part-way through, some
of the things to be cleaned up won't be there as expected.
 
I think there's a big difference between cleaning up after a failed operation
and doing a purge.  If a purge fails to remove something that was expected to
be there, that should be logged as an error; but as this code is used for both
clean-up and purging, it's not possible to distinguish between the two.
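To illustrate the distinction, here's a minimal sketch of how a single flag could separate the two cases.  The names here (ObjectStore, removeComponents) are hypothetical, not actual Fedora APIs:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch - ObjectStore and removeComponents are illustrative
// names, not actual Fedora APIs.
class RemovalSketch {
    private static final Logger LOG = Logger.getLogger("RemovalSketch");

    interface ObjectStore {
        boolean remove(String pid); // false if there was nothing to remove
    }

    /**
     * Removes an object's stored component.  isPurge distinguishes a
     * user-requested purge (the component is expected to exist, so a miss
     * is an error) from cleanup after a failed ingest (partial state is
     * expected, so a miss is only a warning).  Returns the level logged,
     * or null if the removal succeeded quietly.
     */
    static Level removeComponents(ObjectStore store, String pid, boolean isPurge) {
        if (store.remove(pid)) {
            return null; // removed as expected - nothing to report
        }
        Level level = isPurge ? Level.SEVERE : Level.WARNING;
        LOG.log(level, (isPurge
                ? "Purge: expected component missing for "
                : "Cleanup after failed ingest: component already absent (probably ok) for ")
                + pid);
        return level;
    }
}
```

The point isn't this particular shape, just that the caller's intent (purge vs cleanup) has to reach the code that decides the log level.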
 
Just one example - but it highlights (a) - being able to debug.  The logging
is not useful in terms of indicating genuine error conditions.
 
I think we could do some beneficial refactoring of the existing code to
better distinguish genuine error conditions, hopefully without risking any
change to existing functionality.
 
It would be useful if the various storage components (datastreams, foxml,
resource index, registry) were wrapped within some basic transaction/rollback
capabilities - so that any cleanup code knows exactly what it should be
cleaning up (rather than attempting to clean up everything), and anything that
couldn't be cleaned up can be logged as an error.  Similarly, any code that
tries to persist a modification to one of the storage components but fails
should log an error rather than a warning.
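As a rough illustration of the idea - a hypothetical sketch, not actual Fedora code, with invented names (Step, commit) - each storage component could register an undo action as its write succeeds, so that cleanup only touches what was actually written, and a failed undo can safely be logged as a genuine error:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch of commit-with-rollback; Step and commit are invented
// names for illustration, not part of the Fedora codebase.
class CommitSketch {
    private static final Logger LOG = Logger.getLogger("CommitSketch");

    interface Step {
        void apply() throws Exception;  // persist to one storage component
        void undo() throws Exception;   // reverse that persist
        String name();
    }

    /** Runs steps in order; on failure, undoes only the completed ones. */
    static boolean commit(Step... steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.apply();
                done.push(step); // remember only what actually succeeded
            } catch (Exception e) {
                LOG.log(Level.SEVERE, "Commit failed at " + step.name(), e);
                rollback(done);
                return false;
            }
        }
        return true;
    }

    private static void rollback(Deque<Step> done) {
        while (!done.isEmpty()) {
            Step step = done.pop(); // reverse order of completion
            try {
                step.undo();
            } catch (Exception e) {
                // This step definitely completed, so a failed undo is a
                // genuine error - not a "but that might be ok" warning.
                LOG.log(Level.SEVERE, "Rollback failed for " + step.name(), e);
            }
        }
    }
}
```

With steps for datastreams, foxml, resource index and registry, the cleanup path never has to guess what exists - it only undoes what it recorded.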
 
There is potential for improving this code in general - I can see other
situations that could lead to an inconsistency.  For example, if managed
content datastreams are successfully updated (or new versions added) but the
resource index update then fails, neither the foxml nor the registry will be
updated even though the managed content updates have already been persisted -
so we have a potential inconsistency.  Again, I'm not sure the current
exception handling and logging is of much help in identifying problematic
situations.
 
I think we should focus on doing what we can to improve (a) in the first
instance, if possible without disturbing too much of the logic already
embodied in this piece of code - and try to do this without negatively
impacting performance, given the focus of 3.6.
 
Steve
 
_______________________________________________
Fedora-commons-developers mailing list
Fedora-commons-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers
