Re: SVN-4877 FSFS commit failure should release txn proto-rev lock

Julian Foad Tue, 29 Jun 2021 03:33:45 -0700

Daniel Shahaf wrote:
> Julian Foad wrote on Fri, Jun 25, 2021 at 14:27:30 +0100:
> > Now filed as: https://issues.apache.org/jira/browse/SVN-4877
> 
> "Now"?  The issue is a month old.


You're absolutely right!  That's when I was made aware this issue had come up 
again.  First I noticed we had not previously filed an issue, so I started by 
filing it as #4877, and started drafting this email.  With life getting in the 
way, it has taken me until now (ahem, I mean June 25th) to re-check everything 
and put this email together with the clearest explanation I could.

Thank you for taking another look at this.

There could be two angles to view this issue. One angle is looking at the 
specifics of why WD's replicator wants different behaviour, which I will go 
into further below.

Another angle could be looking from a software design point of view and saying 
of course code like this should clean up like this. I am hopeful that someone 
may be able to see it from this angle and say of course this change is right in 
principle, and maybe point out if I got the specifics right or missed something.


[...]

> Okay, so it fails with a wrong error message.

In a case where the underlying cause of the error has remained the same, that 
would be a correct statement; in a case where the underlying cause was 
transient and has gone away, it could have succeeded on re-try, and the problem 
is it is failing where it should succeed.  Both of these cases are problematic, 
as explained further below.

> > Subversion itself doesn't care. It never re-tries at this level. It always 
> > discards the transaction if an error is returned.
> 
> Even under mod_dav_svn when the client is a generic DAV client?

Yes: in dav_svn__checkin(), if svn_repos_fs_commit_txn() fails before 
finalizing the commit, then -> dav_svn__abort_txn -> svn_fs_abort_txn.


> > DESIRED BEHAVIOUR
> > 
> > The commit re-try should fail in exactly the same way as the first try, if 
> > the underlying cause of the error persists.
> > 
> > Or, if the underlying cause of the error was transient and has gone away, 
> > then the commit should succeed.
> 
> I think it's possible that the second try would legitimately fail with
> a different error message than the first try.

Yes, that remains a third possibility, where the underlying cause of the error 
changed. The problematic cases are the other two.

In the case of a persistent underlying error condition, as best I understand, 
the concern is that the replicator should be able to report the error condition 
consistently. Why does it matter? Remember that the client (of fsfs) in this 
case is managing replicated repositories. From one high-level commit operation, 
it creates several FSFS commit operations on the several replicas. When these 
replicas return errors, it needs to be able to consolidate these errors into a 
single result to return to the higher level svn code. If all the replicas have 
the same underlying cause of error, and they all return the same error code, 
the replicator can consolidate those into a single result and return that to 
the higher level caller which will then do what Subversion does, and that's 
fine. But if some replicas return different error codes from others, the 
replicator has to assume that replication has failed and the repositories are 
out of sync in some way, and that is a "hard error" condition requiring manual 
intervention. (Something like that; I don't know all the details.)

On top of that, before assuming that an error is persistent, the replicator 
needs be able to re-try operations, in general and at will, because at various 
levels there can be temporary conditions that are local to one replica which 
may go away on re-try and end up successful. Although we are focusing on the 
internals of a FSFS API here, in general these re-try scenarios extend to 
things like re-trying if the operation is interrupted by the machine crashing 
and restarting. The operations used need to be idempotent to guarantee this 
kind of re-trying and recovery.

So, if a FSFS operation returns an error in one replica, the replicator first 
re-tries on that replica in case it was a local and temporary error. If the 
re-try also fails, then it is time to report the error, and the replicator will 
compare this with the error (or success) result that the other replicas have 
encountered, to decide whether it is a consistent response that should be 
handed back to the higher level caller, or if it is inconsistent then the 
replicas have gone out of sync and manual intervention is necessary to repair 
the the system.

Bear in mind that the replicator does not know about the specifics of all the 
possible svn error codes, it needs to apply this logic uniformly to all error 
responses (and success responses) from all APIs.

> Thus, it's not clear what the desired _change_ in behaviour is.

Is that clearer now?

> > PROPOSED FIX
> > 
> > My proposed fix is to unlock the lock when exiting that code section, no 
> > matter whether successful or throwing an error.
> > 
> > Pseudo-Python:
> > 
> >     try:
> >         1. Get a write handle on the proto revision file.
> >         2. Complete the construction of the proto-rev file.
> >         3. Move the finished rev file into place.
> >     finally:
> >         4. Unlock the proto-rev file.
> >     5. Continue with the rest of the commit finalization.
> > 
> 
> Why would this be correct?  Why would this have no downside?

Would it be easier to see from this angle: that would make the behaviour for a 
re-try within one calling process equivalent to the existing behaviour that 
would occur if the FSFS machine restarts and a new calling process re-tries.

-- 
- Julian

Re: SVN-4877 FSFS commit failure should release txn proto-rev lock

Reply via email to