Daniel Shahaf wrote: > Julian Foad wrote on Fri, Jun 25, 2021 at 14:27:30 +0100: > > Now filed as: https://issues.apache.org/jira/browse/SVN-4877 > > "Now"? The issue is a month old.
You're absolutely right! That's when I was made aware this issue had come up again. First I noticed we had not previously filed an issue, so I started by filing it as #4877, and started drafting this email. With life getting in the way, it has taken me until now (ahem, I mean June 25th) to re-check everything and put this email together with the clearest explanation I could. Thank you for taking another look at this. There could be two angles to view this issue. One angle is looking at the specifics of why WD's replicator wants different behaviour, which I will go into further below. Another angle could be looking from a software design point of view and saying of course code like this should clean up like this. I am hopeful that someone may be able to see it from this angle and say of course this change is right in principle, and maybe point out if I got the specifics right or missed something. [...] > Okay, so it fails with a wrong error message. In a case where the underlying cause of the error has remained the same, that would be a correct statement; in a case where the underlying cause was transient and has gone away, it could have succeeded on re-try, and the problem is it is failing where it should succeed. Both of these cases are problematic, as explained further below. > > Subversion itself doesn't care. It never re-tries at this level. It always > > discards the transaction if an error is returned. > > Even under mod_dav_svn when the client is a generic DAV client? Yes: in dav_svn__checkin(), if svn_repos_fs_commit_txn() fails before finalizing the commit, then -> dav_svn__abort_txn -> svn_fs_abort_txn. > > DESIRED BEHAVIOUR > > > > The commit re-try should fail in exactly the same way as the first try, if > > the underlying cause of the error persists. > > > > Or, if the underlying cause of the error was transient and has gone away, > > then the commit should succeed. > > I think it's possible that the second try would legitimately fail with > a different error message than the first try. Yes, that remains a third possibility, where the underlying cause of the error changed. The problematic cases are the other two. In the case of a persistent underlying error condition, as best I understand, the concern is that the replicator should be able to report the error condition consistently. Why does it matter? Remember that the client (of fsfs) in this case is managing replicated repositories. From one high-level commit operation, it creates several FSFS commit operations on the several replicas. When these replicas return errors, it needs to be able to consolidate these errors into a single result to return to the higher level svn code. If all the replicas have the same underlying cause of error, and they all return the same error code, the replicator can consolidate those into a single result and return that to the higher level caller which will then do what Subversion does, and that's fine. But if some replicas return different error codes from others, the replicator has to assume that replication has failed and the repositories are out of sync in some way, and that is a "hard error" condition requiring manual intervention. (Something like that; I don't know all the details.) On top of that, before assuming that an error is persistent, the replicator needs be able to re-try operations, in general and at will, because at various levels there can be temporary conditions that are local to one replica which may go away on re-try and end up successful. Although we are focusing on the internals of a FSFS API here, in general these re-try scenarios extend to things like re-trying if the operation is interrupted by the machine crashing and restarting. The operations used need to be idempotent to guarantee this kind of re-trying and recovery. So, if a FSFS operation returns an error in one replica, the replicator first re-tries on that replica in case it was a local and temporary error. If the re-try also fails, then it is time to report the error, and the replicator will compare this with the error (or success) result that the other replicas have encountered, to decide whether it is a consistent response that should be handed back to the higher level caller, or if it is inconsistent then the replicas have gone out of sync and manual intervention is necessary to repair the the system. Bear in mind that the replicator does not know about the specifics of all the possible svn error codes, it needs to apply this logic uniformly to all error responses (and success responses) from all APIs. > Thus, it's not clear what the desired _change_ in behaviour is. Is that clearer now? > > PROPOSED FIX > > > > My proposed fix is to unlock the lock when exiting that code section, no > > matter whether successful or throwing an error. > > > > Pseudo-Python: > > > > try: > > 1. Get a write handle on the proto revision file. > > 2. Complete the construction of the proto-rev file. > > 3. Move the finished rev file into place. > > finally: > > 4. Unlock the proto-rev file. > > 5. Continue with the rest of the commit finalization. > > > > Why would this be correct? Why would this have no downside? Would it be easier to see from this angle: that would make the behaviour for a re-try within one calling process equivalent to the existing behaviour that would occur if the FSFS machine restarts and a new calling process re-tries. -- - Julian