Remarks inline.

----- "Hartmut Reuter" <[email protected]> wrote:

> Tom Keiser wrote:
> > my forthcoming paper, which I outlined in the Gerrit commentary:
> > 
> > 1) extended callbacks cannot be implemented
> 
> Why not?

I think the issue is potentially this (really two issues: one real, one apparent 
and projected onto xcb).

(I'm doing some thinking aloud here, and with imperfect knowledge of Tom's 
objections--please forgive any misstatements.)

Issue 1:  Apparent Operation Order, Conflicting Stores

Tom's concern involves the non-atomicity of dataversion increments.  As I 
suggested in my afsbpw 2009 talk, what matters is that there is a consistent, 
-apparent- operation order (here, dataversion order) shared by all parties.  I 
would expect that the fileserver (not the OSD servers) will continue to be 
responsible for extended callback delivery, and will do so for StoreData 
operations on receipt of the final store that follows the operations on the 
involved OSD(s).
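
To make the ordering point concrete, here is a minimal sketch (file_state, 
xcb_notify_storedata, and finalize_store are hypothetical names, not actual 
fileserver code) of the DV increment and xcb delivery happening as one 
serialized step at the coordinating fileserver:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-file state at the coordinating fileserver. */
    struct file_state {
        pthread_mutex_t lock;
        uint64_t dv;                  /* current dataversion of F */
    };

    /* Stand-in for extended callback delivery to interested clients. */
    static void
    xcb_notify_storedata(uint64_t new_dv, uint64_t off, uint64_t len)
    {
        printf("xcb StoreData: dv=%llu range=[%llu,+%llu)\n",
               (unsigned long long)new_dv,
               (unsigned long long)off, (unsigned long long)len);
    }

    /*
     * Called once, when the final store that follows the per-OSD writes
     * arrives at the fileserver.  The DV increment and the notification
     * happen together under the file lock, so all clients observe the
     * same apparent (dataversion) order regardless of the interleaving
     * at the OSDs.
     */
    uint64_t
    finalize_store(struct file_state *f, uint64_t off, uint64_t len)
    {
        pthread_mutex_lock(&f->lock);
        uint64_t new_dv = ++f->dv;
        xcb_notify_storedata(new_dv, off, len);
        pthread_mutex_unlock(&f->lock);
        return new_dv;
    }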

As Tom notes later, if two clients A and B were attempting to write, "nearly 
simultaneously," the same range R in the same file F, on a mirrored volume, the 
final stored state might represent partly the data stored by A, and partly the 
data stored by B--and Tom infers that this fact will not actually be known by 
the coordinating fileserver.  Each change corresponds to an increment of DV(F).  
Suppose that A and B had cached data in R.  If A and B and the server support 
extended callbacks, then A and B will each receive a sequence of two extended 
callback StoreData notifications, each carrying a new DV, the changed ranges, 
and the identity of the client that originated the change.  As noted, since the 
coordinating fileserver doesn't actually know which bytes were written in the 
interval, it cannot send the correct range invalidates to the clients.  If it 
attempts to send the range invalidates corresponding to the contributing 
changes, there is a risk that A, B, or even both will incorrectly retain stale 
data on receipt of the callback.

However.  This problem isn't insoluble.  It appears to me that it can be solved 
at two levels (i.e., potentially, solved once initially, and a better solution 
implemented later):

1. it can be solved through protocol extension (more below)
2. it can be solved through conservative inference at the coordinating 
fileserver, perhaps with very limited protocol enhancement (additional xcb flag)

Under option 2, the fileserver, knowing that F is on a mirrored OSD, infers 
that the data in all of R is potentially invalid at A and B, and therefore 
sends to both clients a single notification, from an arbitrary origin (could be 
A or B, or 0), with a flag indicating that the range is strictly invalidated, 
without consideration of origin or DV.  The new DV is the highest DV in the 
interval.  The state of any cached data of F outside R is not affected (it may 
be retained).
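
A minimal sketch of option 2, assuming a hypothetical XCB_RANGE_STRICT_INVALIDATE 
flag and notification layout (neither exists in the current protocol or draft):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical wire-level fields for an extended callback StoreData
     * notification; the strict-invalidate flag is the proposed addition,
     * not an existing protocol constant. */
    #define XCB_RANGE_STRICT_INVALIDATE  0x1

    struct xcb_storedata {
        uint64_t new_dv;      /* highest DV in the conflicting interval */
        uint64_t origin;      /* originating client, or 0 if unknown */
        uint64_t off, len;    /* range R being invalidated */
        uint32_t flags;
    };

    /* Conservative inference at the coordinating fileserver: if F lives
     * on a mirrored OSD and the fileserver cannot attribute bytes in R
     * to a single writer, send one strict invalidate covering all of R.
     * Data cached outside R is untouched. */
    static struct xcb_storedata
    make_conflict_notification(int mirrored, uint64_t highest_dv,
                               uint64_t off, uint64_t len)
    {
        struct xcb_storedata n = { highest_dv, 0, off, len, 0 };
        if (mirrored)
            n.flags |= XCB_RANGE_STRICT_INVALIDATE;
        return n;
    }

    int main(void)
    {
        struct xcb_storedata n = make_conflict_notification(1, 42, 0, 65536);
        printf("dv=%llu flags=%x\n",
               (unsigned long long)n.new_dv, (unsigned)n.flags);
        return 0;
    }

The point is only that the fileserver need not attribute bytes to a particular 
writer; it collapses the conflicting interval into one conservative invalidate.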

Issue 2:  Read Instability, Intersecting Store (Dirty Read)

Here, A is a client initiating a store on F, and B is a client interested in 
data in a range R of F, which it has not cached.  These operations are executed 
"nearly simultaneously."  In legacy AFS3, with operations coordinated at the 
store site, B's view of DV(F) when its read on R completes reflects the data 
last stably stored at that DV.  In OSD, that data may be more recent than the 
DV, and it may be inconsistent with respect to the state of subranges of R that 
reside on different OSDs.

However.  All changes are still coordinated at a single fileserver, and that 
fileserver is the one responsible for delivery of notifications, as described.  
In the scenario just sketched, B's dirty read will be -undone-, and its view of 
the correct DV(F) corrected, as soon as A's coordinating store completes.  B 
will receive an extended callback StoreData notification on the affected range 
of A's logical store.  It will be forced to re-read everything in the range it 
remains interested in.  Any data it had read originally that was inconsistent 
will be invalidated.  So this is not, in fact, a problem with extended callback 
delivery at all--it is a consistency change, discussed below.  (It is, as noted 
by Hartmut and/or Felix, one that clients can reliably contain via existing and 
proposed locking mechanisms.)
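
For illustration, a sketch of what B's cache manager might do on receipt of 
that notification (cache_chunk and xcb_storedata_handler are hypothetical 
names, not existing cache manager code):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical client-side cache chunk for file F. */
    struct cache_chunk {
        uint64_t off, len;
        uint64_t dv;         /* DV at which this chunk was read */
        int valid;
    };

    static int
    ranges_overlap(uint64_t ao, uint64_t al, uint64_t bo, uint64_t bl)
    {
        return ao < bo + bl && bo < ao + al;
    }

    /* On receipt of an extended callback StoreData notification for
     * [off, off+len) at new_dv, B discards anything it read in that
     * range, including data from a dirty read that preceded A's final
     * store.  Subsequent reads re-fetch at (or after) new_dv. */
    void
    xcb_storedata_handler(struct cache_chunk *chunks, size_t n,
                          uint64_t off, uint64_t len, uint64_t new_dv)
    {
        for (size_t i = 0; i < n; i++) {
            if (chunks[i].valid && chunks[i].dv < new_dv &&
                ranges_overlap(chunks[i].off, chunks[i].len, off, len))
                chunks[i].valid = 0;   /* force re-read of this range */
        }
    }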

Tom, can you provide additional clarification about your concern as regards 
extended callbacks?

> 
> > 2) mandatory locking semantics cannot be implemented
> 
> Why not? Once it is implemented for FetchData and StoreData the same
> code could be used in GetOSDlocation.

I believe it can be implemented, but it requires protocol enhancement, probably 
a reservation-based mechanism coordinated (initially) through the fileserver.  
Also, clearly, mandatory lock enforcement is -NOT- in AFS3 semantics, by 
definition.  Something new is being elaborated, but it has already been 
proposed for community review in all versions of the byte-range locking draft, 
which is not to say that the draft cannot be revised, if appropriate.
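
As a sketch only, assuming a hypothetical reservation table held at the 
coordinating fileserver (no such structure exists in the protocol today), the 
enforcement point might look roughly like:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical reservation record held at the coordinating
     * fileserver. */
    struct byte_range_reservation {
        uint64_t holder;      /* client that holds the reservation */
        uint64_t off, len;
        int exclusive;        /* write (mandatory) reservation */
    };

    /* Before handing out an OSD location for a write (e.g. in the
     * GetOSDlocation path), the fileserver would deny the request if
     * the range intersects another client's exclusive reservation.
     * The same check would guard FetchData/StoreData served by the
     * fileserver itself. */
    int
    reservation_permits_write(const struct byte_range_reservation *r,
                              size_t nres, uint64_t client,
                              uint64_t off, uint64_t len)
    {
        for (size_t i = 0; i < nres; i++) {
            if (r[i].exclusive && r[i].holder != client &&
                off < r[i].off + r[i].len && r[i].off < off + len)
                return 0;   /* mandatory lock held by another client */
        }
        return 1;
    }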

> 
> > 4) lack of write-atomicity entirely changes the meaning of DV in the
> protocol
> 
> If you want read and write from different clients at the same time in
> a
> way that can produce inconsistencies you should use advisory locking.
> Also without OSD an intermediate StoreData because of the cache
> becoming
> full may lead to an intermediate data version which from the
> application's point of view is inconsistent.

I think I agree more with the "then use locking" response than with the 
objection, but not unequivocally.  Tom's use of language here ("entirely 
changes the meaning") appears to me to be an attempt to nail down a meaning for 
DV that it never had--but I say this as someone who would like to incorporate 
stronger, negotiated semantics into the AFS protocol.

In my view, long term, AFS protocol extension is required, not merely to 
deliver enhanced (or reduced) semantics, but, in fact, a) well-defined, and b) 
negotiable semantics, such that clients and servers know and agree on the 
semantics that hold for operations on specific objects.  I think we must 
consider that, going forward, the capabilities of different implementations and 
provisioning choices necessarily mean coexistence of objects whose consistency 
guarantees are different.  A key objective for us, in the design of future 
protocol, should be to make this fact visible and useful in the protocol.  As I 
state again further on, I do not think that "OSDs have this fixed (reduced) 
consistency" is adequate, long term, though it seems silly to do anything but 
accept that until the alternative is available.
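
To sketch what I mean by negotiable semantics (the capability bits below are 
invented for illustration, not proposed constants):

    #include <stdint.h>

    /* Hypothetical consistency capability bits a client and server
     * might negotiate per object (or per volume); none of these are
     * existing AFS3 protocol constants. */
    #define CONSIST_ATOMIC_STORE     0x1  /* stores are all-or-nothing */
    #define CONSIST_STRICT_DV_ORDER  0x2  /* reads never see data newer
                                             than the returned DV */
    #define CONSIST_MANDATORY_LOCKS  0x4  /* byte-range locks enforced */

    /* The effective semantics for an object are the intersection of
     * what the client asks for and what the server, given where the
     * object is provisioned (e.g. on OSDs), can actually guarantee. */
    uint32_t
    negotiate_semantics(uint32_t client_wants, uint32_t server_offers)
    {
        return client_wants & server_offers;
    }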

> 
> > 5) the data "mirroring" capability is race-prone, and thus not
> consistent

What I understand this to mean is that, for example, if two clients A and B 
were attempting to write, "nearly simultaneously," the same range in the same 
file, on a mirrored volume, the final stored state might represent partly the 
data stored by A, and partly the data stored by B, and also, as noted earlier, 
that data read by B overlapping with a store by A may not reflect a consistent 
state of F when A's store completes (and during the store interval, B may have 
different data for F than do other clients that had cached data of F at the 
same DV).

It appears to me that current AFS3 implementations don't permit these scenarios 
to happen, and to that degree, they are not permitted under AFS3 semantics.  We 
could argue about whether that's true, however, even in general (and the two 
cases are distinct).  But even if we take the strict view, we have, as Hartmut 
notes, not established that the semantics of AFS3+OSD need be those of legacy 
AFS3 in all respects.

> 
> > 7) insufficient information is communicated on the wire to support
> the
> > distributed transaction management necessary to ensure atomicity of
> > various operations
> 
> I admit reading and writing data from/to OSDs is not as atomic as
> doing
> the same to a classical AFS fileserver. And I think to make it atomic
> would require substantial overhead and even more complexity. Therefore
> I
> think files stored in OSDs never should and will replace normal AFS
> files totally, but this technique should be used where the usage
> pattern
> of the files does not require such atomicity.

I do think that, relative to enhanced semantics options we will wish to support 
in future protocol revisions, there should be a fuller discussion.  Again, the 
topic feels clearly forward-looking to me--not about the current semantics of 
rxOSD.  I think that as the discussion has already hinted, the most interesting 
areas to examine first are in the direction of negotiated protocol levels, 
supporting mandatory locking, reservations, IO hints, etc.

Matt

-- 

Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309