Remarks inline.
----- "Hartmut Reuter" <[email protected]> wrote:
> Tom Keiser wrote:
> > my forthcoming paper, which I outlined in the Gerrit commentary:
> >
> > 1) extended callbacks cannot be implemented
>
> Why not?
I think the issue is potentially this (two issues, really: one real, one
apparent and projected onto xcb).
(I'm thinking aloud here, with imperfect knowledge of Tom's
objections--please forgive any misstatements.)
Issue 1: Apparent Operation Order, Conflicting Stores
Tom's concern involves non-atomicity of dataversion increments. As I suggested
in my afsbpw 2009 talk, what matters is that there is a consistent, -apparent-
operation order (here dataversion order) shared by all parties. I would expect
that the fileserver (not OSD servers) will continue to be responsible for
extended callback delivery, and will do so for StoreData operations on receipt
of the final store, which follows the operations on the involved OSD(s).
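To make the ordering point concrete, here is a minimal C sketch of what I
mean (all names are mine and purely illustrative; this is not fileserver
internals or proposed code):

#include <stdint.h>

/* Hypothetical per-file metadata at the coordinating fileserver. */
struct file_meta {
    uint64_t dv;    /* dataversion of F */
};

/*
 * The DV increment, and the fan-out of extended callback StoreData
 * notifications, happen only when the final, coordinating store arrives
 * at the fileserver, after the stores to the involved OSD(s) have
 * completed.  That single point is what defines the apparent
 * (dataversion) order shared by all parties.
 */
static uint64_t
finalize_store(struct file_meta *f, int osd_stores_succeeded)
{
    if (!osd_stores_succeeded)
        return f->dv;       /* no DV change, no notification */
    f->dv++;                /* the one place where DV order is defined */
    /* ...fan out extended callback StoreData notifications here... */
    return f->dv;
}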
As Tom notes later, if two clients A and B were attempting to write, "nearly
simultaneously," the same range R in the same file F, on a mirrored volume, the
final stored state might represent partly the data stored by A, and partly the
data stored by B--this fact will, Tom is inferring, not actually be known by
the coordinating fileserver. Each change corresponds to an increment of
DV(F). Consider that A and B had cached data in R. If A and B and the server
support extended callbacks, then A and B will each receive a sequence of two
extended callback StoreData notifications each with new DVs, the changed ranges
listed, identified by the client that originated the change. As noted, since
the coordinating fileserver doesn't actually know what bytes were written in
the interval, it cannot send the correct range invalidates to the clients. If
it attempts to send the range invalidates corresponding to the contributing
changes, there is a risk that either A, B, or even A and B will incorrectly
retain stale data on receipt of the callback.
However. This problem isn't insoluble. It appears to me that it can be solved
at two levels (i.e., potentially, solved once initially, and a better solution
implemented later):
1. it can be solved through protocol extension (more below)
2. it can be solved through conservative inference at the coordinating
fileserver, perhaps with very limited protocol enhancement (additional xcb flag)
In 2, the fileserver, knowing that F is on a mirrored OSD, infers that the data
in all of R is potentially invalid at A and B, and therefore, sends to both
clients a single notification, from arbitrary origin (could be A or B, or 0),
with a flag indicating that the range is strictly invalidated, without
consideration of origin or DV. The new DV is the highest DV in the interval.
The state of potentially cached data of F outside R is not affected (may be
retained).
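To make 2 concrete, a rough C sketch of the conservative notification (the
struct, flag, and function names are invented for illustration; none of this
is existing OpenAFS code or an actual protocol definition):

#include <stdint.h>

/* Hypothetical flag: discard cached data in the range regardless of
 * which client originated it or what DV the cache holds. */
#define XCB_RANGE_STRICT_INVALIDATE 0x1

/* Hypothetical extended callback StoreData notification body. */
struct xcb_storedata_notify {
    uint32_t flags;          /* e.g. XCB_RANGE_STRICT_INVALIDATE */
    uint32_t origin_client;  /* 0 when the origin cannot be attributed */
    uint64_t new_dv;         /* highest DV in the interval */
    uint64_t range_offset;   /* R */
    uint64_t range_length;
};

/*
 * F is on a mirrored OSD, so the fileserver cannot attribute the bytes
 * in R to A or B.  It conservatively invalidates all of R at both
 * clients; cached data of F outside R is unaffected.
 */
static void
xcb_notify_mirrored_store(struct xcb_storedata_notify *n,
                          uint64_t off, uint64_t len, uint64_t highest_dv)
{
    n->flags = XCB_RANGE_STRICT_INVALIDATE;
    n->origin_client = 0;       /* arbitrary / unattributed origin */
    n->new_dv = highest_dv;
    n->range_offset = off;
    n->range_length = len;
}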
Issue 2: Read Instability, Intersecting Store (Dirty Read)
Here, A is a client initiating a store on F, and B is a client interested in
data in a range R on a file F, which it has not cached. These operations are
executed "nearly simultaneously." In legacy AFS3, with operations coordinated
at the store site, the B's view of DV(F) when its read on R completes, is the
data last stably stored at DV. In OSD, that data may more recent than the DV.
That data may be inconsistent with respect to the state of subranges of R that
may be on different OSDs.
However. All changes are still coordinated on a single fileserver, and that
fileserver is the one responsible for delivery of notifications, as described.
In the scenario just sketched, B's dirty read will be -undone-, and its view
of the correct DV(F) corrected, as soon as A's coordinating store completes. B
will receive an extended callback StoreData notification on the affected range
of A's logical store. It will be forced to re-read everything in the range it
remains interested in. Any data it had read originally that was inconsistent
will be invalidated. So this is not, in fact, a problem with extended callback
delivery at all--it is a consistency change, discussed below. (It is, as noted
by Hartmut and/or Felix, one that clients can reliably contain via existing and
proposed locking mechanisms.)
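For illustration, B's handling of that notification on the cache manager side
could look roughly like this (invented names and a deliberately naive cache
model, not the actual OpenAFS cache manager structures):

#include <stdint.h>

/* Hypothetical record of one cached chunk of F. */
struct cached_chunk {
    uint64_t offset;
    uint64_t length;
    uint64_t dv;        /* DV at which this chunk was fetched */
    int      valid;
};

/*
 * On receipt of an extended callback StoreData notification for
 * [off, off + len), drop any overlapping cached data and correct the
 * client's view of DV(F).  Anything read "dirty" during A's store
 * interval is invalidated and will be re-fetched on next access.
 */
static void
xcb_handle_storedata(struct cached_chunk *chunks, int nchunks,
                     uint64_t off, uint64_t len, uint64_t new_dv,
                     uint64_t *file_dv)
{
    int i;
    for (i = 0; i < nchunks; i++) {
        if (chunks[i].offset < off + len &&
            off < chunks[i].offset + chunks[i].length)
            chunks[i].valid = 0;    /* force a re-read of this chunk */
    }
    if (new_dv > *file_dv)
        *file_dv = new_dv;          /* B's view of DV(F) is corrected */
}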
Tom, can you provide additional clarification about your concern as regards
extended callbacks?
>
> > 2) mandatory locking semantics cannot be implemented
>
> Why not? Once it is implemented for FetchData and StoreData the same
> code could be used in GetOSDlocation.
I believe it can be implemented, but it requires protocol enhancement, probably
a reservation-based mechanism coordinated (initially) through the fileserver.
Also, clearly, mandatory lock enforcement is -NOT- in AFS3 semantics, by
definition. Something new is being elaborated, but it has already been
proposed for community review in all versions of the byte-range locking draft.
Which is not to say that draft cannot be revised, if appropriate.
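By way of illustration only, the reservation check at the coordinating
fileserver might look something like this (hypothetical structures and names,
not a proposal for the actual RPC or on-the-wire layout):

#include <stdint.h>

/* Hypothetical byte-range reservation held by the fileserver. */
struct byte_range_reservation {
    uint32_t holder;     /* client holding the reservation */
    uint64_t offset;
    uint64_t length;
    int      exclusive;  /* mandatory write reservation */
};

/*
 * Before answering GetOSDlocation (or FetchData/StoreData), the
 * fileserver checks the requested range against outstanding
 * reservations and refuses conflicting access.
 */
static int
reservation_permits(const struct byte_range_reservation *r, int nres,
                    uint32_t client, uint64_t off, uint64_t len,
                    int for_write)
{
    int i;
    for (i = 0; i < nres; i++) {
        int overlap = r[i].offset < off + len &&
                      off < r[i].offset + r[i].length;
        if (!overlap || r[i].holder == client)
            continue;
        if (r[i].exclusive || for_write)
            return 0;    /* conflicting mandatory reservation: deny */
    }
    return 1;
}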
>
> > 4) lack of write-atomicity entirely changes the meaning of DV in the
> protocol
>
> If you want read and write from different clients at the same time in a
> way that can produce inconsistencies you should use advisory locking.
> Also without OSD an intermediate StoreData because of the cache becoming
> full may lead to an intermediate data version which from the
> application's point of view is inconsistent.
I think I agree more with the "then use locking" response than with the
objection, but not unequivocally. Tom's use of language here ("entirely
changes the meaning") appears to me to be an attempt to nail down a meaning for
DV that it never had--but I say this as someone who would like to incorporate
stronger, negotiated semantics in the AFS protocol.
In my view, long term, AFS protocol extension is required, not merely to
deliver enhanced (or reduced) semantics, but to deliver a) well-defined and b)
negotiable semantics, such that clients and servers know and agree on the
semantics that hold for operations on specific objects. I think we must
consider that, going forward, the capabilities of different implementations and
provisioning choices necessarily mean coexistence of objects whose consistency
guarantees are different. A key objective for us, in design of future
protocol, should be to make this fact visible and useful in the protocol. As I
state again further on, I do not think that "OSDs have this fixed (reduced)
consistency" is adequate, long term, though it seems silly to do anything but
accept that until the alternative is available.
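To illustrate what I mean by negotiable semantics, in the crudest possible
terms (invented flag names, purely a sketch of the idea, not a protocol
proposal):

#include <stdint.h>

/* Hypothetical per-object consistency capabilities. */
#define SEM_ATOMIC_STORE     0x01  /* stores visible atomically at one DV */
#define SEM_RANGE_CALLBACKS  0x02  /* extended (byte-range) callbacks */
#define SEM_MANDATORY_LOCKS  0x04  /* server-enforced reservations */

struct object_semantics {
    uint32_t offered;   /* what the server can provide for this object */
    uint32_t required;  /* what the client insists on */
};

/*
 * The negotiated level is the intersection.  A client that requires
 * more than the server offers for a given object knows it up front,
 * and can fall back (e.g. to advisory locking) or decline to cache,
 * rather than silently assuming guarantees that do not hold.
 */
static uint32_t
negotiate_semantics(const struct object_semantics *s)
{
    return s->offered & s->required;
}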
>
> > 5) the data "mirroring" capability is race-prone, and thus not
> consistent
What I understand this to mean is that, for example, if two clients A and B
were attempting to write, "nearly simultaneously," the same range in the same
file, on a mirrored volume, the final stored state might represent partly the
data stored by A, and partly the data stored by B, and also, as noted earlier,
that data read by B overlapping with a store by A may not reflect a consistent
state of F when A's store completes (and during the store interval, B may have
different data for F than do other clients that had cached data of F at the
same DV).
It appears to me that current AFS3 implementations don't permit these
scenarios to happen, and to that degree, they are not permitted under AFS3
semantics. We could argue about whether that's true, however, even in general
(and the two cases are distinct). But even if we take the strict view, we
have, as Hartmut notes, not established that the semantics of AFS3+OSD need be
those of legacy AFS3 in all respects.
>
> > 7) insufficient information is communicated on the wire to support the
> > distributed transaction management necessary to ensure atomicity of
> > various operations
>
> I admit reading and writing data from/to OSDs is not as atomic as doing
> the same to a classical AFS fileserver. And I think to make it atomic
> would require substantial overhead and even more complexity. Therefore I
> think files stored in OSDs never should and will replace normal AFS files
> totally, but this technique should be used where the usage pattern of the
> files does not require such atomicity.
I do think that, relative to enhanced semantics options we will wish to support
in future protocol revisions, there should be a fuller discussion. Again, the
topic feels clearly forward-looking to me--not about the current semantics of
rxOSD. I think that, as the discussion has already hinted, the most interesting
areas to examine first are in the direction of negotiated protocol levels,
supporting mandatory locking, reservations, IO hints, etc.
Matt
--
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel