Stewart Stremler wrote:
(1) Writes and reads are atomic, but may not be persistent. That is,
the data in a file will not ever be corrupt
Um, well, they're atomic and persistent. But if two people write at the
same time, you'll get whichever the server thought came later. You know,
like a database transaction.
(2) Data in a file might be silently replaced at any time.
I don't know what you mean by "silently". It may be replaced by some
other process writing to it, just like a row in a database.
(3) Files may be silently removed at any time.
I don't know what you mean by "silently". It can be delete by some other
process at any time, just like a row in a database. Just like if you
take a directory listing and then try to open a file in the listing,
there's no guarantee the file is still there.
(4) No latency guarantees are made for how long a change to become
universal.
Apparently not. In practice, it seems to be at most a few seconds.
(5) No ordering guarantees are made for which server updates which
of its siblings when, or how.
I don't know which "siblings" you're talking about. A consistent view
(for some meaning of "consistent") is presented to the users of the
service. How the service is handled internally is irrelevant to the users.
...let's call your machine Y, the servers S#, and Murphy's machine M.
1) Negotiate who is the master:
1A) If "Elected" exists, copy its contents to the file named "10.0.0.5"
and go to phase 3 if it matches my IP address, phase 2 if it doesn't.
So we read "Elected", keep the value, and then write whatever it was
back to a unique-per-machine (No concurrent use on a single machine)
filename.
Hm.
What if it ("Elected") doesn't exist?
Then you don't do anything.
if (x == 0) {printf("Yep!\n");}
What if X != 0?
1B) If "Nominated" exists, copy its contents to the file named
"10.0.0.5" and go to 1D.
Presumably, 1C is when "Nominated" doesn't exist?
Yes.
1C) Write the string "10.0.0.5" into "10.0.0.5" and into "Nominated".
(Here, we don't think either of those files exist, so we think we're the
first to come online. Yes, this is a race condition. See 1F below.)
Y->S1: Does 'Nominated' exist?
S1->Y: No
M->S2: Does 'Nominated' exist?
S2->M: No.
M->S2: Write "10.6.6.6" into "Nominated"
M->S2: Write "10.6.6.6." into "10.6.6.6"
Y->S1: Write "10.0.0.5" into "Nominated"
Y->S1: Write "10.0.0.5" into "10.0.0.5"
Correct. Um, except there's only one S. The only thing that S1 and S2
throws into the mix is that if The first M->S and Y->S both happen
concurrently(*) then you don't know which of the two possible values
wind up in "Nominated".
(*) Concurrently: The second one to start starts before the first one ends.
1D) Read "Nominated". If it does not contain your own IP address and it
does not contain what you most recently wrote into the "10.0.0.5" file,
write the contents of Nominated into your own "10.0.0.5" file and go
back to 1A.
Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'
So Y now has 10.0.0.5
S1->S3: Sync Event: here's a file named "Nominated".
S3 now has 10.0.0.5
S2->S1: Sync Event: here's a file named "Nominated".
S1 now has 10.0.0.6
S3->S2: Sync Event: here's a file named "Nominated".
S2 now has 10.0.0.5
M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'
M now has 10.0.0.5
So Y has 10.6.6.6 and M has 10.0.0.5 as the contents of Nominated.
No it doesn't, according to your steps.
I'll grant you there is probably a sequence of updates that would cause
10.0.0.5 and 10.0.0.6 to swap places, but, well, that would be a pretty
broken way to implement a distributed file system, now wouldn't it?
Where if two people update a file, then forever after two different
people reading it at the same time could get two different values?
Potential infinite loop.
Only if forever I can fail to see the updates written by other people.
While the documentation doesn't actually give any liveness guarantees,
I'd expect that to be a pretty broken implementation of a distributed
file system.
1E) If "Nominated" contained your own IP address, read every other file
in the directory besides "Nominated" and "Elected", and see if they all
match the contents of "Nominated". If not, go back to 1A.
So instead of the 1D transaction, we get:
Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'
M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'
Y: Good, it's me.
Y->S1: Give me all files in directory.
S1->Y: Files are 'Nominated' and '10.0.0.5'
M: Good, it's me.
M->S2: Give me all files in directory.
S2->M: Files are 'Nominated' and '10.6.6.6'
Well, yes, again if you think I'm storing separate files on separate
services or something, yes, it would be broken. But I'm not. It's one
bucket on one service, with the guarantee that eventually (and
presumedly promptly[+]) you get the results you're expecting.
[+] Promptly: within the normal latency time for propagation inside
Amazon's network. Note that even FTP doesn't guarantee that the file you
wrote a week ago will be updated with the contents you stored, for example.
1F) Here, "Nominated" agrees with every machine's concept of what's in
Nominated. I.e., every machine has read the same value out of
"Nominated", and everyone read the same value, and it's ME! So I'm
elected. Write "10.0.0.5" into Elected. Go to phase 3.
Y->S1: Write "10.0.0.5" to "Elected".
M->S2: Write "10.6.6.6" to "Elected".
We have two masters. Whoops.
You persist in behaving as if I'm talking to two different services. Not
sure why you think this is the case. You're also blaming failures of the
file system on my algorithm.
If I have a two-disk RAID, I don't start postulating situations where
two writers write to the same file and get different answers. I have to
account for two users writing to the same file. I don't have to account
for each user seeing their own data persistently.
Spinloop?
Yes.
This is those processes who don't think they are master.
Yes.
Who deletes ASSIGNED?
I'm simplifying a bit. There's actually multiple assigned and statuses,
along with log files that the servant writes and the master reads, for
example.
3) The assign-work steps:
3A) I'm master. Scan the directory looking for something that begins
with "READY". If there's a matching "STATUS", read the status file and
store the results, then delete the STATUS file.
You can lose (a lot?) of the data that way, if the process writes to STATUS
between when you read it and when you deleted it.
Nobody writes to the status file twice in a row. Remember that writes
are atomic, so I bundle up all the data I want for the status (which is
about 2 lines) and write it.
3B) Look for a "READY" without a "STATUS". When found, pick a job,
assign it by writing it into "ASSIGNED" with the same extension. If none
are left, write "EXIT" there.
This REALLY depends on not having two masters.
Yes.
3C) If there are no "READY" files and no jobs left, exit.
Let's say that M got to be master, and Y didn't. So Y posts a READY,
and M doesn't see it, and so exits. Y is waiting on M, and M is done
and gone.
So you'd need a watchdog timer....
Everyone runs watchdog timers, yes. As I said, I simplified because the
discussion was about how to do lockless synchronization, not the details
of assigning jobs, reading log files, etc.
For example, when M exits, it deletes its own status file, and Y sees
that go away and realizes there's no master, so it logs that it's
exiting for lack of master and exits too.
3D) Well, you get the idea.
Note that other than the initial "Nominated" file, nobody ever writes to
a file at the same time as anyone else. Indeed, I don't think there's
any file which two processes ever write to.
There's "Nominated" and "Elected".
Right. Only the elected machine ever writes to Elected.
Since the writes are atomic, I'm not entirely sure two files gives you
anything.
The creation of Elected signals that the master has been chosen. It's
also a slight efficiency thing, in that if Elected exists, a new slave
coming online doesn't have to look at more than one file to see who the
master is.
Using the IP address (actually, IP address + port number or pid might be
the better way to go, otherwise two processes on the same machine could
screw things up pretty bad) to partition the information is a clever
thing to do.
Yeah, I ran into that a bit, when I realized the Master and one Slave
could both run on the same machine. It's a little bit ugly, and I might
rework it some when the time comes to clean it up for further release
(which will be when Tcl 8.5 goes gold).
And it's actually the host name. But you can set it to actually identify
itself however you want, and run out of whichever directory you like. In
my testing, I often had five slaves, five competing masters, and the GUI
all running on my pour single-core machine here. :-)
--
Darren New / San Diego, CA, USA (PST)
His kernel fu is strong.
He studied at the Shao Linux Temple.
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg