Stewart Stremler wrote:
(1) Writes and reads are atomic, but may not be persistent. That is,
the data in a file will not ever be corrupt

Um, well, they're atomic and persistent. But if two people write at the same time, you'll get whichever the server thought came later. You know, like a database transaction.

(2) Data in a file might be silently replaced at any time.

I don't know what you mean by "silently". It may be replaced by some other process writing to it, just like a row in a database.

(3) Files may be silently removed at any time.

I don't know what you mean by "silently". It can be delete by some other process at any time, just like a row in a database. Just like if you take a directory listing and then try to open a file in the listing, there's no guarantee the file is still there.

(4) No latency guarantees are made for how long a change to become
universal.

Apparently not. In practice, it seems to be at most a few seconds.

(5) No ordering guarantees are made for which server updates which
of its siblings when, or how.

I don't know which "siblings" you're talking about. A consistent view (for some meaning of "consistent") is presented to the users of the service. How the service is handled internally is irrelevant to the users.

...let's call your machine Y, the servers S#, and Murphy's machine M.

1) Negotiate who is the master:

1A) If "Elected" exists, copy its contents to the file named "10.0.0.5" and go to phase 3 if it matches my IP address, phase 2 if it doesn't.

So we read "Elected", keep the value, and then write whatever it was
back to a unique-per-machine (No concurrent use on a single machine)
filename.

Hm.

What if it ("Elected") doesn't exist?

Then you don't do anything.

if (x == 0) {printf("Yep!\n");}
What if X != 0?

1B) If "Nominated" exists, copy its contents to the file named "10.0.0.5" and go to 1D.

Presumably, 1C is when "Nominated" doesn't exist?

Yes.

1C) Write the string "10.0.0.5" into "10.0.0.5" and into "Nominated". (Here, we don't think either of those files exist, so we think we're the first to come online. Yes, this is a race condition. See 1F below.)

Y->S1: Does 'Nominated' exist?
S1->Y: No
M->S2: Does 'Nominated' exist?
S2->M: No.
M->S2: Write "10.6.6.6" into "Nominated"
M->S2: Write "10.6.6.6." into "10.6.6.6"
Y->S1: Write "10.0.0.5" into "Nominated"
Y->S1: Write "10.0.0.5" into "10.0.0.5"

Correct. Um, except there's only one S. The only thing that S1 and S2 throws into the mix is that if The first M->S and Y->S both happen concurrently(*) then you don't know which of the two possible values wind up in "Nominated".

(*) Concurrently: The second one to start starts before the first one ends.

1D) Read "Nominated". If it does not contain your own IP address and it does not contain what you most recently wrote into the "10.0.0.5" file, write the contents of Nominated into your own "10.0.0.5" file and go back to 1A.

Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'

So Y now has 10.0.0.5

S1->S3: Sync Event: here's a file named "Nominated".

S3 now has 10.0.0.5

S2->S1: Sync Event: here's a file named "Nominated".

S1 now has 10.0.0.6

S3->S2: Sync Event: here's a file named "Nominated".

S2 now has 10.0.0.5

M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'

M now has 10.0.0.5

So Y has 10.6.6.6 and M has 10.0.0.5 as the contents of Nominated.

No it doesn't, according to your steps.

I'll grant you there is probably a sequence of updates that would cause 10.0.0.5 and 10.0.0.6 to swap places, but, well, that would be a pretty broken way to implement a distributed file system, now wouldn't it? Where if two people update a file, then forever after two different people reading it at the same time could get two different values?

Potential infinite loop.

Only if forever I can fail to see the updates written by other people. While the documentation doesn't actually give any liveness guarantees, I'd expect that to be a pretty broken implementation of a distributed file system.

1E) If "Nominated" contained your own IP address, read every other file in the directory besides "Nominated" and "Elected", and see if they all match the contents of "Nominated". If not, go back to 1A.

So instead of the 1D transaction, we get:

Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'
M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'
Y: Good, it's me.
Y->S1: Give me all files in directory.
S1->Y: Files are 'Nominated' and '10.0.0.5'
M: Good, it's me.
M->S2: Give me all files in directory.
S2->M: Files are 'Nominated' and '10.6.6.6'

Well, yes, again if you think I'm storing separate files on separate services or something, yes, it would be broken. But I'm not. It's one bucket on one service, with the guarantee that eventually (and presumedly promptly[+]) you get the results you're expecting.

[+] Promptly: within the normal latency time for propagation inside Amazon's network. Note that even FTP doesn't guarantee that the file you wrote a week ago will be updated with the contents you stored, for example.


1F) Here, "Nominated" agrees with every machine's concept of what's in Nominated. I.e., every machine has read the same value out of "Nominated", and everyone read the same value, and it's ME! So I'm elected. Write "10.0.0.5" into Elected. Go to phase 3.

Y->S1: Write "10.0.0.5" to "Elected".
M->S2: Write "10.6.6.6" to "Elected".

We have two masters. Whoops.

You persist in behaving as if I'm talking to two different services. Not sure why you think this is the case. You're also blaming failures of the file system on my algorithm.

If I have a two-disk RAID, I don't start postulating situations where two writers write to the same file and get different answers. I have to account for two users writing to the same file. I don't have to account for each user seeing their own data persistently.

Spinloop?

Yes.

This is those processes who don't think they are master.

Yes.

Who deletes ASSIGNED?

I'm simplifying a bit. There's actually multiple assigned and statuses, along with log files that the servant writes and the master reads, for example.

3) The assign-work steps:

3A) I'm master. Scan the directory looking for something that begins with "READY". If there's a matching "STATUS", read the status file and store the results, then delete the STATUS file.

You can lose (a lot?) of the data that way, if the process writes to STATUS
between when you read it and when you deleted it.

Nobody writes to the status file twice in a row. Remember that writes are atomic, so I bundle up all the data I want for the status (which is about 2 lines) and write it.

3B) Look for a "READY" without a "STATUS". When found, pick a job, assign it by writing it into "ASSIGNED" with the same extension. If none are left, write "EXIT" there.

This REALLY depends on not having two masters.

Yes.

3C) If there are no "READY" files and no jobs left, exit.

Let's say that M got to be master, and Y didn't. So Y posts a READY,
and M doesn't see it, and so exits.  Y is waiting on M, and M is done
and gone.

So you'd need a watchdog timer....

Everyone runs watchdog timers, yes. As I said, I simplified because the discussion was about how to do lockless synchronization, not the details of assigning jobs, reading log files, etc.

For example, when M exits, it deletes its own status file, and Y sees that go away and realizes there's no master, so it logs that it's exiting for lack of master and exits too.


3D) Well, you get the idea.

Note that other than the initial "Nominated" file, nobody ever writes to a file at the same time as anyone else. Indeed, I don't think there's any file which two processes ever write to.

There's "Nominated" and "Elected".

Right. Only the elected machine ever writes to Elected.

Since the writes are atomic, I'm not entirely sure two files gives you
anything.

The creation of Elected signals that the master has been chosen. It's also a slight efficiency thing, in that if Elected exists, a new slave coming online doesn't have to look at more than one file to see who the master is.

Using the IP address (actually, IP address + port number or pid might be
the better way to go, otherwise two processes on the same machine could
screw things up pretty bad) to partition the information is a clever
thing to do.

Yeah, I ran into that a bit, when I realized the Master and one Slave could both run on the same machine. It's a little bit ugly, and I might rework it some when the time comes to clean it up for further release (which will be when Tcl 8.5 goes gold).

And it's actually the host name. But you can set it to actually identify itself however you want, and run out of whichever directory you like. In my testing, I often had five slaves, five competing masters, and the GUI all running on my pour single-core machine here. :-)

--
  Darren New / San Diego, CA, USA (PST)
    His kernel fu is strong.
    He studied at the Shao Linux Temple.

--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg

Reply via email to