begin  quoting Darren New as of Fri, May 04, 2007 at 08:54:28AM -0700:
> Stewart Stremler wrote:
> >It's sounding like it's less and less a good general-purpose approach
> >and more and more a specific-problem approach.
> 
> Here's a specific example. I have a program that communicates over the 
> Amazon S3 remote distributed file system. The user sets up a bunch of 
> named jobs, then launches off the same process on a dozen different 
> machines, and lets them all take on jobs as they become available.
> 
> The S3 semantics are that reads and writes are atomic, but there's no 
> locking or test-and-set mechanism.  You can delete a file someone else 
> just deleted without any errors (much to my initial surprise, as that 
> was going to be my test-and-set).

So let me see if I have our assumptions correct:

(1) Writes and reads are atomic, but may not be persistent. That is,
the data in a file will not ever be corrupt

(2) Data in a file might be silently replaced at any time.

(3) Files may be silently removed at any time.

(4) No latency guarantees are made for how long a change to become
universal.

(5) No ordering guarantees are made for which server updates which
of its siblings when, or how.

>                                   You can't create a file that someone 
> else can't write over an instant later. But the contents you read are 
> the complete contents that were written, and what you write gets updated 
> atomically at some point after you write it. So here's (approximately) 
> what I did, described from the point of view of the machine at 10.0.0.5 
> IP address.

Okay.

But to make things simpler...

...let's call your machine Y, the servers S#, and Murphy's machine M.

> 1) Negotiate who is the master:
> 
> 1A) If "Elected" exists, copy its contents to the file named "10.0.0.5" 
> and go to phase 3 if it matches my IP address, phase 2 if it doesn't.

So we read "Elected", keep the value, and then write whatever it was
back to a unique-per-machine (No concurrent use on a single machine)
filename.

Hm.

What if it ("Elected") doesn't exist?

> 1B) If "Nominated" exists, copy its contents to the file named 
> "10.0.0.5" and go to 1D.

Presumably, 1C is when "Nominated" doesn't exist?

> 1C) Write the string "10.0.0.5" into "10.0.0.5" and into "Nominated". 
> (Here, we don't think either of those files exist, so we think we're the 
> first to come online. Yes, this is a race condition. See 1F below.)

Y->S1: Does 'Nominated' exist?
S1->Y: No
M->S2: Does 'Nominated' exist?
S2->M: No.
M->S2: Write "10.6.6.6" into "Nominated"
M->S2: Write "10.6.6.6." into "10.6.6.6"
Y->S1: Write "10.0.0.5" into "Nominated"
Y->S1: Write "10.0.0.5" into "10.0.0.5"

> 1D) Read "Nominated". If it does not contain your own IP address and it 
> does not contain what you most recently wrote into the "10.0.0.5" file, 
> write the contents of Nominated into your own "10.0.0.5" file and go 
> back to 1A.

Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'
S1->S3: Sync Event: here's a file named "Nominated".
S2->S1: Sync Event: here's a file named "Nominated".
S3->S2: Sync Event: here's a file named "Nominated".
M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'

So Y has 10.6.6.6 and M has 10.0.0.5 as the contents of Nominated.

Potential infinite loop.

Whoops.

> 1E) If "Nominated" contained your own IP address, read every other file 
> in the directory besides "Nominated" and "Elected", and see if they all 
> match the contents of "Nominated". If not, go back to 1A.

So instead of the 1D transaction, we get:

Y->S1: Give me 'Nominated'
S1->Y: Here's 'Nominated'
M->S2: Give me 'Nominated'
S2->M: Here's 'Nominated'
Y: Good, it's me.
Y->S1: Give me all files in directory.
S1->Y: Files are 'Nominated' and '10.0.0.5'
M: Good, it's me.
M->S2: Give me all files in directory.
S2->M: Files are 'Nominated' and '10.6.6.6'

> 1F) Here, "Nominated" agrees with every machine's concept of what's in 
> Nominated. I.e., every machine has read the same value out of 
> "Nominated", and everyone read the same value, and it's ME! So I'm 
> elected. Write "10.0.0.5" into Elected. Go to phase 3.

Y->S1: Write "10.0.0.5" to "Elected".
M->S2: Write "10.6.6.6" to "Elected".

We have two masters. Whoops.

> 2) The wait-for-work steps:
> 
> 2A) I'm not elected as master, so write the file "READY-10.0.0.5".
> 
> 2B) Wait for "ASSIGNED-10.0.0.5" to show up, then delete "READY-10.0.0.5".
 
Spinloop?

This is those processes who don't think they are master.

> 2C) Read "Assigned." If it says "exit", delete my original 10.0.0.5 file 
> from phase 1 and exit.
> 
> 2D) If it has the name of a job, read and run the job, writing ongoing 
> status to "STATUS-10.0.0.5". When you finish, recreate "READY-10.0.0.5".

Who deletes ASSIGNED?

> 3) The assign-work steps:
> 
> 3A) I'm master. Scan the directory looking for something that begins 
> with "READY". If there's a matching "STATUS", read the status file and 
> store the results, then delete the STATUS file.

You can lose (a lot?) of the data that way, if the process writes to STATUS
between when you read it and when you deleted it.

> 3B) Look for a "READY" without a "STATUS". When found, pick a job, 
> assign it by writing it into "ASSIGNED" with the same extension. If none 
> are left, write "EXIT" there.

This REALLY depends on not having two masters.

> 3C) If there are no "READY" files and no jobs left, exit.

Let's say that M got to be master, and Y didn't. So Y posts a READY,
and M doesn't see it, and so exits.  Y is waiting on M, and M is done
and gone.

So you'd need a watchdog timer....

> 3D) Well, you get the idea.
> 
> Note that other than the initial "Nominated" file, nobody ever writes to 
> a file at the same time as anyone else. Indeed, I don't think there's 
> any file which two processes ever write to.

There's "Nominated" and "Elected".

Since the writes are atomic, I'm not entirely sure two files gives you
anything.

>                                             Also, the protocol doesn't 
> depend on the propagation of information to be synchronous - it's OK if 
> I overwrite a file and you read the old version a few times before you 
> see the change.

Using the IP address (actually, IP address + port number or pid might be
the better way to go, otherwise two processes on the same machine could
screw things up pretty bad) to partition the information is a clever
thing to do.

-- 
Time for a chalkboard and colored chalk!
Stewart Stremler

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-lpsg

Reply via email to