Re: Metadata replication in tabled

2010-06-28 Thread Pete Zaitcev
On Mon, 28 Jun 2010 08:37:51 -0400
Jeff Darcy jda...@redhat.com wrote:

> First, it seems like trying to do stuff under BDB replication, letting
> them control the flow, is proving to be rather painful - over a thousand
> lines in metarep.c plus other bits elsewhere, all constrained by their
> expectations wrt blocking, dropping messages, etc.  Might it not be
> simpler to handle the replication *above* BDB instead, shipping the
> operations ourselves to single-node BDB instances?  Simpler still might
> be to let a general framework like Gizzard handle the N-way replication
> and failover, or switch to a data store that's designed from the ground
> up around that need. [...]

I thought of it a little and decided to try the base API approach first,
for a couple of reasons.

First, I am ignorant of things like Gizzard, so when I started
imagining how the update forwarding and leases would actually work,
it started looking way longer than 1000 lines of C.

Second, I am afraid that people will point and ask, "why didn't you
use rep_start()?" We already reap NIH critique with Zookeeper.
Now if I tried, found bugs in db4/BDB, and documented them, it would
be different and my conscience would be clear.

Getting all of the replication exposed in tabled is really tempting.
For one thing, if we do it, we can replace db4 with TC or anything else.
But it's just... too much. I don't have the balls to tackle it now.
Honestly I expected to finish it all in 1 week, but it actually took 3+.
The roll-my-own replication would take me forever (how about 6 months?).
Do you want tabled working for you or always in progress?

> (a) The minor problem is that if the second (inner) check for
> nrp->length < 3 fails, then we return directly - leaking *nrp.
> Perhaps we should jump to the ncld_read_free at the end instead.

Awww, that was silly. Thanks.

> (b) I'd also question whether checking nrp->length this way is necessary
> at all, since cldu_parse_master should fail in those cases anyway.  Why
> not just rearrange the loop to catch such errors that way?

The idea was to special-case the empty file so I can see a printout.
A syntax error is different - maybe a version mismatch. I even wanted
the would-be masters to try and truncate the MASTER file before trying
to lock it.
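The distinction might look something like the sketch below. The parser here is a mock standing in for cldu_parse_master, and the classify helper is hypothetical; the point is just the two separate branches:

```c
#include <string.h>

/* Mocked stand-in for cldu_parse_master: accepts anything with "host:port"
 * shape, i.e. containing a colon. */
static int parse_master(const char *buf, int len)
{
        return (len > 0 && memchr(buf, ':', len) != NULL) ? 0 : -1;
}

/* Sketch of the special case: an empty MASTER file just means no master
 * has been elected yet, so print and keep waiting; anything unparsable
 * is treated as a real error, possibly a version mismatch. */
static const char *classify_master(const char *buf, int len)
{
        if (len == 0)
                return "empty";         /* no master yet */
        if (parse_master(buf, len) != 0)
                return "garbage";       /* syntax error: log loudly */
        return "ok";
}
```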

> (c) Lastly, regarding the comment about the gap between lock and write,
> I think single retry of only the read doesn't buy us much. [...]
> At the other end of the scale, it's also not hard
> to imagine a node managing to take the lock and then itself aborting
> before the write, again causing other nodes to fail.  What should happen
> in this second case, I'd argue, is that CLD should eventually detect the
> failure and break the lock, which would allow another waiting node to
> take it.

Well, yeah... I guess I was too lazy and reluctant to create yet
another state machine for this. Maybe I should just bite the bullet
and make tabled fully multi-threaded. It was likely to come next
anyway since you complained about the abysmal performance (I do not
know yet what the issues with performance are, but threads are
likely to participate). But if so, a thread may just as easily loop,
as the ncld API intends.
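The looping-thread idea could be as simple as the following sketch. Everything here is mocked for illustration (read_master stands in for the CLD read, and the real code would block on a CLD event rather than spin); it just shows how a dedicated thread replaces the one-shot retry state machine:

```c
#include <stdio.h>

/* Mock read of the MASTER file: returns 0 bytes until the lock holder
 * gets around to writing it. */
static int attempts_until_write = 2;

static int read_master(char *buf, size_t len)
{
        if (attempts_until_write-- > 0)
                return 0;               /* lock taken, file not written yet */
        return snprintf(buf, len, "node1:8080");
}

/* Sketch: a thread can simply loop on the read instead of encoding a
 * single retry into a state machine. */
static int wait_for_master(char *buf, size_t len, int max_tries)
{
        int i, n;

        for (i = 0; i < max_tries; i++) {
                n = read_master(buf, len);
                if (n > 0)
                        return 0;       /* got a master record */
                /* in real code: sleep, or wait on a CLD event here */
        }
        return -1;                      /* give up; CLD should break the lock */
}
```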

-- Pete
--
To unsubscribe from this list: send the line unsubscribe hail-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Metadata replication in tabled

2010-06-25 Thread Jeff Garzik

On 06/24/2010 08:31 PM, Pete Zaitcev wrote:

> I worked on fixing the metadata replication in tabled. There were some
> difficulties in the existing code; in particular, the aliasing between the
> hostname used to identify nodes and the hostname used in bind() for
> listening was impossible to work around in repmgr. In the end I gave
> up on repmgr and switched tabled to the Base API. So, the replication
> works now... for some values of "works", which is still progress.

> We essentially have a tabled that can really be considered replicated.
> Before, it was only data replication, which was great and all but
> useless against disk failures in tabled's database. I think it's
> a major threshold for tabled.


er, huh?  In addition to data replication, we already have metadata 
replication via db4 repmgr in tabled.git, which ensures metadata db 
integrity in the case of disk or tabled node failure.


The core problem with current tabled.git is that S3 clients expect all 
nodes to support PUT/DELETE as well as GET.  Our current use w/ db4 
slave mode does not fulfill this client requirement.


Your work here, moving to the base replication API, eliminates several 
obstacles on the path to making all tabled nodes support PUT/DELETE. 
But it is not true to say that metadata replication did not exist prior 
to this patch.


With either repmgr or base API, we still need to make failover more 
transparent to our S3 clients.
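One simple-minded way to paper over the PUT/DELETE gap in the meantime would be for a slave to redirect writes to the current master rather than fail them. This is purely illustrative, not anything in tabled.git; the routing function and its arguments are invented for the sketch:

```c
#include <string.h>

/* Illustrative only: a slave answers mutating S3 verbs with a
 * 307 Temporary Redirect to the current master instead of erroring,
 * while reads are served locally from the replicated metadata. */
static int route_request(const char *method, int we_are_master)
{
        int is_write = !strcmp(method, "PUT") ||
                       !strcmp(method, "DELETE") ||
                       !strcmp(method, "POST");

        if (we_are_master || !is_write)
                return 200;             /* serve locally */
        return 307;                     /* redirect to the master node */
}
```

Whether S3 clients actually follow such redirects is its own question, which is part of why failover transparency is still open.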




> Unfortunately, the code is rather ugly. I tried to create a kind
> of optional replication layer, so that tdbadm could be built
> without it. Although I succeeded, the result is a hideous mess of
> methods and callbacks, functions with side effects, and a bunch
> of poorly laid out state machines. In places I cannot wrap my own
> head around what's going on without the help of pencil and paper.

> So, while it works, it's not ready to go in. Still, I'm going
> to throw it out here in case I get hit by a bus, or if anyone wants
> an example of using db4 replication early.


Based on a quick read, it seems straightforward, and looks like 
something I can try tomorrow...


Very excited to try this :)

Jeff



