Hi Flavio,
> I don't understand why you have to ship the log to the read only replicas.
> Aren't you storing the log on HDFS currently? Can't they read from HDFS
> directly?
Possibly the replicas could "tail" the master's WAL; I was using the term log
shipping in the abstract. However, I'm not an HDFS expert, so I'm unsure whether
we could read the last (partial) block of the WAL. Newly written data exists
only in memory until flush, so without some sort of direct replication the WAL
would be the only option for transmitting that data.
> I wonder why you are choosing 3 for the size of a clique and not letting it
> be a free parameter.
It would be, but 3 seems a reasonable default. (?)
> Are you choosing 3 to avoid the replication overhead?
Yes.
> #1 is relatively simple but trades away the consistency
> I don't see where you could have inconsistencies here. Would you mind
> elaborating a bit further?
At any given instant, queries to a replica may not return the same result as
the (write) master for data that is still in the memstore and (possibly) in the
last block of the WAL.
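To make that concrete, here is a toy sketch of the stale read I have in mind.
It is not HBase or HDFS code: the Master and Replica classes, the "visible"
counter, and sync() are all hypothetical, and it simply assumes that a replica
can only read the part of the WAL that has been made visible to readers.

```python
class Master:
    """Toy write master: edits go to an in-memory store and a WAL."""

    def __init__(self):
        self.memstore = {}
        self.wal = []      # every appended edit, in order
        self.visible = 0   # assumption: readers see only the first `visible` edits

    def put(self, key, value):
        self.wal.append((key, value))
        self.memstore[key] = value

    def sync(self):
        # Hypothetical stand-in for whatever makes WAL edits readable by others.
        self.visible = len(self.wal)

    def get(self, key):
        return self.memstore.get(key)


class Replica:
    """Toy read-only replica that tails the visible part of the master's WAL."""

    def __init__(self, master):
        self.master = master
        self.applied = 0   # number of WAL edits already applied locally
        self.store = {}

    def tail(self):
        for key, value in self.master.wal[self.applied:self.master.visible]:
            self.store[key] = value
        self.applied = self.master.visible

    def get(self, key):
        return self.store.get(key)


master = Master()
replica = Replica(master)

master.put("row1", "a")
master.sync()
replica.tail()

master.put("row1", "b")        # in the memstore, not yet visible in the WAL
replica.tail()
stale = replica.get("row1")    # replica still answers with the old value
fresh = master.get("row1")     # master answers with the new value

master.sync()
replica.tail()
caught_up = replica.get("row1")
```

Between the second put() and the next sync(), the replica and the master
disagree about row1; that window is the inconsistency I meant.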
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back.
- Piet Hein (via Tom White)
--- On Wed, 2/2/11, Flavio Junqueira <[email protected]> wrote:
From: Flavio Junqueira <[email protected]>
Subject: Re: Extracting Zab from Zookeeper
Date: Wednesday, February 2, 2011, 2:14 AM
Hi Andrew, Interesting use case, thanks for sharing. I'm curious about a few
things:
On Feb 1, 2011, at 5:38 PM, Andrew Purtell wrote:
Two ideas actually:
1) Do pretty straightforward log shipping from region master to read only
replicas.
I don't understand why you have to ship the log to the read only replicas.
Aren't you storing the log on HDFS currently? Can't they read from HDFS
directly?
2) Divide the cluster into quorum 3-cliques. Extract ZAB and use it to maintain
consensus on writes from region master to two read only replicas. Run the
consensus protocol in parallel with HDFS hflush to the write ahead log. Needs a
lot of work filling in the detail, obviously, but that's the general notion.
I wonder why you are choosing 3 for the size of a clique and not letting it be
a free parameter. I would think that this is a decision for the user. Are you
choosing 3 to avoid the replication overhead?
#1 is relatively simple but trades away the consistency for which HBase is
indicated, in exchange for higher availability (for reads) when regions are in
transition.
I don't see where you could have inconsistencies here. Would you mind
elaborating a bit further?
#2 is not simple at all but may let us maintain replicas that are fully
consistent at all times with the region master, without lowering region master
write performance unacceptably, while also gaining the higher availability (for
reads) when regions are in transition.
Agreed, it will be tricky, especially because we would have to extract Zab
first.
Cheers,
-Flavio
flavio junqueira
research scientist
[email protected]
direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301