After reading today's recap, I am a bit unsure:

5) Q --- Would Riak handle an individual vnode failure the same way as
an entire node failure? (from grourk via #riak)

    A --- Yes. The request to that vnode would fail and will be routed
to the next available vnode

Is it really handled the same way? I don't believe handoff will occur. The R/W values still apply, of course, but I think there will be one less replica of the keys that map to the failed vnode until the situation is resolved. I have delved quite a bit into the riak code, but if I really missed something, I would be glad if someone could point me to the place where a vnode failure is detected. As far as I can see, the heavy lifting happens in riak_kv_util:try_cast/5 (https://github.com/basho/riak_kv/blob/riak_kv-0.14.1/src/riak_kv_util.erl#L78), which only checks whether the whole node is up.
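To make the concern concrete, here is a rough sketch of the shape of that check as I read it (illustrative only, not the actual riak_kv source; the real function takes more arguments and also performs the casts):

    %% Illustrative sketch, not the real riak_kv_util code. The point is that
    %% routing only asks "is the node up?", so a vnode that is alive but
    %% returning errors still looks like a perfectly good target.
    -module(routing_sketch).
    -export([route_targets/2]).

    %% Targets is a preflist of {Partition, Node} pairs; UpNodes is the list
    %% of nodes riak_core currently considers reachable.
    route_targets(Targets, UpNodes) ->
        lists:partition(
            fun({_Partition, Node}) -> lists:member(Node, UpNodes) end,
            Targets).

Nothing at that level looks at the health of the individual vnode process or its backend.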


On 24.03.2011 00:56, Nico Meyer wrote:
Hi Greg,

I don't think the vnodes will always die. I have seen some situations
(disk full, filesystem becoming read-only due to device errors,
corrupted bitcask files after a machine crash) where the vnode did not
crash, but the get and/or put requests returned errors.
Even if the process crashes, it will just be restarted, possibly over
and over again.
Also, the handoff logic only operates at the level of a whole node, not
individual vnodes, which makes monitoring and detecting disk failures
very important.

We were also thinking about how to use multiple disks per node. But it's
not a very pressing problem for us, since we have a lot of relatively
small entries (~1000 bytes), so the RAM used by bitcask becomes a problem
long before we can even fill one disk.
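To illustrate with rough, assumed numbers (not measurements): bitcask keeps every key in its in-memory keydir, so with small values RAM runs out well before the disk does.

    %% Back-of-the-envelope, in the Erlang shell. The per-key keydir overhead
    %% and key size below are assumptions for the sake of the example.
    1> RamPerKey = 40 + 20.                        % assumed ~40 bytes overhead + 20-byte key
    60
    2> DiskPerKey = 1000 + 20.                     % ~1000-byte value + key on disk
    1020
    3> Keys = (64*1024*1024*1024) div RamPerKey.   % keys that fit in 64 GB of RAM
    1145324612
    4> (Keys * DiskPerKey) div (1024*1024*1024).   % corresponding on-disk data, in GB
    1087

So under those assumptions, roughly a terabyte of values per node already needs on the order of 64 GB of RAM, well short of what a few disks can hold.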

Cheers,
Nico


On 23.03.2011 23:50, Greg Nelson wrote:
Hi Joe,

With a few hours of investigation today, your patch is looking
promising. Maybe you can give some more detail on what you did in your
experiments a few months ago?

What I did was set up an Ubuntu VM with three loopback file systems. Then
I built Riak 0.14.1 with your patch, configured as you described to spread
across the three disks. I ran a single node, and it correctly spread
partitions across the disks.

I then corrupted the file system on one of the disks (by zeroing out the
loop device), and did some more GETs and PUTs against Riak. In the logs
it looks like the vnode processes that had bitcasks on that disk died,
as expected, and the other vnodes continued to operate.

I need to do a bit more investigation with more than one node, but given
how well it handled this scenario, it seems like we're on the right
track.

Oh, one thing I noticed is that while Riak starts up, if there's a bad
disk then it will shut down (the whole node) at this line:

https://github.com/jtuple/riak_kv/blob/jdb-multi-dirs/src/riak_kv_bitcask_backend.erl#L103


That makes sense, but I'm wondering if it's possible to let the node
start since some of its vnodes would be able to open their bitcasks just
fine. I wonder if it's as simple as removing that line?
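Something along these lines is what I have in mind (a sketch only, assuming the backend could simply return an error for that one partition instead of stopping the node; the real riak_kv_bitcask_backend differs in the details):

    %% Hedged sketch, not the actual backend source. An unopenable bitcask
    %% fails only this vnode's backend start instead of shutting down the node.
    start(Partition, Config) ->
        DataRoot = proplists:get_value(data_root, Config, "data/bitcask"),
        BitcaskDir = filename:join(DataRoot, integer_to_list(Partition)),
        case bitcask:open(BitcaskDir, [read_write]) of
            {error, Reason} ->
                {error, Reason};   % this vnode fails, the rest keep running
            Ref ->
                {ok, Ref}
        end.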

Greg

On Tuesday, March 22, 2011 at 9:54 AM, Joseph Blomstedt wrote:

You're forgetting how awesome riak actually is. Given how riak is
implemented, my patches should work without any operational headaches
at all. Let me explain.

First, there was the one issue from yesterday. My initial patch didn't
reuse the same partition bitcask on the same node. I've fixed that in
a newer commit:
https://github.com/jtuple/riak_kv/commit/de6b83a4fb53c25b1013f31b8c4172cc40de73ed


Now, about how this all works in operation.

Let's consider a simple scenario under normal riak. The key concept
here is to realize that riak's vnodes are completely independent, and
that failure and partition ownership changes are handled through
handoff alone.

Let's say we have an 8-partition ring with 3 riak nodes:
n1 owns partitions 1,4,7
n2 owns partitions 2,5,8
n3 owns partitions 3,6
ie: Ring = (1/n1, 2/n2, 3/n3, 4/n1, 5/n2, 6/n3, 7/n1, 8/n2)

Each node runs an independent vnode for each partition it owns, and
each vnode will set up its own bitcask:

vnode 1/1: {n1-root}/data/bitcask/1
vnode 1/4: {n1-root}/data/bitcask/4
...
vnode 2/2: {n2-root}/data/bitcask/2
...
vnode 3/6: {n3-root}/data/bitcask/6

Reads/writes are routed to the appropriate vnodes and to the
appropriate bitcasks. Under failure, hinted handoff comes into play.

Let's have a write to preflist [1,2,3] while n2 is down/split. Since
n2 is down, riak will send the write meant for partition 2 to another
node, let's say n3. n3 will spawn a new vnode for partition 2 which is
initially empty:

vnode 3/2: {n3-root}/data/bitcask/2

and write the incoming data to the new bitcask.

Later, when n2 rejoins, n3 will eventually engage in handoff and send
all (k,v) pairs in its data/bitcask/2 to n2, which writes them into its
data/bitcask/2. After handing off the data, n3 will shut down its 3/2
vnode and delete the bitcask directory {n3-root}/data/bitcask/2.

Under node rebalancing / ownership changes, a similar sequence occurs.
For example, if a new node n4 takes ownership of partition 4, then n1
will hand off its data to n4 and then shut down its vnode and delete
its {n1-root}/data/bitcask/4.

If you take the above scenario, and change all the directories of the
form:
{NODE-root}/data/bitcask/P
to:
/mnt/DISK-N/NODE/bitcask/P

and allow DISK-N to be any randomly chosen directory in /mnt, then the
scenario plays out exactly the same provided that riak always selects
the same DISK-N for a given P on a given node (across nodes doesn't
matter, vnodes are independent). My new commit handles this. A simple
configuration could be:

n1-vars.config:
  {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n1",
                                "/mnt/bitcask/disk2/n1",
                                "/mnt/bitcask/disk3/n1"]}}
n2-vars.config:
  {bitcask_data_root, {random, ["/mnt/bitcask/disk1/n2",
                                "/mnt/bitcask/disk2/n2",
                                "/mnt/bitcask/disk3/n2"]}}
(...etc...)
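One simple way to get the "same DISK-N for a given P" property is to hash the partition index over the configured roots; this is just an illustrative sketch of the idea, not necessarily the exact selection logic in the commit above:

    %% Illustrative sketch: map a partition deterministically onto one of the
    %% configured roots, so a given partition always lands on the same disk.
    pick_root(Partition, Roots) ->
        lists:nth(1 + erlang:phash2(Partition, length(Roots)), Roots).

    %% e.g. pick_root(4, ["/mnt/bitcask/disk1/n1",
    %%                    "/mnt/bitcask/disk2/n1",
    %%                    "/mnt/bitcask/disk3/n1"]) always picks the same entry.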

There is no inherent need for symlinks, nor any need to pre-create
initial links per partition index. riak already creates and deletes
partition bitcask directories on demand. If a disk fails, then all
vnodes with bitcasks on that disk fail in the same manner as a disk
failure under normal riak. Standard read repair, handoff, and node
replacement apply.

-Joe

On Tue, Mar 22, 2011 at 9:53 AM, Alexander Sicular <sicul...@gmail.com> wrote:
Ya, my original message just highlighted the standard RAID 0, 1, and 5 that
most people/hardware should know/be able to support. There are better
options, and RAID 10 would be one of them.


@siculars on twitter
http://siculars.posterous.com
Sent from my iPhone
On Mar 22, 2011, at 8:43, Ryan Zezeski <rzeze...@gmail.com> wrote:



On Tue, Mar 22, 2011 at 10:01 AM, Alexander Sicular <sicul...@gmail.com> wrote:

Save your ops dudes the headache and just use raid 5 and be done
with it.

Depending on the number of disks available, I might even argue for running
software RAID 10 for better throughput and less chance of data loss (as long
as you can afford to cut your available storage in half on every machine).
It's not too hard to set up on modern Linux distros (mdadm); at least I was
doing it 5 years ago and I'm no sysadmin.

-Ryan
