Hi fellow Riak users – a special shout-out to those I got a chance to meet
at Ricon 2012.
I learned some new things, interacted with really smart people, and decided
to do a Riak project to learn about this tech.
Five-node cluster, all proud of myself that I got JSON searching etc. working
(some speed and other things still to clean up), HAProxy in front, but
overall things were looking pretty good.
I’m pumping loads of JSON data into these KV pairs, and things are
looking good…
UNTIL….
I let the nodes run out of disk-space… OUCH rookie mistake.
But I’m thinking – this shouldn’t (but can) happen in real life, so let’s see
how well Riak recovers from such a thing. Learning opportunity!
--
The boxes were wedged HARD – no login possible, so they had to be rebooted.
Riak wasn’t running after reboot (just the epmd daemon); the rest couldn’t
start since the disk was still 100% full.
I brought some new filesystems online, (properly) copied the contents of
leveldb, pointed the config files at the new leveldb location, and cleaned
up some disk space. Did this on all 5 nodes.
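For anyone following along, the copy step looks roughly like this. A minimal Python sketch – the paths are hypothetical (the demo runs against throwaway temp directories), and the key point is that the node must be stopped so no .sst/.log files change mid-copy:

```python
import os
import shutil
import tempfile

def relocate_data_dir(old_dir, new_dir):
    """Copy a leveldb data directory wholesale onto a new filesystem.
    shutil.copytree copies every file and subdirectory; stop the Riak
    node first so the copy is a consistent snapshot."""
    shutil.copytree(old_dir, new_dir)

# Demo with throwaway directories; real paths would look something
# like /var/lib/riak/leveldb -> /mnt/newdisk/riak/leveldb (assumed).
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "0"))  # one partition directory
with open(os.path.join(src, "0", "000005.sst"), "wb") as f:
    f.write(b"fake sstable bytes")

dst = src + "_relocated"
relocate_data_dir(src, dst)
print(os.path.exists(os.path.join(dst, "0", "000005.sst")))  # True
```

After the copy, the eleveldb data_root in app.config gets pointed at the new location before restarting.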
On 3 out of 5 nodes, things came back to life… not bad. But on the other 2,
there is no joy (riak console error below).
I found some utilities that check the merge_index for consistency, and they
didn’t point to any issues.
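For what it’s worth, the crash below is in riak_core_ring_manager:read_ringfile/1 failing on binary_to_term(&lt;&lt;&gt;&gt;) – an empty binary – which smells like a ring file truncated to zero bytes when the disk filled up, rather than a merge_index problem. A quick way to spot such casualties, sketched in Python (the /var/lib/riak/ring path and file names are assumptions; the demo runs against a throwaway directory):

```python
import os
import tempfile

def empty_files(directory):
    """Return the zero-byte files in a directory. binary_to_term(<<>>)
    in read_ringfile/1 means Riak read an empty ring file, so any
    size-0 file here is a suspect."""
    return [f for f in sorted(os.listdir(directory))
            if os.path.getsize(os.path.join(directory, f)) == 0]

# Throwaway directory standing in for /var/lib/riak/ring (assumed path):
d = tempfile.mkdtemp()
# Truncated-to-nothing ring file, as disk-full would leave behind:
open(os.path.join(d, "riak_core_ring.default.20121116"), "wb").close()
# A non-empty older ring file, standing in for a valid one:
with open(os.path.join(d, "riak_core_ring.default.20121115"), "wb") as f:
    f.write(b"\x83not-a-real-ring")

print(empty_files(d))  # ['riak_core_ring.default.20121116']
```

If the newest ring file is the empty one, moving it aside so the node falls back to an older copy (or rebuilds its ring) seems worth trying – though I’d love confirmation from someone who knows the internals.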
--
So since 3/5 nodes are up, I THINK all my data is safe. Is there some sort
of scrubbing operation I can run to basically fsck the bejesus out of this
thing and ensure the data is correct/consistent/fully available across the
3 surviving nodes? It’s taking writes again, so it appears (on the surface)
to be in reasonable shape, even though 2/5 nodes are MIA.
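From what I understand, Riak repairs stale or missing replicas as a side effect of reads (read repair), so the closest thing to a scrub is walking every key and reading it once. A sketch of that idea – the fetch function is injected so this runs without a cluster; in real use it would be an HTTP GET against something like http://node:8098/riak/&lt;bucket&gt;/&lt;key&gt;, and the stand-in store below is purely illustrative:

```python
def touch_all_keys(keys, fetch):
    """Read every key once. Each read makes Riak compare the replicas'
    vector clocks and repair stale/missing copies (read repair), so a
    full key walk doubles as a consistency pass. Returns a count of
    readable keys and a list of keys that came back empty."""
    ok, missing = 0, []
    for k in keys:
        if fetch(k) is None:
            missing.append(k)
        else:
            ok += 1
    return ok, missing

# Stand-in fetch so the sketch runs without a cluster (assumption):
store = {"user:1": b"{}", "user:2": b"{}"}
print(touch_all_keys(["user:1", "user:2", "user:3"], store.get))
# (2, ['user:3'])
```

Caveat: listing all keys is expensive on a live cluster, so this would be an off-peak operation if it’s the right approach at all – corrections welcome.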
I guess I would have been up the creek if only 2/5 had come back to life
instead of 3/5?
Is there any value in (or possibility of) bringing one of the failed nodes
back to life, or should I just nuke its data and re-join it?
This sort of thing is bound to happen in real life to some unsuspecting
person out there. Is this a bug?
Should Riak have handled an out-of-space condition a bit more cleanly (if
not by avoiding the crash, then from a recovery perspective)? I would have
expected some sort of atomic recovery point.
Any general thoughts/guidance/education (other than “never let your riak
cluster run out of disk-space”) are much appreciated.
Thanks,
MikeE
-bash-4.1$ riak console
Exec: /usr/lib64/riak/erts-5.9.1/bin/erlexec -boot
/usr/lib64/riak/releases/1.2.0/riak -embedded -config
/etc/riak/app.config -pa
/usr/lib64/riak/basho-patches -args_file /etc/riak/vm.args --
console
Root: /usr/lib64/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [async-threads:64]
[kernel-poll:true]
=INFO REPORT==== 16-Nov-2012::02:59:34 ===
alarm_handler: {set,{system_memory_high_watermark,[]}}
** /usr/lib64/riak/lib/observer-1.1/ebin/etop_txt.beam hides
/usr/lib64/riak/lib/basho-patches/etop_txt.beam
** Found 1 name clashes in code paths
02:59:35.273 [info] Application lager started on node 'riak@xxxxxxxxxxx'
02:59:35.392 [error] CRASH REPORT Process <0.149.0> with 0 neighbours
exited with reason: bad argument in call to erlang:binary_to_term(<<>>) in
riak_core_ring_manager:read_ringfile/1 line 154 in gen_server2:init_it/6
line 384
/usr/lib64/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
=INFO REPORT==== 16-Nov-2012::02:59:35 ===
alarm_handler: {clear,system_memory_high_watermark}
Erlang has closed
{"Kernel pid
terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
-bash-4.1$
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com