Hi fellow Riak users – a special shout-out to those I got a chance to meet
at Ricon 2012.
I learned some new things, interacted with really smart people, and decided
to do a Riak project to learn about this tech.
Five-node cluster, all proud of myself that I got JSON searching etc. working
(some speed and other things still to clean up), HAProxy in front, but
overall things were looking pretty good.
I’m pumping loads of JSON data into these KV pairs, and things are
looking good…
UNTIL….
I let the nodes run out of disk-space… OUCH rookie mistake.
But I’m thinking – this shouldn’t (but can) happen in real life, so let’s see
how well Riak recovers from such a thing. Learning opportunity!
--
The boxes were wedged HARD – no login possible, so they had to be rebooted.
Riak wasn’t running after reboot (just the epmd daemon); the rest couldn’t
start since the disk was still 100% full.
I brought some new filesystems online, (properly) copied the contents of
leveldb, pointed the config files at the new leveldb location, and cleaned
up some disk space. Did this on all 5 nodes.
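For anyone following along, the copy step looks roughly like this. A minimal Python sketch – the paths are hypothetical (the demo runs against throwaway temp directories), and the key point is that the node must be stopped so no .sst/.log files change mid-copy:

```python
import os
import shutil
import tempfile

def relocate_data_dir(old_dir, new_dir):
    """Copy a leveldb data directory wholesale onto a new filesystem.
    shutil.copytree copies every file and subdirectory; stop the Riak
    node first so the copy is a consistent snapshot."""
    shutil.copytree(old_dir, new_dir)

# Demo with throwaway directories; real paths would look something
# like /var/lib/riak/leveldb -> /mnt/newdisk/riak/leveldb (assumed).
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "0"))  # one partition directory
with open(os.path.join(src, "0", "000005.sst"), "wb") as f:
    f.write(b"fake sstable bytes")

dst = src + "_relocated"
relocate_data_dir(src, dst)
print(os.path.exists(os.path.join(dst, "0", "000005.sst")))  # True
```

After the copy, the eleveldb data_root in app.config gets pointed at the new location before restarting.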
On 3 out of 5 nodes, things came back to life… not bad. But on the other 2,
there is no joy (riak console error below).
I found some utilities that check the merge_index for consistency, and they
didn’t point to any issues.
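For what it’s worth, the crash below is in riak_core_ring_manager:read_ringfile/1 failing on binary_to_term(&lt;&lt;&gt;&gt;) – an empty binary – which smells like a ring file truncated to zero bytes when the disk filled up, rather than a merge_index problem. A quick way to spot such casualties, sketched in Python (the /var/lib/riak/ring path and file names are assumptions; the demo runs against a throwaway directory):

```python
import os
import tempfile

def empty_files(directory):
    """Return the zero-byte files in a directory. binary_to_term(<<>>)
    in read_ringfile/1 means Riak read an empty ring file, so any
    size-0 file here is a suspect."""
    return [f for f in sorted(os.listdir(directory))
            if os.path.getsize(os.path.join(directory, f)) == 0]

# Throwaway directory standing in for /var/lib/riak/ring (assumed path):
d = tempfile.mkdtemp()
# Truncated-to-nothing ring file, as disk-full would leave behind:
open(os.path.join(d, "riak_core_ring.default.20121116"), "wb").close()
# A non-empty older ring file, standing in for a valid one:
with open(os.path.join(d, "riak_core_ring.default.20121115"), "wb") as f:
    f.write(b"\x83not-a-real-ring")

print(empty_files(d))  # ['riak_core_ring.default.20121116']
```

If the newest ring file is the empty one, moving it aside so the node falls back to an older copy (or rebuilds its ring) seems worth trying – though I’d love confirmation from someone who knows the internals.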
--
So since 3/5 nodes are up, I THINK all my data is safe. Is there some sort
of scrubbing operation I can run to basically fsck the bejesus out of this
thing and ensure the data is correct/consistent/fully available across the
3 surviving nodes? It’s taking writes again, so it appears (on the surface)
to be in reasonable shape, even though 2/5 nodes are MIA.
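From what I understand, Riak repairs stale or missing replicas as a side effect of reads (read repair), so the closest thing to a scrub is walking every key and reading it once. A sketch of that idea – the fetch function is injected so this runs without a cluster; in real use it would be an HTTP GET against something like http://node:8098/riak/&lt;bucket&gt;/&lt;key&gt;, and the stand-in store below is purely illustrative:

```python
def touch_all_keys(keys, fetch):
    """Read every key once. Each read makes Riak compare the replicas'
    vector clocks and repair stale/missing copies (read repair), so a
    full key walk doubles as a consistency pass. Returns a count of
    readable keys and a list of keys that came back empty."""
    ok, missing = 0, []
    for k in keys:
        if fetch(k) is None:
            missing.append(k)
        else:
            ok += 1
    return ok, missing

# Stand-in fetch so the sketch runs without a cluster (assumption):
store = {"user:1": b"{}", "user:2": b"{}"}
print(touch_all_keys(["user:1", "user:2", "user:3"], store.get))
# (2, ['user:3'])
```

Caveat: listing all keys is expensive on a live cluster, so this would be an off-peak operation if it’s the right approach at all – corrections welcome.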
I guess I would have been up the creek if only 2/5 had come back to life
instead of 3/5?
Is there any value in (or possibility of) bringing one of the failed nodes
back to life, or should I just nuke its data and re-join it?
This sort of thing is bound to happen in real life to some unsuspecting
person out there. Is this a bug?
Should Riak have handled an out-of-space condition a bit more cleanly (if
not by avoiding the crash, then from a recovery perspective)? I would have
expected some sort of atomic recovery point.
Any general thoughts/guidance/education (other than “never let your riak
cluster run out of disk-space”) are much appreciated.
Thanks,
MikeE
-bash-4.1$ riak console
Exec: /usr/lib64/riak/erts-5.9.1/bin/erlexec -boot
/usr/lib64/riak/releases/1.2.0/riak -embedded -config
/etc/riak/app.config -pa
/usr/lib64/riak/basho-patches -args_file /etc/riak/vm.args --
console
Root: /usr/lib64/riak
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [async-threads:64]
[kernel-poll:true]
=INFO REPORT==== 16-Nov-2012::02:59:34 ===
alarm_handler: {set,{system_memory_high_watermark,[]}}
** /usr/lib64/riak/lib/observer-1.1/ebin/etop_txt.beam hides
/usr/lib64/riak/lib/basho-patches/etop_txt.beam
** Found 1 name clashes in code paths
02:59:35.273 [info] Application lager started on node 'riak@xxxxxxxxxxx'
02:59:35.392 [error] CRASH REPORT Process <0.149.0> with 0 neighbours
exited with reason: bad argument in call to erlang:binary_to_term(<<>>) in
riak_core_ring_manager:read_ringfile/1 line 154 in gen_server2:init_it/6
line 384
/usr/lib64/riak/lib/os_mon-2.2.9/priv/bin/memsup: Erlang has closed.
=INFO REPORT==== 16-Nov-2012::02:59:35 ===
alarm_handler: {clear,system_memory_high_watermark}
Erlang has closed
{"Kernel pid
terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
Crash dump was written to: /var/log/riak/erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
-bash-4.1$
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com