Hi Keith,
The first set of errors you saw ("Protocol: ~p: register error: ~p~n", with
{error,duplicate_name}) indicates that an Erlang node was already registered
under this name; node 2 may have been running in the background without you
realizing it.
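One way to confirm this is to ask epmd (the Erlang port mapper, visible in
your ps output as erts-5.8.2/bin/epmd) which names it currently has
registered. Below is a rough Python sketch, not an official tool: the function
names are my own, and it assumes `epmd -names` prints one line per node in the
form "name <node> at port <port>".

```python
import subprocess


def parse_epmd_names(output):
    """Return {node_name: port} parsed from the text printed by `epmd -names`."""
    names = {}
    for line in output.splitlines():
        parts = line.split()
        # Assumed line shape for a registered node: name <node> at port <port>
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            names[parts[1]] = int(parts[4])
    return names


def registered_names(epmd="epmd"):
    """Ask the local epmd which node names are registered right now.

    `epmd` may need to be the full path to the binary under erts-5.8.2/bin.
    """
    result = subprocess.run([epmd, "-names"], capture_output=True, text=True)
    return parse_epmd_names(result.stdout)
```

If the part of node 2's name before the "@" shows up in registered_names()
while you believe the node is stopped, a stale instance is most likely still
running.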
The second error, which occurred when you chose a different name, was probably
due to a port binding issue: the ports node 2 tried to bind to (handoff, web,
pb) were already occupied. Again, node 2 may have already been running in the
background.
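If you want to check for this before starting the node, here is a minimal
Python sketch that tries to bind each port and reports the ones that are
already taken. The helper names are hypothetical, and it assumes the node
listens on 127.0.0.1; the actual port numbers are whatever dev2/etc/app.config
specifies for handoff, web and pb.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding a TCP socket to host:port fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port.
            return True
        return False


def occupied_ports(ports, host="127.0.0.1"):
    """Given {label: port}, return only the entries whose port is already bound."""
    return {label: port for label, port in ports.items() if port_in_use(port, host)}
```

Call occupied_ports with a dict mapping "handoff", "web" and "pb" to the port
numbers from app.config; a non-empty result while node 2 is supposedly stopped
points to another process (quite possibly an old node 2) holding those ports.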
After rebooting the machine, it looks like node 2 started successfully.
Regarding the ringready failure, can you run "riak-admin status" on all
three nodes? Also, can you send in the log files for node 2 (the entire log
directory would be great)?
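In case it saves some typing, here is a rough Python sketch for gathering
both. It assumes the layout shown in your ps output (each node under
/Users/keith/src/riak/dev/devN with a log subdirectory) and that riak-admin
sits next to bin/riak; the function names are illustrative only.

```python
import os
import tarfile


def archive_logs(node_dir, out_path):
    """Pack <node_dir>/log into a gzipped tarball and return the member names."""
    log_dir = os.path.join(node_dir, "log")
    prefix = os.path.basename(os.path.normpath(node_dir)) + "-log"
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(log_dir, arcname=prefix)
    # Re-open the archive to report what actually went in.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()


def status_commands(dev_root, nodes=("dev1", "dev2", "dev3")):
    """Build the `riak-admin status` command line for each node directory."""
    return [[os.path.join(dev_root, n, "bin", "riak-admin"), "status"] for n in nodes]
```

For example, archive_logs("/Users/keith/src/riak/dev/dev2", "dev2-log.tar.gz")
produces a single file to attach, and each entry from
status_commands("/Users/keith/src/riak/dev") can be passed to subprocess.run.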
Thanks,
Dan
Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]
On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <[email protected]> wrote:
> Hi riak-users,
>
> I have a node in a cluster of 3 that failed and won't come back up. This
> is in a dev environment, so it's not like there's critical data on there.
> However, rather than start over with a new install, I want to learn how to
> recover from such a failure in production. I figured there was enough
> redundancy such that node 2 could recover with (at worst) a little help from
> nodes 1 and 3.
>
> When I tried to restart/reboot (I tried both), this showed up in
> erlang.log.1:
>
> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
> -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file
> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>
> Root: /Users/keith/src/riak/dev/dev2
>
> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>
> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,456}],[]]}
>
> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['[email protected]',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
>
> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
>
> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>
> {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>
> Crash dump was written to: erl_crash.dump
>
> Kernel pid terminated (application_controller)
> ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>
>
> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting the
> node in console mode with a different name. This didn't help; it just
> crashed again. I'm using bitcask (the default), while the example on that
> page gives output like InnoDB would return.
>
> kratos:dev2 keith$ bin/riak console -name differentname@nohost
> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
> -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file
> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name
> differentname@nohost
> Root: /Users/keith/src/riak/dev/dev2
> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
> [async-threads:64] [hipe] [kernel-poll:true]
>
>
> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
> alarm_handler: {set,{system_memory_high_watermark,[]}}
> ** Found 0 name clashes in code paths
>
> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
> application: riak_core
> exited: {shutdown,{riak_core_app,start,[normal,[]]}}
> type: permanent
>
> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
> alarm_handler: {clear,system_memory_high_watermark}
> {"Kernel pid
> terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>
> Crash dump was written to: erl_crash.dump
> Kernel pid terminated (application_controller)
> ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
> kratos:dev2 keith$
>
> After I rebooted my machine and tried starting the trio of riak nodes,
> node 2 is again not responding to pings, and "riak-admin ringready" on
> nodes 1 and 3 complains that node 2 is down. But in the log, node 2 is
> saying it's ALIVE. Also, I can see processes for all 3 nodes in ps:
>
> kratos:~ keith$ ps auxww | grep riak
> keith 360 0.2 3.4 2606932 143044 s006 Ss+ 12:05PM 3:21.61
> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 --
> -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith --
> -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config
> /Users/keith/src/riak/dev/dev1/etc/app.config -name [email protected]
> riak -- console
> keith 580 0.1 2.0 2549924 85492 s008 Ss+ 12:05PM 2:24.08
> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 --
> -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith --
> -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config
> /Users/keith/src/riak/dev/dev3/etc/app.config -name [email protected]
> riak -- console
> keith 380 0.0 0.0 2435004 268 ?? S 12:05PM 0:00.08
> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
> keith 358 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
> exec /Users/keith/src/riak/dev/dev1/bin/riak console
> keith 1633 0.0 0.0 2435548 0 s010 R+ 1:34PM 0:00.00 grep
> riak
> keith 578 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.00
> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
> exec /Users/keith/src/riak/dev/dev3/bin/riak console
> keith 470 0.0 2.0 2548688 83584 s007 Ss+ 12:05PM 0:33.41
> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 --
> -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith --
> -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config
> /Users/keith/src/riak/dev/dev2/etc/app.config -name [email protected]
> riak -- console
> keith 468 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
> exec /Users/keith/src/riak/dev/dev2/bin/riak console
> kratos:~ keith$
>
> I've attached the erl_crash.dump file. Anyone have an explanation or
> suggestions on how to proceed?
>
>
> Keith
>
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>