Thanks for the response, Dan. Yes, the problem is that it *looks* like starting node 2 was successful (says it's ALIVE, shows up in ps). But it's not responding to pings, isn't usable, and nodes 1 and 3 say node 2 isn't connected.
As you suggested, here is the output of riak-admin status for the 3 nodes, and I'll attach a tarball for node 2's log directory.

Keith

kratos:dev1 keith$ bin/riak-admin status
1-minute stats for '[email protected]'
-------------------------------------------
vnode_gets : 0
vnode_puts : 0
read_repairs : 0
vnode_gets_total : 6251
vnode_puts_total : 1064
node_gets : 0
node_gets_total : 4786
node_get_fsm_time_mean : 0
node_get_fsm_time_median : 0
node_get_fsm_time_95 : 0
node_get_fsm_time_99 : 0
node_get_fsm_time_100 : 0
node_puts : 0
node_puts_total : 774
node_put_fsm_time_mean : 0
node_put_fsm_time_median : 0
node_put_fsm_time_95 : 0
node_put_fsm_time_99 : 0
node_put_fsm_time_100 : 0
read_repairs_total : 354
cpu_nprocs : 127
cpu_avg1 : 164
cpu_avg5 : 202
cpu_avg15 : 205
mem_total : 3264444000
mem_allocated : 3155680000
disk : [{"/",488050672,13}]
nodename : '[email protected]'
connected_nodes : ['[email protected]']
sys_driver_version : <<"1.5">>
sys_global_heaps_size : 0
sys_heap_type : private
sys_logical_processors : 2
sys_otp_release : <<"R14B01">>
sys_process_count : 206
sys_smp_support : true
sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
sys_system_architecture : <<"i386-apple-darwin10.7.0">>
sys_threads_enabled : true
sys_thread_pool_size : 64
sys_wordsize : 8
ring_members : ['[email protected]','[email protected]','[email protected]']
ring_num_partitions : 64
ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
ring_creation_size : 64
storage_backend : riak_kv_bitcask_backend
pbc_connects_total : 350
pbc_connects : 0
pbc_active : 0
riak_err_version : <<"1.0.1">>
runtime_tools_version : <<"1.8.4.1">>
basho_stats_version : <<"1.0.1">>
luwak_version : <<"1.0.0">>
skerl_version : <<"1.0.0">>
riak_kv_version : <<"0.14.0">>
bitcask_version : <<"1.1.5">>
riak_core_version : <<"0.14.0">>
riak_sysmon_version : <<"0.9.0">>
luke_version : <<"0.2.3">>
erlang_js_version : <<"0.5.0">>
mochiweb_version : <<"1.7.1">>
webmachine_version : <<"1.8.0">>
crypto_version : <<"2.0.2">>
os_mon_version : <<"2.2.5">>
cluster_info_version : <<"1.1.0">>
sasl_version : <<"2.1.9.2">>
stdlib_version : <<"1.17.2">>
kernel_version : <<"2.14.2">>
executing_mappers : 0

kratos:dev2 keith$ bin/riak-admin status
Node is not running!

kratos:dev3 keith$ bin/riak-admin status
1-minute stats for '[email protected]'
-------------------------------------------
vnode_gets : 0
vnode_puts : 0
read_repairs : 0
vnode_gets_total : 7061
vnode_puts_total : 1198
node_gets : 0
node_gets_total : 0
node_get_fsm_time_mean : 0
node_get_fsm_time_median : 0
node_get_fsm_time_95 : 0
node_get_fsm_time_99 : 0
node_get_fsm_time_100 : 0
node_puts : 0
node_puts_total : 0
node_put_fsm_time_mean : 0
node_put_fsm_time_median : 0
node_put_fsm_time_95 : 0
node_put_fsm_time_99 : 0
node_put_fsm_time_100 : 0
read_repairs_total : 0
cpu_nprocs : 134
cpu_avg1 : 118
cpu_avg5 : 161
cpu_avg15 : 184
mem_total : 3264252000
mem_allocated : 3189744000
disk : [{"/",488050672,13}]
nodename : '[email protected]'
connected_nodes : ['[email protected]']
sys_driver_version : <<"1.5">>
sys_global_heaps_size : 0
sys_heap_type : private
sys_logical_processors : 2
sys_otp_release : <<"R14B01">>
sys_process_count : 205
sys_smp_support : true
sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
sys_system_architecture : <<"i386-apple-darwin10.7.0">>
sys_threads_enabled : true
sys_thread_pool_size : 64
sys_wordsize : 8
ring_members : ['[email protected]','[email protected]','[email protected]']
ring_num_partitions : 64
ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
ring_creation_size : 64
storage_backend : riak_kv_bitcask_backend
pbc_connects_total : 0
pbc_connects : 0
pbc_active : 0
riak_err_version : <<"1.0.1">>
runtime_tools_version : <<"1.8.4.1">>
basho_stats_version : <<"1.0.1">>
luwak_version : <<"1.0.0">>
skerl_version : <<"1.0.0">>
riak_kv_version : <<"0.14.0">>
bitcask_version : <<"1.1.5">>
riak_core_version : <<"0.14.0">>
riak_sysmon_version : <<"0.9.0">>
luke_version : <<"0.2.3">>
erlang_js_version : <<"0.5.0">>
mochiweb_version : <<"1.7.1">>
webmachine_version : <<"1.8.0">>
crypto_version : <<"2.0.2">>
os_mon_version : <<"2.2.5">>
cluster_info_version : <<"1.1.0">>
sasl_version : <<"2.1.9.2">>
stdlib_version : <<"1.17.2">>
kernel_version : <<"2.14.2">>
executing_mappers : 0

On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <[email protected]> wrote:

> Hi Keith,
>
> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
> indicate an Erlang node was already running with this name; node 2 may have
> been running in the background without you realizing it.
>
> The second error which occurred when choosing a different name was probably
> due to a port binding issue; this means the ports node 2 tried binding to
> (handoff, web, pb) were already occupied. Again, node 2 may have already
> been running in the background.
>
> After rebooting the machine it looks like starting node 2 was successful.
> Regarding the ringready failure, can you run "riak-admin status" on all
> three nodes? Also, can you send in the log files for node 2 (the entire log
> directory would be great)?
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> [email protected]
>
>
> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <[email protected]> wrote:
>
>> Hi riak-users,
>>
>> I have a node in a cluster of 3 that failed and won't come back up. This
>> is in a dev environment, so it's not like there's critical data on there.
>> However, rather than start over with a new install, I want to learn how to
>> recover from such a failure in production. I figured there was enough
>> redundancy such that node 2 could recover with (at worst) a little help from
>> nodes 1 and 3.
>>
>> When I tried to restart/reboot (I tried both), this showed up in
>> erlang.log.1:
>>
>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>
>> Root: /Users/keith/src/riak/dev/dev2
>>
>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,377},{stack_size,24},{reductions,456}],[]]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['[email protected]',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
>>
>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>
>> {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>
>> Crash dump was written to: erl_crash.dump
>>
>> Kernel pid terminated (application_controller) ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>
>>
>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting the
>> node in console mode with a different name. This didn't help, it just
>> crashed again. I'm using bitcask (the default) while the example on that
>> page gives output like InnoDB would return.
>>
>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name differentname@nohost
>> Root: /Users/keith/src/riak/dev/dev2
>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]
>>
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>> alarm_handler: {set,{system_memory_high_watermark,[]}}
>> ** Found 0 name clashes in code paths
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>> application: riak_core
>> exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>> type: permanent
>>
>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>> alarm_handler: {clear,system_memory_high_watermark}
>> {"Kernel pid terminated",application_controller,"{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>
>> Crash dump was written to: erl_crash.dump
>> Kernel pid terminated (application_controller) ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>> kratos:dev2 keith$
>>
>> After I rebooted my machine and tried starting the trio of riak nodes,
>> again node 2 is not responding to pings, and "riak-admin ringready" from
>> nodes 1 and 3 complain that node 2 is down. But in the log, node 2 is
>> saying it's ALIVE. Also, I can see processes for all 3 nodes in ps:
>>
>> kratos:~ keith$ ps auxww | grep riak
>> keith 360 0.2 3.4 2606932 143044 s006 Ss+ 12:05PM 3:21.61 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev1/etc/app.config -name [email protected] riak -- console
>> keith 580 0.1 2.0 2549924 85492 s008 Ss+ 12:05PM 2:24.08 /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev3/etc/app.config -name [email protected] riak -- console
>> keith 380 0.0 0.0 2435004 268 ?? S 12:05PM 0:00.08 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>> keith 358 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01 /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log exec /Users/keith/src/riak/dev/dev1/bin/riak console
>> keith 1633 0.0 0.0 2435548 0 s010 R+ 1:34PM 0:00.00 grep riak
>> keith 578 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.00 /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log exec /Users/keith/src/riak/dev/dev3/bin/riak console
>> keith 470 0.0 2.0 2548688 83584 s007 Ss+ 12:05PM 0:33.41 /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 -- -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith -- -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config -name [email protected] riak -- console
>> keith 468 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01 /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log exec /Users/keith/src/riak/dev/dev2/bin/riak console
>> kratos:~ keith$
>>
>> I've attached the erl_crash.dump file. Anyone have an explanation or
>> suggestions on how to proceed?
>>
>>
>> Keith
>>
>>
>> _______________________________________________
>> riak-users mailing list
>> [email protected]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>
dev2-log.tar.gz
Description: GNU Zip compressed data
