Hi Keith,

Can you try attaching to node 2 using "riak attach"? If that doesn't work, manually kill node 2 and run "riak console".

Once you have access to the console, type the following:

    1> node().              % the console will output the node name here
    2> erlang:get_cookie(). % the console will output the cookie here
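For reference, here is roughly what that exchange should look like on a
healthy dev node. The node name and cookie below are just the usual
dev-cluster defaults from etc/vm.args ("-name [email protected]" appears in
your ps output; "riak" is the stock dev cookie), so yours may differ:

    ([email protected])1> node().
    '[email protected]'
    ([email protected])2> erlang:get_cookie().
    riak

And if you do end up killing node 2 by hand before running "riak console",
something like this is all it takes (the pid is whatever ps reports for
dev2's beam.smp; only escalate to kill -9 if a plain kill doesn't work):

    kratos:dev2 keith$ ps ax | grep dev2 | grep beam.smp
    kratos:dev2 keith$ kill <pid>
    kratos:dev2 keith$ bin/riak console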
Let me know what those commands output.

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]

On Fri, Apr 1, 2011 at 2:34 PM, Keith Dreibelbis <[email protected]> wrote:
> Thanks for the response, Dan. Yes, the problem is that it *looks* like
> starting node 2 was successful (it says it's ALIVE and shows up in ps). But
> it's not responding to pings, isn't usable, and nodes 1 and 3 say node 2
> isn't connected.
>
> As you suggested, here is the output of riak-admin status for the 3 nodes,
> and I'll attach a tarball of node 2's log directory.
>
> Keith
>
> kratos:dev1 keith$ bin/riak-admin status
> 1-minute stats for '[email protected]'
> -------------------------------------------
> vnode_gets : 0
> vnode_puts : 0
> read_repairs : 0
> vnode_gets_total : 6251
> vnode_puts_total : 1064
> node_gets : 0
> node_gets_total : 4786
> node_get_fsm_time_mean : 0
> node_get_fsm_time_median : 0
> node_get_fsm_time_95 : 0
> node_get_fsm_time_99 : 0
> node_get_fsm_time_100 : 0
> node_puts : 0
> node_puts_total : 774
> node_put_fsm_time_mean : 0
> node_put_fsm_time_median : 0
> node_put_fsm_time_95 : 0
> node_put_fsm_time_99 : 0
> node_put_fsm_time_100 : 0
> read_repairs_total : 354
> cpu_nprocs : 127
> cpu_avg1 : 164
> cpu_avg5 : 202
> cpu_avg15 : 205
> mem_total : 3264444000
> mem_allocated : 3155680000
> disk : [{"/",488050672,13}]
> nodename : '[email protected]'
> connected_nodes : ['[email protected]']
> sys_driver_version : <<"1.5">>
> sys_global_heaps_size : 0
> sys_heap_type : private
> sys_logical_processors : 2
> sys_otp_release : <<"R14B01">>
> sys_process_count : 206
> sys_smp_support : true
> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
> sys_threads_enabled : true
> sys_thread_pool_size : 64
> sys_wordsize : 8
> ring_members : ['[email protected]','[email protected]','[email protected]']
> ring_num_partitions : 64
> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
> ring_creation_size : 64
> storage_backend : riak_kv_bitcask_backend
> pbc_connects_total : 350
> pbc_connects : 0
> pbc_active : 0
> riak_err_version : <<"1.0.1">>
> runtime_tools_version : <<"1.8.4.1">>
> basho_stats_version : <<"1.0.1">>
> luwak_version : <<"1.0.0">>
> skerl_version : <<"1.0.0">>
> riak_kv_version : <<"0.14.0">>
> bitcask_version : <<"1.1.5">>
> riak_core_version : <<"0.14.0">>
> riak_sysmon_version : <<"0.9.0">>
> luke_version : <<"0.2.3">>
> erlang_js_version : <<"0.5.0">>
> mochiweb_version : <<"1.7.1">>
> webmachine_version : <<"1.8.0">>
> crypto_version : <<"2.0.2">>
> os_mon_version : <<"2.2.5">>
> cluster_info_version : <<"1.1.0">>
> sasl_version : <<"2.1.9.2">>
> stdlib_version : <<"1.17.2">>
> kernel_version : <<"2.14.2">>
> executing_mappers : 0
>
> kratos:dev2 keith$ bin/riak-admin status
> Node is not running!
>
> kratos:dev3 keith$ bin/riak-admin status
> 1-minute stats for '[email protected]'
> -------------------------------------------
> vnode_gets : 0
> vnode_puts : 0
> read_repairs : 0
> vnode_gets_total : 7061
> vnode_puts_total : 1198
> node_gets : 0
> node_gets_total : 0
> node_get_fsm_time_mean : 0
> node_get_fsm_time_median : 0
> node_get_fsm_time_95 : 0
> node_get_fsm_time_99 : 0
> node_get_fsm_time_100 : 0
> node_puts : 0
> node_puts_total : 0
> node_put_fsm_time_mean : 0
> node_put_fsm_time_median : 0
> node_put_fsm_time_95 : 0
> node_put_fsm_time_99 : 0
> node_put_fsm_time_100 : 0
> read_repairs_total : 0
> cpu_nprocs : 134
> cpu_avg1 : 118
> cpu_avg5 : 161
> cpu_avg15 : 184
> mem_total : 3264252000
> mem_allocated : 3189744000
> disk : [{"/",488050672,13}]
> nodename : '[email protected]'
> connected_nodes : ['[email protected]']
> sys_driver_version : <<"1.5">>
> sys_global_heaps_size : 0
> sys_heap_type : private
> sys_logical_processors : 2
> sys_otp_release : <<"R14B01">>
> sys_process_count : 205
> sys_smp_support : true
> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
> sys_threads_enabled : true
> sys_thread_pool_size : 64
> sys_wordsize : 8
> ring_members : ['[email protected]','[email protected]','[email protected]']
> ring_num_partitions : 64
> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
> ring_creation_size : 64
> storage_backend : riak_kv_bitcask_backend
> pbc_connects_total : 0
> pbc_connects : 0
> pbc_active : 0
> riak_err_version : <<"1.0.1">>
> runtime_tools_version : <<"1.8.4.1">>
> basho_stats_version : <<"1.0.1">>
> luwak_version : <<"1.0.0">>
> skerl_version : <<"1.0.0">>
> riak_kv_version : <<"0.14.0">>
> bitcask_version : <<"1.1.5">>
> riak_core_version : <<"0.14.0">>
> riak_sysmon_version : <<"0.9.0">>
> luke_version : <<"0.2.3">>
> erlang_js_version : <<"0.5.0">>
> mochiweb_version : <<"1.7.1">>
> webmachine_version : <<"1.8.0">>
> crypto_version : <<"2.0.2">>
> os_mon_version : <<"2.2.5">>
> cluster_info_version : <<"1.1.0">>
> sasl_version : <<"2.1.9.2">>
> stdlib_version : <<"1.17.2">>
> kernel_version : <<"2.14.2">>
> executing_mappers : 0
>
>
> On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <[email protected]> wrote:
>
>> Hi Keith,
>>
>> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
>> indicates that an Erlang node was already running with this name; node 2
>> may have been running in the background without you realizing it.
>>
>> The second error, which occurred when choosing a different name, was
>> probably due to a port binding issue; the ports node 2 tried to bind
>> (handoff, web, pb) were already occupied. Again, node 2 may have already
>> been running in the background.
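>> A quick way to confirm either theory is to ask epmd which node names are
>> still registered and to see which ports the running Erlang VMs are
>> holding, e.g. (the epmd path below is just taken from your dev layout, so
>> adjust as needed):
>>
>>     kratos:dev2 keith$ erts-5.8.2/bin/epmd -names      # node names epmd knows about
>>     kratos:dev2 keith$ lsof -nP -i | grep beam.smp     # ports held by Erlang VMs
>>
>> If "dev2" is listed by epmd while riak-admin says the node is not
>> running, there is a stray VM that needs to be killed before restarting.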
>> After rebooting the machine it looks like starting node 2 was successful.
>> Regarding the ringready failure, can you run "riak-admin status" on all
>> three nodes? Also, can you send in the log files for node 2 (the entire
>> log directory would be great)?
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> [email protected]
>>
>> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <[email protected]> wrote:
>>
>>> Hi riak-users,
>>>
>>> I have a node in a cluster of 3 that failed and won't come back up. This
>>> is in a dev environment, so it's not like there's critical data on there.
>>> However, rather than start over with a new install, I want to learn how
>>> to recover from such a failure in production. I figured there was enough
>>> redundancy that node 2 could recover with (at worst) a little help from
>>> nodes 1 and 3.
>>>
>>> When I tried to restart/reboot (I tried both), this showed up in
>>> erlang.log.1:
>>>
>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>> -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>> Root: /Users/keith/src/riak/dev/dev2
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",
>>> ["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},
>>> {net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},
>>> {net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,
>>> {net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},
>>> {error_info,{exit,{error,badarg},[{gen_server,init_it,6},
>>> {proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},
>>> {messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},
>>> {trap_exit,true},{status,running},{heap_size,377},{stack_size,24},
>>> {reductions,456}],[]]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>> {local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},
>>> {offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,
>>> [['[email protected]',longnames]]}},{restart_type,permanent},{shutdown,2000},
>>> {child_type,worker}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>> {local,kernel_sup}},{errorContext,start_error},{reason,shutdown},
>>> {offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,
>>> start_link,[]}},{restart_type,permanent},{shutdown,infinity},
>>> {child_type,supervisor}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},
>>> {exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>>
>>> {"Kernel pid terminated",application_controller,
>>> "{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>>
>>> Crash dump was written to: erl_crash.dump
>>> Kernel pid terminated (application_controller)
>>> ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>>
>>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting
>>> the node in console mode with a different name. This didn't help; it just
>>> crashed again. I'm using bitcask (the default), while the example on that
>>> page gives output like InnoDB would return.
>>>
>>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file
>>> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name
>>> differentname@nohost
>>> Root: /Users/keith/src/riak/dev/dev2
>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>>> [async-threads:64] [hipe] [kernel-poll:true]
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> alarm_handler: {set,{system_memory_high_watermark,[]}}
>>> ** Found 0 name clashes in code paths
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> application: riak_core
>>> exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>>> type: permanent
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> alarm_handler: {clear,system_memory_high_watermark}
>>> {"Kernel pid terminated",application_controller,
>>> "{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>>
>>> Crash dump was written to: erl_crash.dump
>>> Kernel pid terminated (application_controller)
>>> ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>>> kratos:dev2 keith$
>>>
>>> After I rebooted my machine and started the three riak nodes again,
>>> node 2 is once more not responding to pings, and "riak-admin ringready"
>>> on nodes 1 and 3 complains that node 2 is down. But in the log, node 2
>>> says it's ALIVE. Also, I can see processes for all 3 nodes in ps:
>>>
>>> kratos:~ keith$ ps auxww | grep riak
>>> keith 360 0.2 3.4 2606932 143044 s006 Ss+ 12:05PM 3:21.61
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev1/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 580 0.1 2.0 2549924 85492 s008 Ss+ 12:05PM 2:24.08
>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev3/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 380 0.0 0.0 2435004 268 ?? S 12:05PM 0:00.08
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>>> keith 358 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
>>> exec /Users/keith/src/riak/dev/dev1/bin/riak console
>>> keith 1633 0.0 0.0 2435548 0 s010 R+ 1:34PM 0:00.00
>>> grep riak
>>> keith 578 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.00
>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
>>> exec /Users/keith/src/riak/dev/dev3/bin/riak console
>>> keith 470 0.0 2.0 2548688 83584 s007 Ss+ 12:05PM 0:33.41
>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev2/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 468 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
>>> exec /Users/keith/src/riak/dev/dev2/bin/riak console
>>> kratos:~ keith$
>>>
>>> I've attached the erl_crash.dump file. Anyone have an explanation or
>>> suggestions on how to proceed?
>>>
>>> Keith
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
