Hi Dan,

It seems I have to say "never mind, it fixed itself". I killed it and ran
the console, like you suggested, and after it output some messages about
handoffs and merges, I did the commands you mentioned:

([email protected])1> node().
'[email protected]'
([email protected])2> erlang:get_cookie().
riak
([email protected])3> q().

and then "riak start", and the node is now happily back in the ring.

What surprised me was that "riak restart" and "riak reboot" didn't seem to
do anything in this situation. It just got into an unresponsive state, and
the process had to be killed to fix it. But perhaps this is the normal
thing to do for an unresponsive node?

Anyway, thanks for the help, my problem is resolved.

Keith

On Fri, Apr 1, 2011 at 4:41 PM, Dan Reverri <[email protected]> wrote:
> Hi Keith,
>
> Can you try attaching to node 2 using "riak attach"? If that doesn't work,
> manually kill node 2 and run "riak console".
>
> Once you have access to the console, type the following:
> 1> node().
> % the console will output the node name here
>
> 2> erlang:get_cookie().
> % the console will output the cookie here
>
> Let me know what those commands output.
>
> Thanks,
> Dan
>
> Daniel Reverri
> Developer Advocate
> Basho Technologies, Inc.
> [email protected]
>
>
> On Fri, Apr 1, 2011 at 2:34 PM, Keith Dreibelbis <[email protected]> wrote:
>
>> Thanks for the response, Dan. Yes, the problem is that it *looks* like
>> starting node 2 was successful (says it's ALIVE, shows up in ps). But it's
>> not responding to pings, isn't usable, and nodes 1 and 3 say node 2 isn't
>> connected.
>>
>> As you suggested, here is the output of riak-admin status for the 3 nodes,
>> and I'll attach a tarball for node 2's log directory.
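To summarize the sequence that resolved it: kill the stuck VM, open a
foreground console to verify the node's name and cookie, quit, then start
normally. A sketch of that transcript (the pid is a placeholder and the
dev2 paths are illustrative for this dev-cluster layout):

```
$ ps auxww | grep beam.smp        # find the stuck node's VM
$ kill <pid>                      # plain kill first; -9 only if it ignores TERM
$ dev/dev2/bin/riak console       # watch for handoff/merge messages
([email protected])1> node().
'[email protected]'
([email protected])2> erlang:get_cookie().
riak
([email protected])3> q().
$ dev/dev2/bin/riak start
$ dev/dev2/bin/riak ping          # expect "pong" once it rejoins the ring
```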
>>
>> Keith
>>
>> kratos:dev1 keith$ bin/riak-admin status
>> 1-minute stats for '[email protected]'
>> -------------------------------------------
>> vnode_gets : 0
>> vnode_puts : 0
>> read_repairs : 0
>> vnode_gets_total : 6251
>> vnode_puts_total : 1064
>> node_gets : 0
>> node_gets_total : 4786
>> node_get_fsm_time_mean : 0
>> node_get_fsm_time_median : 0
>> node_get_fsm_time_95 : 0
>> node_get_fsm_time_99 : 0
>> node_get_fsm_time_100 : 0
>> node_puts : 0
>> node_puts_total : 774
>> node_put_fsm_time_mean : 0
>> node_put_fsm_time_median : 0
>> node_put_fsm_time_95 : 0
>> node_put_fsm_time_99 : 0
>> node_put_fsm_time_100 : 0
>> read_repairs_total : 354
>> cpu_nprocs : 127
>> cpu_avg1 : 164
>> cpu_avg5 : 202
>> cpu_avg15 : 205
>> mem_total : 3264444000
>> mem_allocated : 3155680000
>> disk : [{"/",488050672,13}]
>> nodename : '[email protected]'
>> connected_nodes : ['[email protected]']
>> sys_driver_version : <<"1.5">>
>> sys_global_heaps_size : 0
>> sys_heap_type : private
>> sys_logical_processors : 2
>> sys_otp_release : <<"R14B01">>
>> sys_process_count : 206
>> sys_smp_support : true
>> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
>> [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
>> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
>> sys_threads_enabled : true
>> sys_thread_pool_size : 64
>> sys_wordsize : 8
>> ring_members : ['[email protected]','[email protected]','[email protected]']
>> ring_num_partitions : 64
>> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
>> ring_creation_size : 64
>> storage_backend : riak_kv_bitcask_backend
>> pbc_connects_total : 350
>> pbc_connects : 0
>> pbc_active : 0
>> riak_err_version : <<"1.0.1">>
>> runtime_tools_version : <<"1.8.4.1">>
>> basho_stats_version : <<"1.0.1">>
>> luwak_version : <<"1.0.0">>
>> skerl_version : <<"1.0.0">>
>> riak_kv_version : <<"0.14.0">>
>> bitcask_version : <<"1.1.5">>
>> riak_core_version : <<"0.14.0">>
>> riak_sysmon_version : <<"0.9.0">>
>> luke_version : <<"0.2.3">>
>> erlang_js_version : <<"0.5.0">>
>> mochiweb_version : <<"1.7.1">>
>> webmachine_version : <<"1.8.0">>
>> crypto_version : <<"2.0.2">>
>> os_mon_version : <<"2.2.5">>
>> cluster_info_version : <<"1.1.0">>
>> sasl_version : <<"2.1.9.2">>
>> stdlib_version : <<"1.17.2">>
>> kernel_version : <<"2.14.2">>
>> executing_mappers : 0
>>
>> kratos:dev2 keith$ bin/riak-admin status
>> Node is not running!
>>
>> kratos:dev3 keith$ bin/riak-admin status
>> 1-minute stats for '[email protected]'
>> -------------------------------------------
>> vnode_gets : 0
>> vnode_puts : 0
>> read_repairs : 0
>> vnode_gets_total : 7061
>> vnode_puts_total : 1198
>> node_gets : 0
>> node_gets_total : 0
>> node_get_fsm_time_mean : 0
>> node_get_fsm_time_median : 0
>> node_get_fsm_time_95 : 0
>> node_get_fsm_time_99 : 0
>> node_get_fsm_time_100 : 0
>> node_puts : 0
>> node_puts_total : 0
>> node_put_fsm_time_mean : 0
>> node_put_fsm_time_median : 0
>> node_put_fsm_time_95 : 0
>> node_put_fsm_time_99 : 0
>> node_put_fsm_time_100 : 0
>> read_repairs_total : 0
>> cpu_nprocs : 134
>> cpu_avg1 : 118
>> cpu_avg5 : 161
>> cpu_avg15 : 184
>> mem_total : 3264252000
>> mem_allocated : 3189744000
>> disk : [{"/",488050672,13}]
>> nodename : '[email protected]'
>> connected_nodes : ['[email protected]']
>> sys_driver_version : <<"1.5">>
>> sys_global_heaps_size : 0
>> sys_heap_type : private
>> sys_logical_processors : 2
>> sys_otp_release : <<"R14B01">>
>> sys_process_count : 205
>> sys_smp_support : true
>> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit]
>> [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
>> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
>> sys_threads_enabled : true
>> sys_thread_pool_size : 64
>> sys_wordsize : 8
>> ring_members : ['[email protected]','[email protected]','[email protected]']
>> ring_num_partitions : 64
>> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
>> ring_creation_size : 64
>> storage_backend : riak_kv_bitcask_backend
>> pbc_connects_total : 0
>> pbc_connects : 0
>> pbc_active : 0
>> riak_err_version : <<"1.0.1">>
>> runtime_tools_version : <<"1.8.4.1">>
>> basho_stats_version : <<"1.0.1">>
>> luwak_version : <<"1.0.0">>
>> skerl_version : <<"1.0.0">>
>> riak_kv_version : <<"0.14.0">>
>> bitcask_version : <<"1.1.5">>
>> riak_core_version : <<"0.14.0">>
>> riak_sysmon_version : <<"0.9.0">>
>> luke_version : <<"0.2.3">>
>> erlang_js_version : <<"0.5.0">>
>> mochiweb_version : <<"1.7.1">>
>> webmachine_version : <<"1.8.0">>
>> crypto_version : <<"2.0.2">>
>> os_mon_version : <<"2.2.5">>
>> cluster_info_version : <<"1.1.0">>
>> sasl_version : <<"2.1.9.2">>
>> stdlib_version : <<"1.17.2">>
>> kernel_version : <<"2.14.2">>
>> executing_mappers : 0
>>
>>
>>
>> On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <[email protected]> wrote:
>>
>>> Hi Keith,
>>>
>>> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
>>> indicate an Erlang node was already running with this name; node 2 may
>>> have been running in the background without you realizing it.
>>>
>>> The second error which occurred when choosing a different name was
>>> probably due to a port binding issue; this means the ports node 2 tried
>>> binding to (handoff, web, pb) were already occupied. Again, node 2 may
>>> have already been running in the background.
>>>
>>> After rebooting the machine it looks like starting node 2 was successful.
>>> Regarding the ringready failure, can you run "riak-admin status" on all
>>> three nodes? Also, can you send in the log files for node 2 (the entire
>>> log directory would be great)?
>>>
>>> Thanks,
>>> Dan
>>>
>>> Daniel Reverri
>>> Developer Advocate
>>> Basho Technologies, Inc.
>>> [email protected]
>>>
>>>
>>> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <[email protected]> wrote:
>>>
>>>> Hi riak-users,
>>>>
>>>> I have a node in a cluster of 3 that failed and won't come back up.
>>>> This is in a dev environment, so it's not like there's critical data on
>>>> there. However, rather than start over with a new install, I want to
>>>> learn how to recover from such a failure in production. I figured there
>>>> was enough redundancy such that node 2 could recover with (at worst) a
>>>> little help from nodes 1 and 3.
>>>>
>>>> When I tried to restart/reboot (I tried both), this showed up in
>>>> erlang.log.1:
>>>>
>>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>>> -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>>> Root: /Users/keith/src/riak/dev/dev2
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",
>>>> ["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},
>>>> {net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},
>>>> {net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,
>>>> {net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},
>>>> {error_info,{exit,{error,badarg},[{gen_server,init_it,6},
>>>> {proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},
>>>> {messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},
>>>> {trap_exit,true},{status,running},{heap_size,377},{stack_size,24},
>>>> {reductions,456}],[]]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>>> {local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},
>>>> {offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,
>>>> [['[email protected]',longnames]]}},{restart_type,permanent},
>>>> {shutdown,2000},{child_type,worker}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>>> {local,kernel_sup}},{errorContext,start_error},{reason,shutdown},
>>>> {offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,
>>>> start_link,[]}},{restart_type,permanent},{shutdown,infinity},
>>>> {child_type,supervisor}]}]}
>>>>
>>>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},
>>>> {exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>>>
>>>> {"Kernel pid terminated",application_controller,
>>>> "{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>>>
>>>> Crash dump was written to: erl_crash.dump
>>>>
>>>> Kernel pid terminated (application_controller)
>>>> ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>>>
>>>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting
>>>> the node in console mode with a different name. This didn't help, it
>>>> just crashed again. I'm using bitcask (the default) while the example
>>>> on that page gives output like InnoDB would return.
>>>>
>>>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>>> -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>>> -name differentname@nohost
>>>> Root: /Users/keith/src/riak/dev/dev2
>>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>>>> [async-threads:64] [hipe] [kernel-poll:true]
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>> alarm_handler: {set,{system_memory_high_watermark,[]}}
>>>> ** Found 0 name clashes in code paths
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>> application: riak_core
>>>> exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>>>> type: permanent
>>>>
>>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>>> alarm_handler: {clear,system_memory_high_watermark}
>>>> {"Kernel pid terminated",application_controller,
>>>> "{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>>>
>>>> Crash dump was written to: erl_crash.dump
>>>> Kernel pid terminated (application_controller)
>>>> ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>>>> kratos:dev2 keith$
>>>>
>>>> After I rebooted my machine and tried starting the trio of riak nodes,
>>>> again node 2 is not responding to pings, and "riak-admin ringready" from
>>>> nodes 1 and 3 complain that node 2 is down. But in the log, node 2 is
>>>> saying it's ALIVE.
>>>> Also, I can see processes for all 3 nodes in ps:
>>>>
>>>> kratos:~ keith$ ps auxww | grep riak
>>>> keith 360 0.2 3.4 2606932 143044 s006 Ss+ 12:05PM 3:21.61
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64
>>>> -- -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home
>>>> /Users/keith -- -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak
>>>> -embedded -config /Users/keith/src/riak/dev/dev1/etc/app.config
>>>> -name [email protected] riak -- console
>>>> keith 580 0.1 2.0 2549924 85492 s008 Ss+ 12:05PM 2:24.08
>>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64
>>>> -- -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home
>>>> /Users/keith -- -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak
>>>> -embedded -config /Users/keith/src/riak/dev/dev3/etc/app.config
>>>> -name [email protected] riak -- console
>>>> keith 380 0.0 0.0 2435004 268 ?? S 12:05PM 0:00.08
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>>>> keith 358 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
>>>> exec /Users/keith/src/riak/dev/dev1/bin/riak console
>>>> keith 1633 0.0 0.0 2435548 0 s010 R+ 1:34PM 0:00.00 grep riak
>>>> keith 578 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.00
>>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
>>>> exec /Users/keith/src/riak/dev/dev3/bin/riak console
>>>> keith 470 0.0 2.0 2548688 83584 s007 Ss+ 12:05PM 0:33.41
>>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64
>>>> -- -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home
>>>> /Users/keith -- -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak
>>>> -embedded -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>>> -name [email protected] riak -- console
>>>> keith 468 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
>>>> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
>>>> exec /Users/keith/src/riak/dev/dev2/bin/riak console
>>>> kratos:~ keith$
>>>>
>>>> I've attached the erl_crash.dump file. Anyone have an explanation or
>>>> suggestions on how to proceed?
>>>>
>>>> Keith
>>>>
>>>> _______________________________________________
>>>> riak-users mailing list
>>>> [email protected]
>>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
