Hi Keith,

Can you try attaching to node 2 using "riak attach"? If that doesn't work, manually kill node 2 and run "riak console".

Once you have access to the console, type the following:

    1> node().              % the console will output the node name here
    2> erlang:get_cookie(). % the console will output the cookie here
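For reference, here is roughly what that exchange should look like on a
healthy dev node. The node name and cookie below are just the usual
dev-cluster defaults from etc/vm.args ("-name [email protected]" appears in
your ps output; "riak" is the stock dev cookie), so yours may differ:

    ([email protected])1> node().
    '[email protected]'
    ([email protected])2> erlang:get_cookie().
    riak

And if you do end up killing node 2 by hand before running "riak console",
something like this is all it takes (the pid is whatever ps reports for
dev2's beam.smp; only escalate to kill -9 if a plain kill doesn't work):

    kratos:dev2 keith$ ps ax | grep dev2 | grep beam.smp
    kratos:dev2 keith$ kill <pid>
    kratos:dev2 keith$ bin/riak console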
Let me know what those commands output.

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[email protected]

On Fri, Apr 1, 2011 at 2:34 PM, Keith Dreibelbis <[email protected]> wrote:
> Thanks for the response, Dan. Yes, the problem is that it *looks* like
> starting node 2 was successful (it says it's ALIVE and shows up in ps). But
> it's not responding to pings, isn't usable, and nodes 1 and 3 say node 2
> isn't connected.
>
> As you suggested, here is the output of riak-admin status for the 3 nodes,
> and I'll attach a tarball of node 2's log directory.
>
> Keith
>
> kratos:dev1 keith$ bin/riak-admin status
> 1-minute stats for '[email protected]'
> -------------------------------------------
> vnode_gets : 0
> vnode_puts : 0
> read_repairs : 0
> vnode_gets_total : 6251
> vnode_puts_total : 1064
> node_gets : 0
> node_gets_total : 4786
> node_get_fsm_time_mean : 0
> node_get_fsm_time_median : 0
> node_get_fsm_time_95 : 0
> node_get_fsm_time_99 : 0
> node_get_fsm_time_100 : 0
> node_puts : 0
> node_puts_total : 774
> node_put_fsm_time_mean : 0
> node_put_fsm_time_median : 0
> node_put_fsm_time_95 : 0
> node_put_fsm_time_99 : 0
> node_put_fsm_time_100 : 0
> read_repairs_total : 354
> cpu_nprocs : 127
> cpu_avg1 : 164
> cpu_avg5 : 202
> cpu_avg15 : 205
> mem_total : 3264444000
> mem_allocated : 3155680000
> disk : [{"/",488050672,13}]
> nodename : '[email protected]'
> connected_nodes : ['[email protected]']
> sys_driver_version : <<"1.5">>
> sys_global_heaps_size : 0
> sys_heap_type : private
> sys_logical_processors : 2
> sys_otp_release : <<"R14B01">>
> sys_process_count : 206
> sys_smp_support : true
> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
> sys_threads_enabled : true
> sys_thread_pool_size : 64
> sys_wordsize : 8
> ring_members : ['[email protected]','[email protected]','[email protected]']
> ring_num_partitions : 64
> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
> ring_creation_size : 64
> storage_backend : riak_kv_bitcask_backend
> pbc_connects_total : 350
> pbc_connects : 0
> pbc_active : 0
> riak_err_version : <<"1.0.1">>
> runtime_tools_version : <<"1.8.4.1">>
> basho_stats_version : <<"1.0.1">>
> luwak_version : <<"1.0.0">>
> skerl_version : <<"1.0.0">>
> riak_kv_version : <<"0.14.0">>
> bitcask_version : <<"1.1.5">>
> riak_core_version : <<"0.14.0">>
> riak_sysmon_version : <<"0.9.0">>
> luke_version : <<"0.2.3">>
> erlang_js_version : <<"0.5.0">>
> mochiweb_version : <<"1.7.1">>
> webmachine_version : <<"1.8.0">>
> crypto_version : <<"2.0.2">>
> os_mon_version : <<"2.2.5">>
> cluster_info_version : <<"1.1.0">>
> sasl_version : <<"2.1.9.2">>
> stdlib_version : <<"1.17.2">>
> kernel_version : <<"2.14.2">>
> executing_mappers : 0
>
> kratos:dev2 keith$ bin/riak-admin status
> Node is not running!
>
> kratos:dev3 keith$ bin/riak-admin status
> 1-minute stats for '[email protected]'
> -------------------------------------------
> vnode_gets : 0
> vnode_puts : 0
> read_repairs : 0
> vnode_gets_total : 7061
> vnode_puts_total : 1198
> node_gets : 0
> node_gets_total : 0
> node_get_fsm_time_mean : 0
> node_get_fsm_time_median : 0
> node_get_fsm_time_95 : 0
> node_get_fsm_time_99 : 0
> node_get_fsm_time_100 : 0
> node_puts : 0
> node_puts_total : 0
> node_put_fsm_time_mean : 0
> node_put_fsm_time_median : 0
> node_put_fsm_time_95 : 0
> node_put_fsm_time_99 : 0
> node_put_fsm_time_100 : 0
> read_repairs_total : 0
> cpu_nprocs : 134
> cpu_avg1 : 118
> cpu_avg5 : 161
> cpu_avg15 : 184
> mem_total : 3264252000
> mem_allocated : 3189744000
> disk : [{"/",488050672,13}]
> nodename : '[email protected]'
> connected_nodes : ['[email protected]']
> sys_driver_version : <<"1.5">>
> sys_global_heaps_size : 0
> sys_heap_type : private
> sys_logical_processors : 2
> sys_otp_release : <<"R14B01">>
> sys_process_count : 205
> sys_smp_support : true
> sys_system_version : <<"Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:64] [hipe] [kernel-poll:true]">>
> sys_system_architecture : <<"i386-apple-darwin10.7.0">>
> sys_threads_enabled : true
> sys_thread_pool_size : 64
> sys_wordsize : 8
> ring_members : ['[email protected]','[email protected]','[email protected]']
> ring_num_partitions : 64
> ring_ownership : <<"[{'[email protected]',21},{'[email protected]',21},{'[email protected]',22}]">>
> ring_creation_size : 64
> storage_backend : riak_kv_bitcask_backend
> pbc_connects_total : 0
> pbc_connects : 0
> pbc_active : 0
> riak_err_version : <<"1.0.1">>
> runtime_tools_version : <<"1.8.4.1">>
> basho_stats_version : <<"1.0.1">>
> luwak_version : <<"1.0.0">>
> skerl_version : <<"1.0.0">>
> riak_kv_version : <<"0.14.0">>
> bitcask_version : <<"1.1.5">>
> riak_core_version : <<"0.14.0">>
> riak_sysmon_version : <<"0.9.0">>
> luke_version : <<"0.2.3">>
> erlang_js_version : <<"0.5.0">>
> mochiweb_version : <<"1.7.1">>
> webmachine_version : <<"1.8.0">>
> crypto_version : <<"2.0.2">>
> os_mon_version : <<"2.2.5">>
> cluster_info_version : <<"1.1.0">>
> sasl_version : <<"2.1.9.2">>
> stdlib_version : <<"1.17.2">>
> kernel_version : <<"2.14.2">>
> executing_mappers : 0
>
>
> On Fri, Apr 1, 2011 at 2:17 PM, Dan Reverri <[email protected]> wrote:
>
>> Hi Keith,
>>
>> The first set of errors you saw ("Protocol: ~p: register error: ~p~n")
>> indicates that an Erlang node was already running with this name; node 2
>> may have been running in the background without you realizing it.
>>
>> The second error, which occurred when choosing a different name, was
>> probably due to a port binding issue; the ports node 2 tried to bind
>> (handoff, web, pb) were already occupied. Again, node 2 may have already
>> been running in the background.
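>> A quick way to confirm either theory is to ask epmd which node names are
>> still registered and to see which ports the running Erlang VMs are
>> holding, e.g. (the epmd path below is just taken from your dev layout, so
>> adjust as needed):
>>
>>     kratos:dev2 keith$ erts-5.8.2/bin/epmd -names      # node names epmd knows about
>>     kratos:dev2 keith$ lsof -nP -i | grep beam.smp     # ports held by Erlang VMs
>>
>> If "dev2" is listed by epmd while riak-admin says the node is not
>> running, there is a stray VM that needs to be killed before restarting.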
>> After rebooting the machine it looks like starting node 2 was successful.
>> Regarding the ringready failure, can you run "riak-admin status" on all
>> three nodes? Also, can you send in the log files for node 2 (the entire
>> log directory would be great)?
>>
>> Thanks,
>> Dan
>>
>> Daniel Reverri
>> Developer Advocate
>> Basho Technologies, Inc.
>> [email protected]
>>
>> On Fri, Apr 1, 2011 at 1:57 PM, Keith Dreibelbis <[email protected]> wrote:
>>
>>> Hi riak-users,
>>>
>>> I have a node in a cluster of 3 that failed and won't come back up. This
>>> is in a dev environment, so it's not like there's critical data on there.
>>> However, rather than start over with a new install, I want to learn how
>>> to recover from such a failure in production. I figured there was enough
>>> redundancy that node 2 could recover with (at worst) a little help from
>>> nodes 1 and 3.
>>>
>>> When I tried to restart/reboot (I tried both), this showed up in
>>> erlang.log.1:
>>>
>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config
>>> -args_file /Users/keith/src/riak/dev/dev2/etc/vm.args -- console
>>> Root: /Users/keith/src/riak/dev/dev2
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},"Protocol: ~p: register error: ~p~n",
>>> ["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},
>>> {net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},
>>> {net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},crash_report,[[{initial_call,
>>> {net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},
>>> {error_info,{exit,{error,badarg},[{gen_server,init_it,6},
>>> {proc_lib,init_p_do_apply,3}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},
>>> {messages,[]},{links,[#Port<0.138>,<0.17.0>]},{dictionary,[{longnames,true}]},
>>> {trap_exit,true},{status,running},{heap_size,377},{stack_size,24},
>>> {reductions,456}],[]]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>> {local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},
>>> {offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,
>>> [['[email protected]',longnames]]}},{restart_type,permanent},{shutdown,2000},
>>> {child_type,worker}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},supervisor_report,[{supervisor,
>>> {local,kernel_sup}},{errorContext,start_error},{reason,shutdown},
>>> {offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,
>>> start_link,[]}},{restart_type,permanent},{shutdown,infinity},
>>> {child_type,supervisor}]}]}
>>>
>>> {error_logger,{{2011,3,31},{16,43,35}},std_info,[{application,kernel},
>>> {exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
>>>
>>> {"Kernel pid terminated",application_controller,
>>> "{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}}"}
>>>
>>> Crash dump was written to: erl_crash.dump
>>> Kernel pid terminated (application_controller)
>>> ({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
>>>
>>> http://wiki.basho.com/Recovering-a-failed-node.html suggests starting
>>> the node in console mode with a different name. This didn't help; it just
>>> crashed again. I'm using bitcask (the default), while the example on that
>>> page gives output like InnoDB would return.
>>>
>>> kratos:dev2 keith$ bin/riak console -name differentname@nohost
>>> Exec: /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/erlexec -boot
>>> /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded
>>> -config /Users/keith/src/riak/dev/dev2/etc/app.config -args_file
>>> /Users/keith/src/riak/dev/dev2/etc/vm.args -- console -name
>>> differentname@nohost
>>> Root: /Users/keith/src/riak/dev/dev2
>>> Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2]
>>> [async-threads:64] [hipe] [kernel-poll:true]
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> alarm_handler: {set,{system_memory_high_watermark,[]}}
>>> ** Found 0 name clashes in code paths
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> application: riak_core
>>> exited: {shutdown,{riak_core_app,start,[normal,[]]}}
>>> type: permanent
>>>
>>> =INFO REPORT==== 31-Mar-2011::17:35:05 ===
>>> alarm_handler: {clear,system_memory_high_watermark}
>>> {"Kernel pid terminated",application_controller,
>>> "{application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}}"}
>>>
>>> Crash dump was written to: erl_crash.dump
>>> Kernel pid terminated (application_controller)
>>> ({application_start_failure,riak_core,{shutdown,{riak_core_app,start,[normal,[]]}}})
>>> kratos:dev2 keith$
>>>
>>> After I rebooted my machine and started the three riak nodes again,
>>> node 2 is once more not responding to pings, and "riak-admin ringready"
>>> on nodes 1 and 3 complains that node 2 is down. But in the log, node 2
>>> says it's ALIVE. Also, I can see processes for all 3 nodes in ps:
>>>
>>> kratos:~ keith$ ps auxww | grep riak
>>> keith 360 0.2 3.4 2606932 143044 s006 Ss+ 12:05PM 3:21.61
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev1 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev1/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev1/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 580 0.1 2.0 2549924 85492 s008 Ss+ 12:05PM 2:24.08
>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev3 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev3/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev3/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 380 0.0 0.0 2435004 268 ?? S 12:05PM 0:00.08
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/epmd -daemon
>>> keith 358 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>> /Users/keith/src/riak/dev/dev1/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev1// /Users/keith/src/riak/dev/dev1/log
>>> exec /Users/keith/src/riak/dev/dev1/bin/riak console
>>> keith 1633 0.0 0.0 2435548 0 s010 R+ 1:34PM 0:00.00
>>> grep riak
>>> keith 578 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.00
>>> /Users/keith/src/riak/dev/dev3/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev3// /Users/keith/src/riak/dev/dev3/log
>>> exec /Users/keith/src/riak/dev/dev3/bin/riak console
>>> keith 470 0.0 2.0 2548688 83584 s007 Ss+ 12:05PM 0:33.41
>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/beam.smp -K true -A 64 --
>>> -root /Users/keith/src/riak/dev/dev2 -progname riak -- -home /Users/keith --
>>> -boot /Users/keith/src/riak/dev/dev2/releases/0.14.0/riak -embedded -config
>>> /Users/keith/src/riak/dev/dev2/etc/app.config -name [email protected]
>>> riak -- console
>>> keith 468 0.0 0.0 2434988 264 ?? S 12:05PM 0:00.01
>>> /Users/keith/src/riak/dev/dev2/erts-5.8.2/bin/run_erl -daemon
>>> /tmp//Users/keith/src/riak/dev/dev2// /Users/keith/src/riak/dev/dev2/log
>>> exec /Users/keith/src/riak/dev/dev2/bin/riak console
>>> kratos:~ keith$
>>>
>>> I've attached the erl_crash.dump file. Anyone have an explanation or
>>> suggestions on how to proceed?
>>>
>>> Keith
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> [email protected]
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>
>
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
