Hi,

I have a 12-node Riak cluster running Riak 0.14.2. Several nodes crashed with OOM errors, and after restarting them I see the following when running riak-admin transfers:

  Attempting to restart script through sudo -u riak
  '[email protected]' waiting to handoff 1 partitions
  '[email protected]' does not have 1 primary partitions running
  '[email protected]' waiting to handoff 1 partitions
  '[email protected]' does not have 1 primary partitions running
  '[email protected]' waiting to handoff 1 partitions
  '[email protected]' does not have 1 primary partitions running

The only errors in the whole cluster are two errors on 10.5.10.30, both of the form:

  =ERROR REPORT==== 15-Feb-2012::17:49:38 ===
  Handoff receiver for partition 1044745311060762632934665329637030439832170528768 exiting abnormally after processing 7 objects: {timeout, {gen_fsm, sync_send_all_state_event, [<0.1299.1>, {handoff_data, <<141,144,203,78,2,49,20,134,207,12,76,4,77,12,49,38,38,174,221,54,25,102,64,134,189,151,141,168,33,6,217,145,127,218,142,157,138,29,3,229,9,120,3,119,190,133,91,151,238,92,243,68,182,74,162,184,162,77,78,206,245,59,253,187,27,204,15,78,178,110,154,138,62,151,44,73,58,49,139,139,172,199,210,12,9,227,167,73,12,153,246,179,162,43,143,95,107,75,181,35,168,49,155,84,185,150,220,62,17,81,48,247,118,171,249,169,111,87,53,65,205,217,132,87,198,74,99,85,83,80,93,148,220,34,68,203,221,6,110,17,171,150,254,119,84,240,55,247,205,241,230,200,175,222,27,179,97,137,71,54,186,195,3,122,24,59,66,31,6,23,232,224,25,9,46,113,239,162,129,243,135,46,51,69,14,141,54,82,156,109,144,2,79,58,92,147,174,48,183,108,80,137,178,40,165,80,181,156,40,106,231,20,209,75,78,225,232,195,141,249,230,214,242,71,78,136,243,252,230,154,175,214,48,21,250,98,157,150,246,221,149,2,19,209,74,191,125,238,235,95,161,193,246,66,55,159,40,40,226,83,9,227,64,118,182,144,90,187,143,92,24,33,139,210,72,241,5>>},60000]}}

  =ERROR REPORT==== 15-Feb-2012::17:49:41 ===
  Handoff receiver for partition 1044745311060762632934665329637030439832170528768 exiting abnormally after processing 7 objects: {timeout, {gen_fsm, sync_send_all_state_event, [<0.1299.1>, {handoff_data, <<141,144,75,78,195,48,16,134,39,105,2,41,72,168,2,36,36,214,108,88,88,74,232,35,112,128,34,22,45,32,132,16,130,69,245,59,118,112,210,226,208,36,101,193,182,27,14,193,33,184,0,123,142,133,13,149,160,172,234,209,140,172,121,124,227,223,27,78,181,125,16,71,224,113,24,167,44,13,123,156,197,221,78,155,161,23,113,150,182,79,142,142,35,36,157,94,39,222,127,107,204,213,186,160,160,28,21,60,151,73,253,72,68,78,101,227,74,243,19,219,174,26,130,154,229,40,41,116,45,117,173,154,130,60,145,37,53,92,180,140,5,184,68,168,90,249,191,163,156,191,185,111,142,13,123,118,245,230,45,187,202,48,102,55,215,120,64,23,5,198,198,43,115,215,40,241,132,33,6,120,49,126,10,133,115,72,244,49,53,89,99,75,36,199,146,118,23,164,1,170,154,13,11,145,165,153,20,170,193,137,252,136,147,79,103,156,214,158,61,51,102,155,119,230,63,114,92,83,110,222,241,139,254,97,176,224,41,215,214,61,239,254,35,84,46,28,237,211,107,254,254,185,149,255,106,117,86,215,186,252,74,65,126,50,145,208,6,84,151,51,153,231,230,47,103,90,200,52,211,82,124,1>>},60000]}}

I tried cycling through and restarting all nodes, which seemed to temporarily fix this particular node, but then I think this error cropped up. If there's anything I can try or more information I can give, let me know.

The boxes have 16 cores and 24 GB of memory, with data in Bitcask on an SSD. There are 1024 partitions spread across the 12 machines. Each machine does roughly 55-120K vnode gets per second, 20-40K node gets per second, 1-2K vnode puts, and 1-2K node puts.

Thanks for the help,

-Anthony

--
------------------------------------------------------------------------
Anthony Molinaro <[email protected]>

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
