Gabriel,

We received information about this incident yesterday and have been discussing it internally. Thank you for providing such a detailed diagnosis. There must be something about the default httpchk headers that triggers a bug.
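For anyone who wants to compare the two kinds of check request without HAProxy in the path, here is a minimal Python sketch. It assumes the original check was a header-less HTTP/1.0 GET of /ping (the actual HAProxy configuration is in the attachments and is not reproduced here), and it takes the host name, port, and header values from the custom httpchk string quoted below, sending a single Host header. The function names are hypothetical helpers, not part of Riak or HAProxy.

```python
import socket


def build_check_request(method, uri, version="HTTP/1.0", headers=None):
    """Return the raw bytes of a bare, health-check style HTTP request."""
    lines = ["%s %s %s" % (method, uri, version)]
    for name, value in (headers or []):
        lines.append("%s: %s" % (name, value))
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_check(host, port, request, timeout=5.0):
    """Send one raw request and return the status line of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        reply = sock.recv(4096)
    return reply.split(b"\r\n", 1)[0].decode("ascii", "replace")


# Assumption: the problematic check was HTTP/1.0 with no headers at all.
BARE_CHECK = build_check_request("GET", "/ping")

# Mirrors the custom httpchk string from the report, with one Host header.
CUSTOM_CHECK = build_check_request(
    "GET",
    "/ping",
    "HTTP/1.1",
    [
        ("Host", "riak:8098"),
        ("User-Agent", "curl/7.22.0"),
        ("Accept", "*/*"),
    ],
)
```

Calling `send_check("riak", 8098, BARE_CHECK)` and `send_check("riak", 8098, CUSTOM_CHECK)` in a loop against a test node, while watching riak-admin top, may help show whether the request shape alone is enough to trigger the growth.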
I'll follow up if we need more information. In the meantime, I'm glad your custom httpchk string fixed your issue.

--
Luke Bakken
CSE
[email protected]

On Thu, Apr 3, 2014 at 3:17 PM, Gabriel Littman <[email protected]> wrote:
> Hi All,
>
> I once again return to the bottomless pool of knowledge that is the
> riak mailing list. :)
>
> Recently we've started work to isolate our Riak cluster. (We
> currently host other services on our Riak machines.) Part of that
> work is to put Riak behind HAProxy and have our other services access
> Riak through that. Then on Tuesday Riak started to eat up all of the
> CPU and memory, causing it to swap (yes, swap is still on; we plan to
> turn it off after we isolate Riak) and the load to shoot up. It took
> a while to figure out what was going on, but it turns out that what
> had changed was HAProxy. When we turned off the load balancer and
> restarted Riak it would stay happy, but as soon as we turned it back
> on (with the 'check' option) CPU and memory would slowly creep up
> until the machine was unusable.
>
> It's very confusing since we have already done this work in our
> staging cluster and not had any of these problems. We are going to do
> a more thorough analysis of differences in hardware and configurations,
> but we are pretty good about packaging and deploying important
> settings in a standard way.
>
> Also, we were not able to reproduce this except by using HAProxy.
> When we created a script to load up /ping, Riak handled it just fine.
> When we looked deeper and sniffed the network, my coworkers noticed
> that the curl and HAProxy requests were slightly different. When we
> added some header info to the HAProxy check and used HTTP 1.1 instead
> of 1.0, it seems to not have the same effect on Riak.
>
> option httpchk GET /ping HTTP/1.1\r\nHost:\ riak\r\nUser-Agent:\ curl/7.22.0\r\nHost:\ riak:8098\r\nAccept:\ */*\r\n
>
> I've attached a bunch of logs and configurations.
> Any advice or insights would be much appreciated.
>
> Thanks,
>
> Gabe
>
>
> More Info:
> riak 1.4.1
> 5 nodes
> nval 2
> 256 partitions
>
> ubuntu 12.0.4
> 12 core system
> 32g mem
> 32g swap
>
>
> Some suspicious looking riak-admin top entries (don't really know what
> they mean):
> <6201.16207.0> proc_lib:init_p/5 '-' 3211097921 4455251976 62 riak_kv_index_fsm:update_buffer/3
> <6201.1908.0>  proc_lib:init_p/5 '-' 190425738 88592 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
>
> <6201.15965.0> proc_lib:init_p/5 '-' 4156696872 3564395400 63 sms:'-values/1-lc$^0/1-0-'/1
> <6201.23152.0> proc_lib:init_p/5 '-' 3400754909 1460972816 99 orddict:update/3
> <6201.1913.0>  proc_lib:init_p/5 '-' 152610162 88448 0 gen_server:loop/6
> <6201.1654.0>  proc_lib:init_p/5 '-' 125490325 55176 0 riak_kv_vnode:'-result_fun_ack/2-fun-0-'
> <6201.1450.0>  proc_lib:init_p/5 '-' 124185894 34504 0 riak_kv_vnode:'-result
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
