Hi Josh,

Sorry for not getting back sooner.

I am not entirely sure what is going on with your handoffs.  It could be that 
you have overloaded Solr with handoff activity, which is causing vnodes to 
become unresponsive.  We are actively working on a fix that allows vnodes to 
continue their work even if Solr is taking its time with ingest.  The fix also 
includes batching, which aggregates insert (and delete) operations into Solr 
to smooth out some of the bumps while being a better Solr citizen.
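
In the meantime, you can at least see how much handoff activity is in flight 
and throttle it if need be.  Roughly (the exact output varies by version):

    riak-admin transfers          # active and queued handoffs
    riak-admin transfer-limit     # show the per-node concurrent transfer limit
    riak-admin transfer-limit 1   # lower it cluster-wide if Solr cannot keep up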

The hundreds of close_trees entries you see seem to be because your YZ AAE 
trees need to be rebuilt.  Could it be that you have hit the magic 7-day grace 
period on AAE tree expiry?  The index failures you see in the logs appear to 
be because the yz_entropy_mgr has been shut down.  You are seeing this during 
a riak stop, correct?  Is the system under high indexing load at the time?  
That could account for the log messages, as index operations may still be 
coming in while Yokozuna is being shut down.
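
If you want to confirm the expiry theory, the AAE status commands should show 
when each tree was last built and exchanged.  Something along these lines (the 
riak.conf key below is the KV setting, one week by default; I believe the 
search trees follow the same expiry, but double-check for your version):

    riak-admin search aae-status    # Yokozuna (search) AAE trees
    riak-admin aae-status           # KV AAE trees, for comparison

    # in riak.conf, the expiry behind that grace period:
    #   anti_entropy.tree.expiry = 1w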

Regarding ring resize, please have a look at 
https://github.com/basho/yokozuna/issues/279.  I do not believe those issues 
have been rectified, so the official line is what you see in the 
documentation.  You can, of course, reindex your data after a ring resize, but 
that is not acceptable in a production scenario if you have an SLA around 
search availability.

Hope that helps, and let us know if you have any more information about what 
might be consuming CPU on your nodes.  I would keep a close eye on the vnode 
queue lengths in the Riak stats (riak_kv_vnodeq_(min|mean|median|max|total)).  
If your vnode queues start getting deep, then vnodes are likely being blocked 
by Solr.
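
A quick way to watch those, assuming the default HTTP listener on 8098 (the 
grep patterns are only illustrative):

    riak-admin status | grep riak_kv_vnodeq

    curl -s http://localhost:8098/stats | python -m json.tool | grep vnodeq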

-Fred

> On Jan 6, 2016, at 1:03 PM, Josh Yudaken <j...@smyte.com> wrote:
> 
> Hi Luke,
> 
> We're planning on having a rather large cluster soon, which was
> the reason for the large ring size. Your documentation indicates ring
> resize is *not* possible with Search 2.0 [1], although an issue I
> found on GitHub indicated it might be now? [2]
> 
> If the situation is resolved we might be open to resizing our ring
> now, but given the trouble we're seeing with normal handoffs that
> seems like a bad idea. Is the 4x ring size expected to completely
> break Riak like we're seeing in production, or just a bit of extra
> strain/latency?
> 
> I've been through the tuning list multiple times, and haven't seen any
> changes. I migrated the machine seeing issues to a new host, and now
> the new host is seeing similar problems. Here's a screenshot of `htop`
> just before I stopped the node in order to bring our site back up [3].
> 
> Regards,
> Josh
> 
> [1] 
> http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/#Feature-Incompatibility
> [2] https://github.com/basho/basho_docs/issues/1742
> [3] https://slack-files.com/T031MU137-F0HRU4E94-c3ab1e776e
> 
> On Wed, Jan 6, 2016 at 6:39 AM, Luke Bakken <lbak...@basho.com> wrote:
>> Hi Josh,
>> 
>> 1024 is too large of a ring size for 10 nodes. If it's possible to
>> rebuild your cluster using a ring size of 128 or 256 that would be
>> ideal 
>> (http://docs.basho.com/riak/latest/ops/building/planning/cluster/#Ring-Size-Number-of-Partitions).
>> Ring resizing is possible as well
>> (http://docs.basho.com/riak/latest/ops/advanced/ring-resizing/).
>> 
>> Have all of our recommended performance tunings been applied to every
>> node in this cluster?
>> (http://docs.basho.com/riak/latest/ops/tuning/linux/) - these can have
>> a dramatic effect on cluster performance.
>> 
>> --
>> Luke Bakken
>> Engineer
>> lbak...@basho.com
>> 
>> On Tue, Jan 5, 2016 at 10:52 AM, Josh Yudaken <j...@smyte.com> wrote:
>>> Hi,
>>> 
>>> We're attempting to use Riak as our primary key-value and search
>>> database for an analytics-type solution for blocking spam/fraud.
>>> 
>>> As we expect to eventually be handling a huge amount of data, I
>>> started with a ring size of 1024. We currently have 10 nodes on Google
>>> Cloud n1-standard-16 instances [ 16 cores, 60gb RAM, 720gb local ssd.
>>> ]. Disks are at about 60% usage [ roughly 175gb leveldb, 16gb yz, 45gb
>>> anti_entropy, 6gb yz_anti_entropy ], and request-wise we're at about
>>> 20k/min get, 4k/min set. Load average is usually around 6.
>>> 
>>> I'm assuming most of the issues we're seeing are Yokozuna related, but
>>> we're seeing a ton of tcp timeouts during handoffs, very slow get/set
>>> queries, and a slew of other errors.
>>> 
>>> Right now I'm trying to debug an issue where one of the 10 nodes
>>> pegged all the CPU cores, mostly in the `beam` process.
>>> 
>>> # riak-admin top
>>> Output server crashed: connection_lost
>>> 
>>> With few other options (as it was causing slow queries across the
>>> cluster) I stopped the server and saw hundreds of the following
>>> (interesting) messages in the log:
>>> 
>>> 2016-01-05 18:28:28.573 [info]
>>> <0.4958.0>@yz_index_hashtree:close_trees:557 Deliberately marking YZ
>>> hashtree {1458647141945490998441568260777384029383167049728,3} for
>>> full rebuild on next restart
>>> 
>>> As well as a ton of (I think related?):
>>> 2016-01-05 18:28:31.153 [error] <0.5982.0>@yz_kv:index_internal:237
>>> failed to index object
>>> {{<<"features">>,<<"features">>},<<"0NKqMtj3O6_">>} with error
>>> {noproc,{gen_server,call,[yz_entropy_mgr,{get_tree,1120389438774178506630754486017853682060456099840},infinity]}}
>>> because 
>>> [{gen_server,call,3,[{file,"gen_server.erl"},{line,188}]},{yz_kv,get_and_set_tree,1,[{file,"src/yz_kv.erl"},{line,452}]},{yz_kv,update_hashtree,4,[{file,"src/yz_kv.erl"},{line,340}]},{yz_kv,index,7,[{file,"src/yz_kv.erl"},{line,295}]},{yz_kv,index_internal,5,[{file,"src/yz_kv.erl"},{line,224}]},{riak_kv_vnode,actual_put,6,[{file,"src/riak_kv_vnode.erl"},{line,1619}]},{riak_kv_vnode,perform_put,3,[{file,"src/riak_kv_vnode.erl"},{line,1607}]},{riak_kv_vnode,do_put,7,[{file,"src/riak_kv_vnode.erl"},{line,1398}]}]
>>> 
>>> For reference the TCP timeout error looks like:
>>> 
>>> 2016-01-01 01:09:50.522 [error]
>>> <0.8430.6>@riak_core_handoff_sender:start_fold:272 hinted transfer of
>>> riak_kv_vnode from 'riak@riak25-2.c.authbox-api.internal'
>>> 185542200051774784537577176028434367729757061120 to
>>> 'riak@riak27-2.c.authbox-api.internal'
>>> 185542200051774784537577176028434367729757061120 failed because of TCP
>>> recv timeout
>>> 
>>> Any suggestions about where to look?
>>> 
>>> Regards,
>>> Josh
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
