On Thu, Apr 24, 2014 at 1:09 AM, Niels Boldt <[email protected]> wrote:
> Hi
>
> I'm running a couchbase 2.5 cluster with 5 machines on aws
>
> I have several times seen issues with rebalancing the cluster while the
> cluster is under load. Normally I add more nodes to handle more load.
>
> What typically happens is that the cluster becomes unstable and the
> rebalance fails.
>
> Then a node just goes down, and a little while later it comes up again.
> Normally I get messages like
>
> Server error during processing: ["web request failed",
> {path,"/pools/default"},
> {type,exit},
> {what,
> {{{badmatch,undefined},
> [{ns_storage_conf,
>
> in the log
>
> The node just keeps on being unstable: it comes up again, the bucket gets
> loaded, and then a little while later this happens again and again. This
> behaviour continues even if I remove all the load from the cluster and
> don't attempt new rebalances. So basically I have a cluster that isn't
> handling any requests, but where a node continuously goes down.
>
> What also seemed odd to me was that almost all memory was reported as used
> in the UI (> 95%). I tried removing a bucket, but memory usage just creeps
> back up to > 95% again.
>
> I have run cbcollect on the nodes, and the files are here:
>
>
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-33-38-143.collect.cb.zip
>
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-74-20-120.collect.cb.zip
>
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-74-23-89.collect.cb.zip
>
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-77-0-111.collect.cb.zip
>
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-78-1-25.collect.cb.zip
>
> The node
> https://s3-eu-west-1.amazonaws.com/tmp-idtargeting-public/couchbase/ip-10-33-38-143.collect.cb.zip
> is the one that just kept going down, and I had to remove it from the
> cluster.
>
> I have seen previous posts on this mailing list mentioning that large
> documents could cause malloc problems, which in turn could cause the
> behaviour I have seen. But the problem is that I don't know what to look
> for in the files above.
>
>
I'm seeing that node .45 had tons of problems up until 20:00 Apr 16. It was
constantly being killed by the kernel's OOM killer. I'm not sure what caused
it, but it's now clearly less severe. My prime suspect is something within
XDCR. Right now I'm seeing that the cluster does not have nodes that
constantly go down, but I do see frequent timeouts talking to node .45, and
it still has a lot of swap usage.
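
If you want to confirm those OOM kills on the node yourself, something along
these lines should surface the relevant kernel messages. This is only a
minimal sketch, assuming a Linux node where the kernel ring buffer is
readable via dmesg; it is not a Couchbase tool.

    #!/usr/bin/env python
    # Minimal sketch (assumes Linux and a readable `dmesg`): list kernel log
    # lines that mention the OOM killer, so you can see which process was
    # killed and roughly when.
    import subprocess

    def oom_events():
        """Return kernel log lines that mention the OOM killer."""
        output = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
        return [line for line in output.splitlines()
                if "oom-killer" in line or "Out of memory" in line]

    if __name__ == "__main__":
        for line in oom_events():
            print(line)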
My recommendations:
* disable XDCR for now
* reboot node .45, or at least restart memcached on it
* get your cluster back into shape, with more nodes if needed
* try re-enabling XDCR and observe the memory usage of the beam.smp
  processes (a quick way to watch that is sketched below)
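
As for watching memory, here is a minimal sketch, again assuming a Linux node
and the standard /proc layout (nothing Couchbase-specific): it reports
resident and swapped memory for the beam.smp and memcached processes, which
is enough to see whether the Erlang side (where XDCR runs) is the one that
grows after you re-enable XDCR.

    #!/usr/bin/env python
    # Minimal sketch (assumes Linux /proc): print RSS and swap usage for the
    # beam.smp and memcached processes.
    import os

    WATCHED = ("beam.smp", "memcached")

    def read_status(pid):
        """Return the fields of /proc/<pid>/status as a dict, or None if the
        process disappeared while we were reading it."""
        try:
            with open("/proc/%s/status" % pid) as f:
                fields = {}
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
                return fields
        except (IOError, OSError):
            return None

    def main():
        for pid in sorted(p for p in os.listdir("/proc") if p.isdigit()):
            status = read_status(pid)
            if status is None or status.get("Name") not in WATCHED:
                continue
            print("%s (pid %s): RSS=%s, Swap=%s" % (
                status.get("Name"), pid,
                status.get("VmRSS", "n/a"),
                status.get("VmSwap", "n/a")))

    if __name__ == "__main__":
        main()

Run it (or an equivalent like top) periodically on node .45 and compare the
numbers before and after XDCR is re-enabled.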