Hi Matt, here is more information:
1. "There’s a lot of missing information here, like version of the cluster, what client you’re using, what the workload is":

Cluster: Couchbase 2.5
Client: Spymemcached 2.8.4
Workload: The tool creates 200 threads that connect to the Couchbase cluster. Each thread sets a key and immediately gets the same key to check it. If that succeeds, it continues with the next key. If it fails, it retries the Set/Get and skips the key after 5 failed attempts. I see the drop in throughput in the Couchbase Web Console at http://ip:8091/

2. I rewrote the test as a simple loop as an example, and it fails too. Normally I get 300-400 ops, but when a server is down it only serves 20-30 ops. My code:

    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.BinaryConnectionFactory;
    import net.spy.memcached.MemcachedClient;

    // "value" is the payload to store; it is defined elsewhere in the tool.
    try {
        MemcachedClient c = new MemcachedClient(
                new BinaryConnectionFactory(),
                AddrUtil.getAddresses("10.0.0.20:11234 10.0.0.23:11234 10.0.0.24:11234"));
        for (int i = 0; i < 3000; i++) {
            String ini_key = "test_key";
            String key = ini_key + i;
            Future<Object> f = null;
            try {
                // Set the key, then read it back and wait up to 5 seconds.
                c.set(key, 0, value);
                f = c.asyncGet(key);
                Object result = f.get(5, TimeUnit.SECONDS);
                boolean check = f.isDone();
                if (check) {
                    System.out.println(key + " " + check);
                }
            } catch (Exception e) {
                e.printStackTrace();
                // f is still null if set() or asyncGet() threw.
                if (f != null) {
                    f.cancel(false);
                }
            }
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }

This is the log output. In this case I stopped the Couchbase service on server 10.0.0.28. I expected connection problems only for that server, but the log shows connection errors for all servers in the cluster:

2014-05-03 23:40:42.241 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl: Error: Internal error
2014-05-03 23:40:42.242 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-05-03 23:40:42.242 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-05-03 23:40:42.242 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804
2014-05-03 23:40:42.242 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 10958 Key: test_key5478
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException: Cancelled
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:177)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
    at toolcb.go(toolcb.java:45)
    at toolcb.main(toolcb.java:14)
Caused by: java.util.concurrent.CancellationException: Cancelled
    ... 4 more
test_key5479 true
test_key5480 true
test_key5481 true
test_key5482 true
test_key5483 true
test_key5484 true
test_key5485 true
test_key5486 true
test_key5487 true
2014-05-03 23:40:42.318 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl: Error: Internal error
2014-05-03 23:40:42.319 INFO net.spy.memcached.MemcachedConnection: Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.20:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-05-03 23:40:42.320 WARN net.spy.memcached.MemcachedConnection: Closing, and reopening {QA sa=/10.0.0.20:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-05-03 23:40:42.320 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804
2014-05-03 23:40:42.320 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 10978 Key: test_key5488
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException: Cancelled
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:177)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
    at toolcb.go(toolcb.java:45)
    at toolcb.main(toolcb.java:14)
Caused by: java.util.concurrent.CancellationException: Cancelled
    ... 4 more
test_key5489 true

On Friday, 2 May 2014 23:41:57 UTC+7, Matt Ingenthron wrote:
>
> Hi Phuc,
>
> From: Phuc Huu <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, May 2, 2014 at 3:03 AM
> To: "[email protected]" <[email protected]>
> Subject: How does Couchbase clucter response when one nodes down???
>
> I'm testing Couchbase Server 2.5. I have a cluster with 7 nodes and 3
> replicates. In normal condition, the system works fine.
>
> But I failed with this test case: Couchbase cluster's serving 40.000 ops
> and I stop couchbase service on one server => one node down. After that,
> entire cluster's performance is decreased painfully. It only can server
> below 1.000 ops. When I click fail-over then entire cluster return healthy.
>
>
> There’s a lot of missing information here, like version of the cluster,
> what client you’re using, what the workload is.
>
> Based on what you say, I suspect your test is just running in a tight
> loop with random keys. Before the failover, this means it will try to open
> the connection and it will wait some time to try to get a response from
> that node. The configuration is telling the client that the node should be
> part of the cluster. That additional latency inserted into the tight loop
> would give you the drop in throughput.
>
> Consider refactoring your test so the workload generation is constant
> and you should see a drop in throughput commensurate with the small
> reduction in nodes.
>
> Or, if you want to more simply test this theory, just pick a few random
> keys and hit each in their own tight loop. When you stop the service on
> the one node, those should maintain the same throughput.
>
>
> Is this right behavior that Couchbase cluster response when one nodes
> down??? Couchbase cluster will lose nearly all performance until i
> fail-over.
>
>
> I’m highly confident that Couchbase works correctly here. I think
> you’re seeing the drop in throughput from your workload generator hitting
> timeouts.
>
> Hope that helps,
>
> Matt
>
> --
> Matt Ingenthron
> Couchbase, Inc.
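
For reference, the "constant workload generation" idea from the quoted reply could look roughly like the sketch below. It is only an illustration under assumptions, not code from either mail: the class name `ConstantRateLoad`, the `Op` hook, and all parameters are hypothetical, and the real Set/Get against the cluster would go inside `Op.run`. A ticker starts operations at a fixed rate and hands them to a worker pool, so one operation that waits on a dead node does not delay the start of the next ones the way it does in a tight loop.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ConstantRateLoad {

    /** Success and failure counts for one run. */
    public static final class Result {
        public final int succeeded;
        public final int failed;

        Result(int succeeded, int failed) {
            this.succeeded = succeeded;
            this.failed = failed;
        }
    }

    /** Hypothetical hook: one Set/Get round trip for key index i; true means success. */
    public interface Op {
        boolean run(int i) throws Exception;
    }

    /**
     * Starts totalOps operations, one every periodMicros microseconds,
     * and runs them on a pool of the given size. Returns once all have finished.
     */
    public static Result run(int totalOps, long periodMicros, int workers, Op op)
            throws InterruptedException {
        ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger next = new AtomicInteger();
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger bad = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(totalOps);

        ticker.scheduleAtFixedRate(() -> {
            int i = next.getAndIncrement();
            if (i >= totalOps) {
                return; // everything has been issued; wait for shutdown
            }
            // Hand the operation to the pool so the ticker never blocks on it.
            pool.execute(() -> {
                try {
                    if (op.run(i)) {
                        ok.incrementAndGet();
                    } else {
                        bad.incrementAndGet();
                    }
                } catch (Exception e) {
                    bad.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }, 0, periodMicros, TimeUnit.MICROSECONDS);

        done.await();
        ticker.shutdownNow();
        pool.shutdown();
        return new Result(ok.get(), bad.get());
    }

    public static void main(String[] args) throws Exception {
        // Stand-in operation: every 10th key "fails", imitating keys owned by a stopped node.
        Result r = run(100, 1000, 8, i -> i % 10 != 0);
        System.out.println("ok=" + r.succeeded + " failed=" + r.failed); // prints ok=90 failed=10
    }
}
```

One caveat of this sketch: the pool has a fixed size, so if enough operations block on timeouts at once, later ones queue up behind them. A larger pool or a shorter per-operation timeout would be needed for the issue rate to stay constant in practice.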
