Re: How does Couchbase clucter response when one nodes down???

Phuc Huu Sat, 03 May 2014 11:37:30 -0700

Hi Matt,

Here is more information:

1. There’s a lot of missing information here, like version of the cluster, 
what client you’re using, what the workload is:

Cluster: Couchbase 2.5
Client: Spymemcached 2.8.4

The tool creates 200 threads that connect to the Couchbase cluster. Each 
thread sets a key and immediately gets it back to verify the write. On 
success it moves on to the next key. On failure it retries the Set/Get, and 
skips the key after 5 failed attempts.
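
The per-key logic described above can be sketched roughly as follows. This is only an illustration, not the actual tool: the KeyValueStore interface, the RetrySketch class and the setAndVerify method are hypothetical names, and the in-memory map stands in for the real cluster so that the example runs on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RetrySketch {
    /** Hypothetical stand-in for the real client; not a spymemcached type. */
    public interface KeyValueStore {
        void set(String key, String value) throws Exception;
        String get(String key) throws Exception;
    }

    /** The tool gives up on a key after this many failed Set/Get attempts. */
    static final int MAX_ATTEMPTS = 5;

    /**
     * Sets the key, reads it back, and retries on failure.
     * Returns the number of attempts used, or -1 if the key was skipped.
     */
    public static int setAndVerify(KeyValueStore store, String key, String value) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                store.set(key, value);
                if (value.equals(store.get(key))) {
                    return attempt; // verified, move on to the next key
                }
            } catch (Exception e) {
                // Timeout or server error: count it as a failed attempt.
            }
        }
        return -1; // skip this key after MAX_ATTEMPTS failures
    }

    public static void main(String[] args) {
        // In-memory fake so the sketch is self-contained.
        Map<String, String> data = new ConcurrentHashMap<>();
        KeyValueStore inMemory = new KeyValueStore() {
            public void set(String k, String v) { data.put(k, v); }
            public String get(String k) { return data.get(k); }
        };
        System.out.println(setAndVerify(inMemory, "test_key0", "v"));
    }
}
```

Note that in this kind of loop every failed attempt blocks the thread until the operation fails or times out, so a key that maps to an unreachable node can cost up to 5 timeouts before the thread moves on.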

I can see the throughput drop in the Couchbase Web Console (http://ip:8091/).

2. I rewrote the loop as a minimal example, and it fails in the same way. 
Normally I get 300-400 ops, but when a server is down it only serves 20-30 ops.

My code:
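        // Imports needed by this snippet (spymemcached 2.8.4).
        import java.util.concurrent.Future;
        import java.util.concurrent.TimeUnit;
        import net.spy.memcached.AddrUtil;
        import net.spy.memcached.BinaryConnectionFactory;
        import net.spy.memcached.MemcachedClient;

        // "value" is the payload object; it is defined elsewhere in the tool (not shown).
        try {
            MemcachedClient c = new MemcachedClient(
                    new BinaryConnectionFactory(),
                    AddrUtil.getAddresses(
                            "10.0.0.20:11234 10.0.0.23:11234 10.0.0.24:11234"));

            for (int i = 0; i < 3000; i++) {
                String ini_key = "test_key";
                String key = ini_key + i;
                Future<Object> f = null;
                try {
                    // Store the value, then read it back right away.
                    c.set(key, 0, value);
                    f = c.asyncGet(key);

                    // Wait up to 5 seconds for the read to complete.
                    Object result = f.get(5, TimeUnit.SECONDS);
                    boolean check = f.isDone();

                    if (check) {
                        System.out.println(key + " " + check);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                    // f is still null if set() or asyncGet() threw.
                    if (f != null) {
                        f.cancel(false);
                    }
                }

            }

        } catch (Exception ex) {
            ex.printStackTrace();
        }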

This is the log output. In this case I stopped the Couchbase service on 
server 10.0.0.28. I would expect connection problems only for that server, 
but the log shows connection errors for every server in the cluster:

2014-05-03 23:40:42.241 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl:  Error:  Internal error
2014-05-03 23:40:42.242 INFO net.spy.memcached.MemcachedConnection:  Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-05-03 23:40:42.242 WARN net.spy.memcached.MemcachedConnection:  Closing, and reopening {QA sa=/10.0.0.24:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-05-03 23:40:42.242 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 1 Opaque: 10957 Key: test_key5478 Cas: 0 Exp: 0 Flags: 0 Data Length: 804
2014-05-03 23:40:42.242 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 0 Opaque: 10958 Key: test_key5478
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException: Cancelled
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:177)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
    at toolcb.go(toolcb.java:45)
    at toolcb.main(toolcb.java:14)
Caused by: java.util.concurrent.CancellationException: Cancelled
    ... 4 more
test_key5479 true
test_key5480 true
test_key5481 true
test_key5482 true
test_key5483 true
test_key5484 true
test_key5485 true
test_key5486 true
test_key5487 true
2014-05-03 23:40:42.318 ERROR net.spy.memcached.protocol.binary.StoreOperationImpl:  Error:  Internal error
2014-05-03 23:40:42.319 INFO net.spy.memcached.MemcachedConnection:  Reconnection due to exception handling a memcached operation on {QA sa=/10.0.0.20:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}. This may be due to an authentication failure.
OperationException: SERVER: Internal error
    at net.spy.memcached.protocol.BaseOperationImpl.handleError(BaseOperationImpl.java:192)
    at net.spy.memcached.protocol.binary.OperationImpl.getStatusForErrorCode(OperationImpl.java:244)
    at net.spy.memcached.protocol.binary.OperationImpl.finishedPayload(OperationImpl.java:201)
    at net.spy.memcached.protocol.binary.OperationImpl.readPayloadFromBuffer(OperationImpl.java:196)
    at net.spy.memcached.protocol.binary.OperationImpl.readFromBuffer(OperationImpl.java:139)
    at net.spy.memcached.MemcachedConnection.readBufferAndLogMetrics(MemcachedConnection.java:825)
    at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:804)
    at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:684)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:647)
    at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:418)
    at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1400)
2014-05-03 23:40:42.320 WARN net.spy.memcached.MemcachedConnection:  Closing, and reopening {QA sa=/10.0.0.20:11234, #Rops=2, #Wops=0, #iq=0, topRop=Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804, topWop=null, toWrite=0, interested=1}, attempt 0.
2014-05-03 23:40:42.320 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 1 Opaque: 10977 Key: test_key5488 Cas: 0 Exp: 0 Flags: 0 Data Length: 804
2014-05-03 23:40:42.320 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl:  Discarding partially completed op: Cmd: 0 Opaque: 10978 Key: test_key5488
java.util.concurrent.ExecutionException: java.util.concurrent.CancellationException: Cancelled
    at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:177)
    at net.spy.memcached.internal.GetFuture.get(GetFuture.java:69)
    at toolcb.go(toolcb.java:45)
    at toolcb.main(toolcb.java:14)
Caused by: java.util.concurrent.CancellationException: Cancelled
    ... 4 more
test_key5489 true

On Friday, 2 May 2014 23:41:57 UTC+7, Matt Ingenthron wrote:
>
>  Hi Phuc,
>
>   From: Phuc Huu <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, May 2, 2014 at 3:03 AM
> To: "[email protected]" <[email protected]>
> Subject: How does Couchbase clucter response when one nodes down???
>  
>   I'm testing Couchbase Server 2.5. I have a cluster with 7 nodes and 3 
> replicas. Under normal conditions, the system works fine.
>
> But it fails this test case: while the Couchbase cluster is serving 40,000 
> ops, I stop the Couchbase service on one server, so one node goes down. After 
> that, the entire cluster's performance drops painfully: it can only serve 
> below 1,000 ops. When I click fail-over, the entire cluster returns to healthy.
>  
>  
>  There’s a lot of missing information here, like version of the cluster, 
> what client you’re using, what the workload is.
>
>  Based on what you say, I suspect your test is just running in a tight 
> loop with random keys.  Before the failover, this means it will try to open 
> the connection and it will wait some time to try to get a response from 
> that node.  The configuration is telling the client that the node should be 
> part of the cluster.  That additional latency inserted into the tight loop 
> would give you the drop in throughput.
>
>  Consider refactoring your test so the workload generation is constant 
> and you should see a drop in throughput commensurate with the small 
> reduction in nodes.
>
>  Or, if you want to more simply test this theory, just pick a few random 
> keys and hit each in their own tight loop.  When you stop the service on 
> the one node, those should maintain the same throughput.
>
>    
> Is this the right behavior for a Couchbase cluster when one node goes 
> down? The cluster loses nearly all of its performance until I 
> fail over.
>  
>  
>  I’m highly confident that Couchbase works correctly here.  I think 
> you’re seeing the drop in throughput from your workload generator hitting 
> timeouts.
>
>  Hope that helps,
>
>  Matt
>  
>  -- 
>  Matt Ingenthron
> Couchbase, Inc.
>    
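
Matt's explanation above (extra latency inserted into a tight loop) can be illustrated with a small back-of-the-envelope model. This is only a sketch under stated assumptions, not a measurement: the 3 ms healthy latency, the 250 ms wait on the stopped node, and the 1/7 share of keys mapping to that node (7 nodes, even key distribution) are all assumed numbers chosen for illustration, and ClosedLoopModel and opsPerSecond are hypothetical names.

```java
public class ClosedLoopModel {
    /**
     * Expected ops/sec for ONE thread that waits for each operation to finish
     * before starting the next (a closed, "tight" loop).
     *
     * okLatencyMs   latency of an operation against a healthy node (assumed)
     * failLatencyMs time spent waiting on the stopped node (assumed)
     * failFraction  share of keys that map to the stopped node (assumed)
     */
    public static double opsPerSecond(double okLatencyMs,
                                      double failLatencyMs,
                                      double failFraction) {
        // Mean time per operation is the weighted average of both cases.
        double meanMs = (1.0 - failFraction) * okLatencyMs
                + failFraction * failLatencyMs;
        return 1000.0 / meanMs;
    }

    public static void main(String[] args) {
        // All inputs below are illustrative assumptions, not measured values.
        double healthy = opsPerSecond(3.0, 3.0, 0.0);
        double degraded = opsPerSecond(3.0, 250.0, 1.0 / 7.0);
        System.out.printf("healthy:  %.0f ops/s%n", healthy);
        System.out.printf("degraded: %.0f ops/s%n", degraded);
    }
}
```

With these assumed inputs the model gives roughly 333 ops/s when all nodes respond and roughly 26 ops/s with one node stopped. Only about 14% of the operations are slow, but the thread spends most of its wall-clock time waiting on them, which is why the loop's throughput falls far more than the loss of one node out of seven would suggest.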
