[ https://issues.apache.org/jira/browse/CASSANDRA-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akhtar Hussain updated CASSANDRA-8352:
--------------------------------------
    Since Version: 2.0.3
      Description: 
We have a Geo-red setup with 2 data centers of 3 nodes each. When we bring 
down a single Cassandra node in DC2 with kill -9 <Cassandra-pid>, reads fail 
on DC1 with TimedOutException for a brief period (roughly 15-20 seconds). 

Questions:
1.      Why do reads fail on DC1 when a node in another DC (DC2) fails? Since 
we use LOCAL_QUORUM for both reads and writes in DC1, a request should return 
once 2 nodes in the local DC have replied, rather than timing out because of a 
node in the remote DC.
2.      We want to make sure that no Cassandra requests fail when a node 
fails. We tried the rapid read protection settings ALWAYS, 99percentile and 
10ms, as described in 
http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2, but 
none of them helped. How can we ensure zero request failures when a node fails?
3.      What is the right way to handle HTimedOutException in Hector?
4.      Please confirm whether we are using public and private hostnames as 
expected.

We are using Cassandra 2.0.3.
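
For question 3, one common client-side approach is a bounded retry with 
backoff around the read. The sketch below is only an illustration of that 
idea, not Hector's own mechanism: RetryOnTimeout, TimedOut and callWithRetry 
are hypothetical names, and TimedOut stands in for HTimedOutException so the 
example compiles without Hector on the classpath. It uses Java 8 lambdas, 
while the traces in this report come from Java 1.7.

```java
import java.util.function.Supplier;

public final class RetryOnTimeout {

    /** Hypothetical stand-in for Hector's HTimedOutException. */
    public static class TimedOut extends RuntimeException {
        public TimedOut(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying only on TimedOut, with exponential backoff.
     * Rethrows the last TimedOut once maxAttempts calls have failed.
     */
    public static <T> T callWithRetry(Supplier<T> op, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (TimedOut e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and let the caller decide
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    throw e;
                }
                backoff *= 2; // wait twice as long before the next attempt
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated read that times out twice and then succeeds.
        String row = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new TimedOut("simulated read timeout");
            }
            return "row-data";
        }, 5, 10L);
        System.out.println(row + " after " + calls[0] + " attempts");
        // prints "row-data after 3 attempts"
    }
}
```

Retrying like this is only safe for operations that can be repeated without 
side effects, such as the reads described here. The attempt count and backoff 
values are placeholders and would need tuning against the observed 15-20 
second window.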



      Environment: Unix, Cassandra 2.0.3
           Labels: DataCenter GEO-Red  (was: )
          Summary: Strange problem regarding Cassandra nodes  (was: trange 
problem regarding Cassandra)

Exception in Application Logs:
2014-11-20 15:36:50.653 WARN  m.p.c.connection.HConnectionManager - Exception: me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
                at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:286) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:269) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:132) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:290) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:101) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at com.ericsson.rm.cassandra.xa.keyspace.row.KeyedRowQuery.execute(KeyedRowQuery.java:77) [com.ericsson.bss.common.cassandra.xa_3.4.12.jar:na]
                at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.getRow(CassandraPersistence.java:765) [com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
                at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.deleteVoucher(CassandraPersistence.java:400) [com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
                at com.ericsson.rm.voucher.traffic.VoucherTraffic.commit(VoucherTraffic.java:647) [com.ericsson.bss.voucher.traffic_4.7.11.jar:na]
                at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.callCommit(VoucherTrafficDeproxy.java:448) [com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
                at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.call(VoucherTrafficDeproxy.java:312) [com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
                at com.ericsson.rm.cluster.router.jgroups.destination.RouterDestination$RouterMessageTask.run(RouterDestination.java:333) [com.ericsson.bss.common.cluster.router.jgroups_3.4.12.jar:na]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
                at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
                at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11504) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11453) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:11379) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:653) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:637) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
                at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:274) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]

Exception in Cassandra system logs:
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 ReadCallback.java (line 100) Read timeout: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 5 responses.
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 Tracing.java (line 159) request complete
TRACE [Thrift:49] 2014-11-20 15:36:50,653 AbstractReadExecutor.java (line 109) reading digest from /10.61.16.18
DEBUG [Thrift:4] 2014-11-20 15:36:50,653 CustomTThreadPoolServer.java (line 204) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is often indicative of an internal error on the server side. Please check your server logs.)
                at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
                at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
                at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
                at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
                at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
                at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:744)

cassandra.yaml configuration on all nodes:
rpc_address: private hostname
listen_address: public hostname
seeds: public hostnames of all 6 nodes in both data centers

Cassandra Topology file
host2_pub=DC1:RAC1
host3_pub=DC1:RAC1
host1_pub=DC1:RAC1
geo1_host=DC2:RAC1
geo2_host=DC2:RAC1
geo3_host=DC2:RAC1
default=DC1:RAC1 (for DC1 nodes) / default=DC2:RAC1 (for DC2 nodes)

host<n>_pub= public hostname
geo<n>_host= public hostname of nodes in remote DC
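
As a sanity check on the topology file, the entries can be grouped by data 
center to confirm that each DC lists the expected 3 nodes. This is a minimal 
sketch that assumes the file uses the plain key=value layout shown above; 
TopologyCheck and nodesByDc are hypothetical names, and the hostnames are the 
placeholders from this report, not real hosts.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.TreeSet;

public final class TopologyCheck {

    /** Groups host entries by data center, skipping the "default" entry. */
    public static Map<String, List<String>> nodesByDc(String topologyText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(topologyText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, List<String>> byDc = new TreeMap<>();
        // Sort host names so the output is deterministic.
        for (String host : new TreeSet<>(props.stringPropertyNames())) {
            if (host.equals("default")) {
                continue;
            }
            // Values look like "DC1:RAC1"; the part before ':' is the DC.
            String dc = props.getProperty(host).split(":")[0].trim();
            byDc.computeIfAbsent(dc, k -> new ArrayList<>()).add(host);
        }
        return byDc;
    }

    public static void main(String[] args) {
        String topology = String.join("\n",
                "host2_pub=DC1:RAC1",
                "host3_pub=DC1:RAC1",
                "host1_pub=DC1:RAC1",
                "geo1_host=DC2:RAC1",
                "geo2_host=DC2:RAC1",
                "geo3_host=DC2:RAC1",
                "default=DC1:RAC1");
        System.out.println(nodesByDc(topology));
        // prints {DC1=[host1_pub, host2_pub, host3_pub], DC2=[geo1_host, geo2_host, geo3_host]}
    }
}
```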

Keyspace configuration

CREATE KEYSPACE vs WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC2': '3',
  'DC1': '3'
};
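
For reference on question 1, the number of replies each consistency level 
needs follows from the replication factors above, using the usual quorum 
formula floor(n / 2) + 1. The sketch below only does that arithmetic; 
QuorumMath is a hypothetical name, not part of Cassandra or Hector.

```java
public final class QuorumMath {

    /** Quorum size for a given number of replicas: floor(n / 2) + 1. */
    public static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    public static void main(String[] args) {
        int rfDc1 = 3; // 'DC1': '3' in the keyspace definition
        int rfDc2 = 3; // 'DC2': '3' in the keyspace definition

        // LOCAL_QUORUM counts only replicas in the coordinator's DC.
        System.out.println("LOCAL_QUORUM in DC1 = " + quorum(rfDc1));
        // QUORUM counts replicas across all DCs.
        System.out.println("QUORUM (both DCs)   = " + quorum(rfDc1 + rfDc2));
        // prints:
        // LOCAL_QUORUM in DC1 = 2
        // QUORUM (both DCs)   = 4
    }
}
```

The Cassandra log line "received only 5 responses" is larger than either of 
these values, which suggests that this particular read was waiting on more 
replicas than LOCAL_QUORUM alone requires. That is an inference from the log 
message, not something confirmed in this report.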

Cassandra Version: 2.0.3
Hector: 1.1.0.E001

> Strange problem regarding Cassandra nodes
> -----------------------------------------
>
>                 Key: CASSANDRA-8352
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8352
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Unix, Cassandra 2.0.3
>            Reporter: Akhtar Hussain
>              Labels: DataCenter, GEO-Red



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
