[https://issues.apache.org/jira/browse/CASSANDRA-8352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
Akhtar Hussain updated CASSANDRA-8352:
--------------------------------------
Since Version: 2.0.3
Description:
We have a geo-redundant setup with 2 data centers of 3 nodes each. When we bring
down a single Cassandra node in DC2 with kill -9 <Cassandra-pid>, reads fail
in DC1 with TimedOutException for a brief period (roughly 15-20 seconds).
Questions:
1. We need to understand why reads fail in DC1 when a node in another DC
(DC2) fails. Since we use LOCAL_QUORUM for both reads and writes in DC1, a
request should return once 2 nodes in the local DC have replied, instead of
timing out because of a node in the remote DC.
2. We want to make sure that no Cassandra requests fail when a node fails.
We tried rapid read protection with ALWAYS, 99percentile and 10ms, as
described in
http://www.datastax.com/dev/blog/rapid-read-protection-in-cassandra-2-0-2, but
none of these settings helped. How can we ensure zero request failures when a
node fails?
3. What is the right way to handle HTimedOutException in Hector?
4. Please confirm whether we are using public and private hostnames as expected.
We are using Cassandra 2.0.3.
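Question 2 concerns rapid read protection, i.e. the per-table speculative_retry option (ALWAYS, Xpercentile, Yms) added in Cassandra 2.0.2: if the replica chosen for a read has not answered within a threshold, the coordinator sends the read to another replica as well. The following is only a client-side analogue of that idea, not Cassandra's implementation; the class and method names are hypothetical, and the two "replicas" are plain Callables.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SpeculativeReadSketch {
    // Hypothetical sketch: send the read to 'primary'; if it has not completed within
    // thresholdMs, also send it to 'backup' and return whichever completes first.
    static <T> T speculativeRead(ExecutorService pool, Callable<T> primary, Callable<T> backup,
                                 long thresholdMs, long extraWaitMs) throws Exception {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        Future<T> first = cs.submit(primary);
        Future<T> done = cs.poll(thresholdMs, TimeUnit.MILLISECONDS);
        if (done == null) {
            // Primary is slow (for example, its node was just killed): speculate.
            Future<T> second = cs.submit(backup);
            done = cs.poll(extraWaitMs, TimeUnit.MILLISECONDS);
            if (done == null) {
                first.cancel(true);
                second.cancel(true);
                throw new TimeoutException("no replica answered within the time limit");
            }
            second.cancel(true); // no-op if 'second' is the one that finished
        }
        first.cancel(true); // no-op if 'first' is the one that finished
        // A fast failure is rethrown (wrapped in ExecutionException), not retried:
        // this sketch only covers slowness, as speculative retry does.
        return done.get();
    }
}
```

Note that the threshold only helps when the first replica is slow; it does not by itself change how many responses a consistency level waits for.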
Environment: Unix, Cassandra 2.0.3
Labels: DataCenter GEO-Red (was: )
Summary: Strange problem regarding Cassandra nodes (was: trange problem regarding Cassandra)
Exception in Application Logs:
2014-11-20 15:36:50.653 WARN m.p.c.connection.HConnectionManager - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:286) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:269) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:132) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl.getSlice(KeyspaceServiceImpl.java:290) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:53) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery$1.doInKeyspace(ThriftSliceQuery.java:49) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:101) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftSliceQuery.execute(ThriftSliceQuery.java:48) [com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at com.ericsson.rm.cassandra.xa.keyspace.row.KeyedRowQuery.execute(KeyedRowQuery.java:77) [com.ericsson.bss.common.cassandra.xa_3.4.12.jar:na]
    at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.getRow(CassandraPersistence.java:765) [com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
    at com.ericsson.rm.voucher.traffic.persistence.cassandra.CassandraPersistence.deleteVoucher(CassandraPersistence.java:400) [com.ericsson.bss.voucher.traffic.persistence.cassandra_4.7.11.jar:na]
    at com.ericsson.rm.voucher.traffic.VoucherTraffic.commit(VoucherTraffic.java:647) [com.ericsson.bss.voucher.traffic_4.7.11.jar:na]
    at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.callCommit(VoucherTrafficDeproxy.java:448) [com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
    at com.ericsson.bss.voucher.traffic.proxy.VoucherTrafficDeproxy.call(VoucherTrafficDeproxy.java:312) [com.ericsson.bss.voucher.traffic.proxy_4.7.11.jar:na]
    at com.ericsson.rm.cluster.router.jgroups.destination.RouterDestination$RouterMessageTask.run(RouterDestination.java:333) [com.ericsson.bss.common.cluster.router.jgroups_3.4.12.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: org.apache.cassandra.thrift.TimedOutException: null
    at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11504) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at org.apache.cassandra.thrift.Cassandra$get_slice_result$get_slice_resultStandardScheme.read(Cassandra.java:11453) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:11379) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:653) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:637) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl$7.execute(KeyspaceServiceImpl.java:274) ~[com.ericsson.bss.common.hector-client_3.4.12.jar:na]
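For question 3, one common pattern is a bounded retry with exponential backoff around idempotent reads such as the getRow call in the trace above. The sketch below is a minimal illustration, not Hector's API: TransientTimeoutException is a hypothetical stand-in for me.prettyprint.hector.api.exceptions.HTimedOutException so that the code compiles without Hector, and withRetry is a hypothetical helper.

```java
import java.util.concurrent.Callable;

public class TimeoutRetrySketch {
    /** Hypothetical stand-in for Hector's HTimedOutException (so this compiles without Hector). */
    static class TransientTimeoutException extends RuntimeException {
        TransientTimeoutException(String msg) { super(msg); }
    }

    // Retries 'op' up to maxAttempts times when it throws TransientTimeoutException,
    // doubling the sleep between attempts; any other exception propagates immediately.
    static <T> T withRetry(Callable<T> op, int maxAttempts, long initialBackoffMs) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long backoff = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientTimeoutException e) {
                if (attempt >= maxAttempts) throw e; // give up and let the caller decide
                Thread.sleep(backoff);
                backoff *= 2;
            }
        }
    }
}
```

Retrying is only safe for idempotent operations, and the total backoff would need to be sized against the observed failure window (15-20 seconds in this report); both are choices for the application, not something this sketch decides.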
Exception in Cassandra system logs:
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 ReadCallback.java (line 100) Read timeout: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 5 responses.
DEBUG [Thrift:4] 2014-11-20 15:36:50,652 Tracing.java (line 159) request complete
TRACE [Thrift:49] 2014-11-20 15:36:50,653 AbstractReadExecutor.java (line 109) reading digest from /10.61.16.18
DEBUG [Thrift:4] 2014-11-20 15:36:50,653 CustomTThreadPoolServer.java (line 204) Thrift transport error occurred during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote side has closed. Tried to read 4 bytes, but only got 0 bytes. (This is often indicative of an internal error on the server side. Please check your server logs.)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
cassandra.yaml configuration on all nodes:
rpc_address: private hostname
listen_address: public hostname
seeds: public hostnames of all 6 nodes in both data centers
Cassandra Topology file
host2_pub=DC1:RAC1
host3_pub=DC1:RAC1
host1_pub=DC1:RAC1
geo1_host=DC2:RAC1
geo2_host=DC2:RAC1
geo3_host=DC2:RAC1
default= DC1:RAC1 (for DC1 nodes) / default= DC2 :RAC1 (for DC2 nodes)
host<n>_pub= public hostname
geo<n>_host= public hostname of nodes in remote DC
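The topology file above appears to use the PropertyFileSnitch format: each entry maps a node to DC:RACK, and nodes without an entry fall back to the "default" line. The sketch below illustrates that lookup with java.util.Properties; the class and method names are hypothetical. It matches plain string keys and does no DNS resolution, and the whitespace trimming is this sketch's own choice; the real snitch may differ on both points.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class TopologyLookupSketch {
    // Returns {dc, rack} for 'host', falling back to the "default" entry.
    static String[] lookup(Properties topology, String host) {
        String value = topology.getProperty(host, topology.getProperty("default"));
        if (value == null) throw new IllegalArgumentException("no entry and no default for " + host);
        String[] parts = value.split(":");
        if (parts.length != 2) throw new IllegalArgumentException("expected DC:RACK, got " + value);
        return new String[] { parts[0].trim(), parts[1].trim() };
    }

    // Parses topology text in java.util.Properties syntax ("key=value" per line).
    static Properties load(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }
}
```

With the DC1 nodes' file, a lookup of geo1_host should yield DC2, while any host missing from the file silently lands in the default DC; for question 4, that is why every key must match the name or address by which the node is actually known to the cluster.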
Keyspace configuration
CREATE KEYSPACE vs WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC2': '3',
'DC1': '3'
};
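For question 1, the number of replica responses a read waits for follows from this keyspace's replication factors (DC1: 3, DC2: 3). The sketch below computes it for a few consistency levels, using quorum = floor(n / 2) + 1; the class name is hypothetical. For a coordinator in DC1, LOCAL_QUORUM needs 2 responses, QUORUM needs 4 and ALL needs 6. With one DC2 node down, all 3 DC1 replicas are still alive, so LOCAL_QUORUM's count of 2 remains reachable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class QuorumSketch {
    // Majority of n replicas: floor(n / 2) + 1.
    static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // Responses a read must collect per consistency level, given per-DC replication
    // factors and the coordinator's local DC.
    static Map<String, Integer> blockFor(Map<String, Integer> rfByDc, String localDc) {
        Integer localRf = rfByDc.get(localDc);
        if (localRf == null) throw new IllegalArgumentException("unknown DC: " + localDc);
        int total = 0;
        for (int rf : rfByDc.values()) total += rf;
        Map<String, Integer> out = new LinkedHashMap<>();
        out.put("ONE", 1);
        out.put("LOCAL_QUORUM", quorum(localRf)); // only local-DC replicas count
        out.put("QUORUM", quorum(total));         // majority across both DCs
        out.put("ALL", total);
        return out;
    }
}
```

By this arithmetic, the 5 responses reported in the Cassandra log would meet the count for every level listed except ALL. The count alone does not settle what happened, since a read can also be waiting on a full-data (not digest) response, but it is a useful cross-check against the consistency level the client believes it is sending.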
Cassandra Version: 2.0.3
Hector: 1.1.0.E001
> Strange problem regarding Cassandra nodes
> -----------------------------------------
>
> Key: CASSANDRA-8352
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8352
> Project: Cassandra
> Issue Type: Bug
> Environment: Unix, Cassandra 2.0.3
> Reporter: Akhtar Hussain
> Labels: DataCenter, GEO-Red
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)