[ https://issues.apache.org/jira/browse/CASSANDRA-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086703#comment-14086703 ]

Jeremy Jongsma commented on CASSANDRA-7696:
-------------------------------------------

I saw something similar over the weekend, also running 1.2.16 with internode 
encryption and authentication in two datacenters. Other observations:

1) In DC1, all nodes reported that three nodes in DC2 were UNREACHABLE in 
"nodetool describecluster"
2) In DC2, all nodes reported that all other nodes were up and reachable, 
including the problem nodes in DC1.
3) DataStax OpsCenter (running in DC2) reported no connectivity issues, all 
nodes online.
4) I verified network connectivity was fine between all nodes.
5) Restarting Cassandra on DC1 nodes immediately restored connectivity.

> Fail to reconnect to other nodes after intermittent network failure (SSL + 
> auth enabled)
> ----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7696
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 1.2.16, CentOS 6.5, cluster spread across 3 
> datacenters with 1 node each. SSL enabled between Cassandra nodes and also to 
> clients, authentication enabled.
>            Reporter: Jiri Horky
>
> We experienced a strange problem after an intermittent network failure,
> when the affected node did not reconnect to the rest of the cluster but
> still allowed users to authenticate (which was not possible during the
> actual network outage, see below). The cluster consists of 1 node in each
> of 3 datacenters; it uses C* 1.2.16 with SSL enabled both to clients and
> between C* nodes. Authentication is enabled as well.
> The problem started around 2014-08-01 when Cassandra first noticed a
> network problem:
>  INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823)
> InetAddress /A.B.D.20 is now DOWN
>  INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823)
> InetAddress mia10/A.B.C.20 is now DOWN
> The network came up for a while:
> INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809)
> InetAddress /A.B.C.20 is now UP
>  INFO [HintedHandoff:1] 2014-08-01 07:51:29,381
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
>  INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809)
> InetAddress /A.B.D.20 is now UP
>  INFO [HintedHandoff:2] 2014-08-01 07:51:29,385
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20
> But it failed to send hints:
>  INFO [HintedHandoff:1] 2014-08-01 07:51:39,389
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.C.20; aborting (0 delivered)
>  INFO [HintedHandoff:2] 2014-08-01 07:51:39,390
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.D.20; aborting (0 delivered)
> Also, the log started to be flooded with failed authentication attempts.
> My understanding is that authentication data are read at QUORUM, which
> failed because the other two nodes were down:
> ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985
> QueryMessage.java (line 97) Unexpected error during query
> com.google.common.util.concurrent.UncheckedExecutionException:
> java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
> out - received only 0 responses.
>         at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
>         at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
>         at
> com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
>         at
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
>         at
> org.apache.cassandra.service.ClientState.authorize(ClientState.java:292)
>         at
> org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172)
>         at
> org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165)
>         at
> org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149)
>         at
> org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116)
>         at
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102)
>         at
> org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113)
>         at
> org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87)
>         at
> org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
>         at
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at
> org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
>         at
> org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.RuntimeException:
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed
> out - received only 0 responses.
>         at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256)
>         at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
>         at
> org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
>         at
> org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
>         at
> org.apache.cassandra.service.ClientState$1.load(ClientState.java:278)
>         at
> org.apache.cassandra.service.ClientState$1.load(ClientState.java:275)
>         at
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
>         at
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
>         at
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
>         at
> com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
>         ... 20 more
> Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
> Operation timed out - received only 0 responses.
>         at
> org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
>         at
> org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
>         at
> org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
>         at
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
>         at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
>         ... 29 more
> These exceptions stopped appearing at 2014-08-01 08:59:17,48 (and did not
> reappear after that), and Cassandra seemed to be happy, as it marked the
> other two nodes as up for a while. But after a few tens of seconds it
> marked them as DOWN again and kept them in this state until a restart
> three days later, even though the network connectivity was stable:
>  INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
> InetAddress /A.B.C.20 is now UP
>  INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20
>  INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
> InetAddress /A.B.D.20 is now UP
>  INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
> HintedHandOffManager.java (line 296) Started hinted handoff for host:
> 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20
>  INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
> (line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
> serialized/live bytes, 32 ops)
>  INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
> 398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
> bytes, 32 ops)
>  INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
> 443) Completed flushing; nothing needed to be retained.  Commitlog
> position was ReplayPosition(segmentId=1403712545417, position=20482758)
>  INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.D.20; aborting (0 delivered)
>  INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
> HintedHandOffManager.java (line 427) Timed out replaying hints to
> /A.B.C.20; aborting (0 delivered)
>  INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
> InetAddress /A.B.D.20 is now DOWN
>  INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
> InetAddress mia10/A.B.C.20 is now DOWN
> After one day, the node started to discard the hints it had saved for the
> other nodes.
>  WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
> HintedHandoffMetrics.java (line 78) /A.B.D.20 has 9039 dropped
> hints, because node is down past configured hint window.
>  WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
> HintedHandoffMetrics.java (line 78) /A.B.C.20 has 8816 dropped
> hints, because node is down past configured hint window.
> "nodetool status" also marked the other two nodes as down. After a
> restart, the node reconnected to the cluster without any problem.
> What puzzles me is that authentication apparently started to work again
> after the network recovered, but the exchange of data did not.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
