[ https://issues.apache.org/jira/browse/CASSANDRA-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798508#comment-15798508 ]
Stefan Podkowinski commented on CASSANDRA-13058:
------------------------------------------------
This looks like a regression caused by CASSANDRA-12192 to me. The test seems to
work fine without the commit.
Take a look at this dtest log:
{{./run_dtests.py --vnodes false --nose-options -v hintedhandoff_test.py:TestHintedHandoff.hintedhandoff_decom_test}}
{noformat}
Node 4:
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:16,748 StorageService.java:1437 - DRAINING: starting drain process
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:20,011 StorageService.java:1437 - DRAINED

Node 1:
INFO [GossipStage:1] 2017-01-04 15:06:16,756 Gossiper.java:1036 - InetAddress /127.0.0.4 is now DOWN
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 FailureDetector.java:325 - Forcing conviction of /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,758 Gossiper.java:370 - /127.0.0.4 marked as alive: false
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:16,758 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Small] 2017-01-04 15:06:16,759 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,883 OutboundTcpConnection.java:388 - Attempting to connect to /127.0.0.4
INFO [HANDSHAKE-/127.0.0.4] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:510 - Handshaking version with /127.0.0.4
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:482 - Done connecting to /127.0.0.4
DEBUG [Native-Transport-Requests-3] 2017-01-04 15:06:20,825 FailureDetector.java:252 - /127.0.0.4 is ALIVE
INFO [RMI TCP Connection(2)-127.0.0.1] 2017-01-04 15:06:52,930 StorageService.java:1435 - LEAVING: streaming hints to other nodes
INFO [HintsDispatcher:2] 2017-01-04 15:06:52,955 HintsDispatchExecutor.java:152 - Transferring all hints to 054006b5-412c-4d4e-a28a-7610574de79d
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:53,035 OutboundTcpConnection.java:495 - Unable to connect to /127.0.0.4
java.net.ConnectException: Connection refused
    at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_112]
    at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_112]
    at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_112]
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) ~[na:1.8.0_112]
    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[main/:na]
    at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[main/:na]
    at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:397) [main/:na]
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:234) [main/:na]
{noformat}
I've added some extra log messages, so don't be confused if you can't find all of them in the code. What's happening here is that the gossiper on node1 will mark node4 down and trigger {{OutboundTcpConnectionPool.reset()}} via {{MessagingService.convict()}}. This will cause a reconnect shortly afterwards, since messages are now always retried from the backlog again (the CASSANDRA-12192 change). Unfortunately I can't tell which message is affected here. My assumption is that node4 keeps accepting messages for a short time after the gossip change has propagated and returns a response to node1, which flags node4 as alive again in the process. But I'm not entirely sure, as the node-status-related code is not easy to follow with all its side effects and unclear state handling. :(
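To make the suspected sequence easier to follow, here's a self-contained toy model of it. This is only a sketch of my reading: the class and method names are made up, and only the {{reset()}} step loosely mirrors the real {{MessagingService.convict()}} -> {{OutboundTcpConnectionPool.reset()}} path.
{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model (illustrative names, not the actual Cassandra types):
// convicting an endpoint closes its connections, but messages left in
// the backlog are retried afterwards, which forces a reconnect to the
// node we just marked down. If the draining peer still answers, the
// failure detector flips it back to ALIVE.
public class ConvictRetrySketch
{
    static class OutboundConnection
    {
        final Queue<String> backlog = new ArrayDeque<>();
        boolean connected = true;

        // Roughly what MessagingService.convict() does via
        // OutboundTcpConnectionPool.reset(): close the socket.
        void reset()
        {
            connected = false;
            System.out.println("Socket closed (conviction)");
        }

        // Since CASSANDRA-12192, pending messages are retried from the
        // backlog instead of being dropped with the old connection.
        void retryFromBacklog()
        {
            if (!backlog.isEmpty() && !connected)
            {
                System.out.println("Attempting to connect (backlog retry)");
                connected = true; // the draining peer still accepts the
                                  // handshake and answers...
                System.out.println("Got a response -> peer flagged ALIVE again");
            }
        }
    }

    public static void main(String[] args)
    {
        OutboundConnection conn = new OutboundConnection();
        conn.backlog.add("pending message");
        conn.reset();            // gossip marks the peer DOWN -> convict()
        conn.retryFromBacklog(); // the retry triggers the unwanted reconnect
    }
}
{code}
Running it prints the socket close, the retry and the "ALIVE again" transition in roughly the order the log above shows for /127.0.0.4.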
/cc [~thobbs]
> dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test
> ------------------------------------------------------------------------------
>
> Key: CASSANDRA-13058
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13058
> Project: Cassandra
> Issue Type: Test
> Components: Testing
> Reporter: Sean McCarthy
> Priority: Blocker
> Labels: dtest, test-failure
> Fix For: 3.10
>
> Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log, node4.log, node4_debug.log, node4_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_novnode_dtest/16/testReport/hintedhandoff_test/TestHintedHandoff/hintedhandoff_decom_test/
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['decommission']] exited with non-zero status; exit status: 2;
> stderr: error: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> {code}{code}
> Stacktrace
> File "/usr/lib/python2.7/unittest/case.py", line 329, in run
> testMethod()
> File "/home/automaton/cassandra-dtest/hintedhandoff_test.py", line 167, in
> hintedhandoff_decom_test
> node1.decommission()
> File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1314, in
> decommission
> self.nodetool("decommission")
> File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 783, in
> nodetool
> return handle_external_tool_process(p, ['nodetool', '-h', 'localhost',
> '-p', str(self.jmx_port), cmd.split()])
> File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1993, in
> handle_external_tool_process
> raise ToolError(cmd_args, rc, out, err)
> {code}{code}
> java.lang.RuntimeException: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
>     at org.apache.cassandra.service.StorageService.decommission(StorageService.java:3924)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
>     at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
>     at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
>     at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
>     at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
>     at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
>     at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
>     at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
>     at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
>     at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
>     at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
>     at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
>     at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:497)
>     at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323)
>     at sun.rmi.transport.Transport$1.run(Transport.java:200)
>     at sun.rmi.transport.Transport$1.run(Transport.java:197)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>     at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$241(TCPTransport.java:683)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$284/1694175644.run(Unknown Source)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}