[
https://issues.apache.org/jira/browse/CASSANDRA-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765344#comment-17765344
]
Brandon Williams commented on CASSANDRA-18792:
----------------------------------------------
It turns out the message is actually missing, there are just two similar
messages, one
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L1465]:
"restart, now UP" which we have, but we need the one
[here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L1411]:
"is now UP" which occurs in realMarkAlive after the node has responded to an
echo request. In our failure case, the echo request is failing and never seems
to be retried, which seems like another bug.
> Test failure:
> transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-18792
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18792
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Andres de la Peña
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 5.0.x, 5.x
>
>
> The Python dtest {{Test failure:
> transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup}}
> seems to be flaky at least in {{trunk}}:
> *
> https://app.circleci.com/pipelines/github/instaclustr/cassandra/2993/workflows/80ac4db3-fc3d-4908-bc39-dfff6ab88871/jobs/105464/tests
> *
> https://app.circleci.com/pipelines/github/adelapena/cassandra/3128/workflows/b0cf2754-81fd-491e-bac4-cc7fe8b0ac1b/jobs/70390/tests
> {code}
> ccmlib.node.ToolError: Subprocess ['nodetool', '-h', 'localhost', '-p',
> '7200', 'cleanup'] exited with non-zero status; exit status: 2;
> stderr: error: Node is involved in cluster membership changes. Not safe to
> run cleanup.
> -- StackTrace --
> java.lang.RuntimeException: Node is involved in cluster membership changes.
> Not safe to run cleanup.
> at
> org.apache.cassandra.service.StorageService.forceKeyspaceCleanup(StorageService.java:4037)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
> at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:262)
> at
> java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
> at
> java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
> at
> java.management/com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
> at
> java.management/com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
> at
> java.management/com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
> at
> java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:814)
> at
> java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:802)
> at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1472)
> at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1310)
> at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1405)
> at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> at
> java.rmi/sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:360)
> at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:200)
> at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:197)
> at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> at java.rmi/sun.rmi.transport.Transport.serviceCall(Transport.java:196)
> at
> java.rmi/sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:587)
> at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
> at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:705)
> at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:399)
> at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:704)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> self = <transient_replication_ring_test.TestTransientReplicationRing object
> at 0x7f6d38295710>
> @flaky(max_runs=1)
> @pytest.mark.no_vnodes
> def test_move_forwards_between_and_cleanup(self):
> """Test moving a node forwards past a neighbor token"""
> move_token = '00025'
> expected_after_move = [gen_expected(range(0, 26), range(31, 40, 2)),
> gen_expected(range(0, 21, 2), range(31, 40)),
> gen_expected(range(1, 11, 2), range(11, 21,
> 2), range(21, 31)),
> gen_expected(range(21, 26, 2), range(26, 40))]
> expected_after_repair = [gen_expected(range(0, 26)),
> gen_expected(range(0, 21), range(31, 40)),
> gen_expected(range(21, 31),),
> gen_expected(range(26, 40))]
> > self.move_test(move_token, expected_after_move, expected_after_repair)
> transient_replication_ring_test.py:291:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _
> transient_replication_ring_test.py:268: in move_test
> cleanup_nodes(nodes)
> transient_replication_ring_test.py:43: in cleanup_nodes
> node.nodetool('cleanup')
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:1018: in nodetool
> return handle_external_tool_process(p, ['nodetool', '-h', 'localhost',
> '-p', str(self.jmx_port)] + shlex.split(cmd))
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> _
> process = <subprocess.Popen object at 0x7f6d376055c0>
> cmd_args = ['nodetool', '-h', 'localhost', '-p', '7200', 'cleanup']
> def handle_external_tool_process(process, cmd_args):
> out, err = process.communicate()
> if (out is not None) and isinstance(out, bytes):
> out = out.decode()
> if (err is not None) and isinstance(err, bytes):
> err = err.decode()
> rc = process.returncode
>
> if rc != 0:
> > raise ToolError(cmd_args, rc, out, err)
> E ccmlib.node.ToolError: Subprocess ['nodetool', '-h', 'localhost',
> '-p', '7200', 'cleanup'] exited with non-zero status; exit status: 2;
> E stderr: error: Node is involved in cluster membership changes.
> Not safe to run cleanup.
> E -- StackTrace --
> E java.lang.RuntimeException: Node is involved in cluster
> membership changes. Not safe to run cleanup.
> E at
> org.apache.cassandra.service.StorageService.forceKeyspaceCleanup(StorageService.java:4037)
> E at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> E at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> E at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> E at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
> E at jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown
> Source)
> E at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> E at
> java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:262)
> E at
> java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
> E at
> java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
> E at
> java.management/com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
> E at
> java.management/com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
> E at
> java.management/com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
> E at
> java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:814)
> E at
> java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:802)
> E at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1472)
> E at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1310)
> E at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1405)
> E at
> java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
> E at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> E at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
> E at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.base/java.lang.reflect.Method.invoke(Method.java:568)
> E at
> java.rmi/sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:360)
> E at
> java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:200)
> E at
> java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:197)
> E at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
> E at
> java.rmi/sun.rmi.transport.Transport.serviceCall(Transport.java:196)
> E at
> java.rmi/sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:587)
> E at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:828)
> E at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:705)
> E at
> java.base/java.security.AccessController.doPrivileged(AccessController.java:399)
> E at
> java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:704)
> E at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> E at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> E at java.base/java.lang.Thread.run(Thread.java:833)
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:2318: ToolError
> {code}
> This hasn't been seen yet on Butler but on ad-hoc PRs. It can also be
> reproduced with the multiplexer:
> {code}
> .circleci/generate.sh -p \
> -e REPEATED_DTESTS_COUNT=500 \
> -e
> REPEATED_DTESTS=transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_between_and_cleanup
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]