[ 
https://issues.apache.org/jira/browse/CASSANDRA-16949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769646#comment-17769646
 ] 

Jacek Lewandowski commented on CASSANDRA-16949:
-----------------------------------------------

Ha! I've managed to reproduce it quite reliably! Just added a short sleep in 
the {{SimpleClient#maybeWrite}} loop.

[~samt], I admit that my previous explanation was wrong.

When the messages are being sent, they are split into frames. The corrupted 
frame is not the last one the test tries to send. Also, those messages are sent 
asynchronously.

The test expects the connection to be closed by the server, and the client can 
no longer read from the channel. As a result, we get an exception, which marks 
the connection as closed. In such a case, we can expect the custom error 
message to appear in the buffer. 

However, there is a race - the client may fail due to the inability to send 
some remaining frames. This can be clearly seen when we log the error message 
from the exception handling code:

expected behaviour:
{noformat}
ERROR [TEST-CLIENT:1] 2023-09-27 15:25:31,673 CQLConnectionTest.java:1059 - 
Connection exception: 
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:233)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:223)
        at 
java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356)
        at 
io.netty.buffer.UnpooledDirectByteBuf.setBytes(UnpooledDirectByteBuf.java:570)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
        at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
        at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}

unexpected behaviour (with sleep in flush loop on the client side):
{noformat}
ERROR [TEST-CLIENT:1] 2023-09-27 15:31:01,896 CQLConnectionTest.java:1059 - 
Connection exception: 
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:58)
        at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:50)
        at 
java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:462)
        at 
io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:415)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:931)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
        at 
io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:895)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:921)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907)
        at 
io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893)
        at 
io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:941)
        at 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:966)
        at 
io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:934)
        at 
org.apache.cassandra.transport.SimpleClient$SimpleFlusher.writeLargeMessage(SimpleClient.java:820)
        at 
org.apache.cassandra.transport.SimpleClient$SimpleFlusher.maybeWrite(SimpleClient.java:744)
        at 
org.apache.cassandra.transport.SimpleClient$SimpleFlusher.lambda$schedule$0(SimpleClient.java:722)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:159)
        at 
io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
 
In particular, I presume that the failure scenario is as follows:
1. C sends the correct frames
2. C sends the corrupted frame
3. S receives the corrupted frame, sends the error message, and closes the 
connection
4. C sends the correct frames
5. OS responds with RST as the socket is already closed
6. C receives RST before anything else (before the error message and legitimate 
connection close request from the S)
7. C closes the connection


> Fix flaky test org.apache.cassandra.transport.CQLConnectionTest
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-16949
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16949
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Jacek Lewandowski
>            Priority: Normal
>             Fix For: 5.x
>
>
> https://app.circleci.com/pipelines/github/dcapwell/cassandra/1037/workflows/c728d370-49b9-41aa-bdfb-8c41cf0355d8/jobs/6577/parallel-runs/3
> {code}
> [junit-timeout] Testsuite: org.apache.cassandra.transport.CQLConnectionTest
> [junit-timeout] Testsuite: org.apache.cassandra.transport.CQLConnectionTest 
> Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 12.926 sec
> [junit-timeout] 
> [junit-timeout] Testcase: 
> handleCorruptionOfLargeMessageFrame(org.apache.cassandra.transport.CQLConnectionTest):
>       FAILED
> [junit-timeout] null
> [junit-timeout] junit.framework.AssertionFailedError
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testFrameCorruption(CQLConnectionTest.java:484)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testFrameCorruption(CQLConnectionTest.java:446)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.handleCorruptionOfLargeMessageFrame(CQLConnectionTest.java:217)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Testcase: 
> testNegativeEnvelopeBodySize(org.apache.cassandra.transport.CQLConnectionTest):
>      FAILED
> [junit-timeout] expected:<[]0L> but was:<[52424]0L>
> [junit-timeout] junit.framework.AssertionFailedError: expected:<[]0L> but 
> was:<[52424]0L>
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest$AllocationObserver.lambda$verifier$0(CQLConnectionTest.java:850)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testFrameCorruption(CQLConnectionTest.java:492)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testFrameCorruption(CQLConnectionTest.java:446)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testNegativeEnvelopeBodySize(CQLConnectionTest.java:326)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Testcase: 
> testUnrecoverableMessageDecodingErrors(org.apache.cassandra.transport.CQLConnectionTest):
>    FAILED
> [junit-timeout] expected:<[]0L> but was:<[52424]0L>
> [junit-timeout] junit.framework.AssertionFailedError: expected:<[]0L> but 
> was:<[52424]0L>
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>  Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest$AllocationObserver.lambda$verifier$0(CQLConnectionTest.java:850)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testFrameCorruption(CQLConnectionTest.java:492)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testUnrecoverableMessageDecodingErrors(CQLConnectionTest.java:392)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Testcase: 
> testRecoverableEnvelopeDecodingErrors(org.apache.cassandra.transport.CQLConnectionTest):
>     FAILED
> [junit-timeout] null
> [junit-timeout] junit.framework.AssertionFailedError
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.runTest(CQLConnectionTest.java:430)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testEnvelopeDecodingErrors(CQLConnectionTest.java:292)
> [junit-timeout]       at 
> org.apache.cassandra.transport.CQLConnectionTest.testRecoverableEnvelopeDecodingErrors(CQLConnectionTest.java:264)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> [junit-timeout]       at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [junit-timeout] 
> [junit-timeout] 
> [junit-timeout] Test org.apache.cassandra.transport.CQLConnectionTest FAILED
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to