[ 
https://issues.apache.org/jira/browse/CASSANDRA-18110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654633#comment-17654633
 ] 

Abe Ratnofsky commented on CASSANDRA-18110:
-------------------------------------------

For #3 - wouldn't disabling tracking mean that the virtual table isn't usable? 
This would put us in a situation where you can either use the vtable or stream 
successfully on larger clusters, which seems to defeat the purpose of the 
vtable.

 

I'm also in favor of #4.

 

I was also thinking of changing the pacing (via debounce) for individual file 
progress events, so a progress event does not trigger handleStreamEvent if it 
was called within the last X millis.

> Streaming fails during multiple concurrent host replacements
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-18110
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18110
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jon Meredith
>            Assignee: Jon Meredith
>            Priority: Normal
>             Fix For: 4.1.x
>
>
> Running four concurrent host replacements on a 4.1.0 development cluster has 
> repeatably failed to complete bootstrap with all four hosts failing bootsrrap 
> and staying in JOINING, logging the message.
> {code:java}
> ERROR 2022-12-07T21:15:48,860 [main] 
> org.apache.cassandra.service.StorageService:2019 - Error while waiting on 
> bootstrap to complete. Bootstrap will have to be restarted.
> {code}
> Bootstrap fails as the the FileStreamTasks on the streaming followers 
> encounter an EOF while transmitting the files.
> {code:java}
> ERROR 2022-12-07T15:49:39,164 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:718 - [Stream 
> #8d313690-7674-11ed-813f-95c261b64a82] Streaming error occurred on session 
> with peer 1.2.3.4:7000 through 1.2.3.4:40292
> org.apache.cassandra.net.AsyncChannelOutputPlus$FlushException: The channel 
> this output stream was writing to has been closed
>        at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.propagateFailedFlush(AsyncChannelOutputPlus.java:200)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.waitUntilFlushed(AsyncChannelOutputPlus.java:158)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.waitForSpace(AsyncChannelOutputPlus.java:140)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.beginFlush(AsyncChannelOutputPlus.java:97)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.net.AsyncStreamingOutputPlus.lambda$writeToChannel$0(AsyncStreamingOutputPlus.java:124)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.lambda$write$0(CassandraCompressedStreamWriter.java:89)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.net.AsyncStreamingOutputPlus.writeToChannel(AsyncStreamingOutputPlus.java:120)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.db.streaming.CassandraCompressedStreamWriter.write(CassandraCompressedStreamWriter.java:88)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.db.streaming.CassandraOutgoingFile.write(CassandraOutgoingFile.java:177)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.streaming.messages.OutgoingStreamMessage.serialize(OutgoingStreamMessage.java:87)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:45)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.streaming.messages.OutgoingStreamMessage$1.serialize(OutgoingStreamMessage.java:34)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:39)
>  ~[cassandra.jar]
>        at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$FileStreamTask.run(StreamingMultiplexedChannel.java:311)
>  [cassandra.jar]
>        at 
> org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96) 
> [cassandra.jar]
>        at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) 
> [cassandra.jar]
>        at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) 
> [cassandra.jar]
>        at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>        at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  [netty-all-4.1.58.Final.jar:4.1.58.Final]
>        at java.lang.Thread.run(Thread.java:829) [?:?]
>        Suppressed: java.nio.channels.ClosedChannelException
>                at 
> org.apache.cassandra.net.AsyncStreamingOutputPlus.doFlush(AsyncStreamingOutputPlus.java:82)
>  ~[cassandra.jar]
>                at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.flush(AsyncChannelOutputPlus.java:229)
>  ~[cassandra.jar]
>                at 
> org.apache.cassandra.net.AsyncChannelOutputPlus.close(AsyncChannelOutputPlus.java:248)
>  ~[cassandra.jar]
>                at 
> org.apache.cassandra.streaming.async.NettyStreamingChannel$1.close(NettyStreamingChannel.java:141)
>  ~[cassandra.jar]
>                at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$FileStreamTask.run(StreamingMultiplexedChannel.java:312)
>  [cassandra.jar]
>                at 
> org.apache.cassandra.concurrent.FutureTask$1.call(FutureTask.java:96) 
> [cassandra.jar]
>                at 
> org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) 
> [cassandra.jar]
>                at 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) 
> [cassandra.jar]
>                at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>                at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>                at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  [netty-all-4.1.58.Final.jar:4.1.58.Final]
>                at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: io.netty.channel.unix.Errors$NativeIoException: 
> writevAddresses(..) failed: Connection timed out
> {code}
> Running a debug build with extra logging added to 
> org.apache.cassandra.streaming.StreamSession#progress shows long periods 
> where the streaming task does not log any progress events of up to 15 mins in 
> duration. After these stalls the next attempt to write hits EOF.
> Example caught by extra logging in 
> org.apache.cassandra.streaming.StreamSession#progress
> {code:java}
> INFO  2022-12-07T15:10:56,149 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22216704/25673094 bytes (86%) 
> s
> ent to idx:0/1.2.3.4 for file /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 
> bytes 22216704 out of total 25673094
> # ... 15min delay during which other streams can be seen transmitting to the 
> same and other hosts
> INFO  2022-12-07T15:25:29,614 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22282240/25673094 bytes (86%) 
> sent to idx:0/1.2.3.4 for file 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 22282240 out of total 
> 25673094
> INFO  2022-12-07T15:25:29,615 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 22347776/25673094 bytes (87%) 
> sent to idx:0/1.2.3.4 for file 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 22347776 out of total 
> 25673094
> INFO  2022-12-07T15:25:29,626 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23855104/25673094 bytes (92%) 
> sent to idx:0/1.2.3.4 for file 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23855104 out of total 
> 25673094
> INFO  2022-12-07T15:25:29,627 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23920640/25673094 bytes (93%) 
> sent to idx:0/1.2.3.4 for file 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23920640 out of total 
> 25673094
> # ...12min delay while other streams continue to transmit
> INFO  2022-12-07T15:37:29,775 [NettyStreaming-Outbound-/1.2.3.4.7000:2] 
> org.apache.cassandra.streaming.StreamSession:1082 - Got stream progress event 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db 23986176/25673094 bytes (93%) 
> sent to idx:0/1.2.3.4 for file 
> /datadir/ks/tbl-xxx/ks-tbl-nb-87763-big-Data.db bytes 23986176 out of total 
> 25673094
> {code}
> The issue has not succesfully been reproduced in other clusters or in a 
> dtest, but is reliable reproducible on the cluster. The hosts being used are 
> relatively large and have an available CPU count of 32 causing the streaming 
> system to allow up to 32 concurrent streams.
> [https://github.com/apache/cassandra/blob/81c616826ab3c8aa96467771d0e074a874efdd77/src/java/org/apache/cassandra/streaming/async/StreamingMultiplexedChannel.java#L93]
> One possible explanation is that the stream rate limiter is not being fair 
> and blocking sends long enough for TCP user timeout to fire and cause the 
> kernel to close the socket which is detected on the next write, another is 
> something is taking unusually long flushing the socket in 
> AsyncChannelOutputPlus.beginFlush.
> Restarting the instances with the streaming TCP user timeout disabled on the 
> cluster (set to zero in the boostrapping instances config, and updated with 
> JMX on the other instances) allowed the host-replacements to complete 
> bootstrap successfully.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to