[ 
https://issues.apache.org/jira/browse/CASSANDRA-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-10693:
------------------------------------------
    Issue Type: Improvement  (was: Bug)

> Bootstrapping needs improved error handling for disk failure on existing node
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10693
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10693
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.11
> Cluster of 2 or 3 xeon servers (8 core, 24 GB RAM each)
>            Reporter: Heiko Sommer
>
> (This is the original problem description. Later the title of this ticket was 
> changed and new findings were added as a comment.)
> I'm playing with a test cluster, trying to add a 3rd server while writing to 
> the cluster at max rate (executeAsync with pushing back). Single table for 
> time series data, with RF=2, STCS.
> The bootstrap failure that I repeatedly get may be the same as in 
> CASSANDRA-9440 (or also CASSANDRA-6565).
> Even if I could join the 3rd node using temporarily modified configuation 
> ({{streaming_socket_timeout_in_ms}}, {{phi_convict_threshold}} etc), I think 
> that at least improving the logs for this failure case could be worthwhile.
> When starting the new node (134.171.18.228), the underlying problem seems to 
> happen very early in the bootstrapping, on one of the existing cluster nodes 
> (134.171.18.231):
> {noformat}
> INFO  [STREAM-INIT-/134.171.18.228:45331] 2015-11-11 16:21:30,344 
> StreamResultFuture.java:109 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc 
> ID#0] Creating new streaming plan for Bootstrap
> INFO  [STREAM-INIT-/134.171.18.228:45331] 2015-11-11 16:21:30,344 
> StreamResultFuture.java:116 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc, 
> ID#0] Received streaming plan for Bootstrap
> INFO  [STREAM-INIT-/134.171.18.228:45333] 2015-11-11 16:21:30,350 
> StreamResultFuture.java:116 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc, 
> ID#0] Received streaming plan for Bootstrap
> ERROR [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,505 
> StreamSession.java:505 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Streaming error occurred
> java.lang.AssertionError: Illegal bounds [2270144..2270152); size: 2270144
>       at org.apache.cassandra.io.util.Memory.checkBounds(Memory.java:328) 
> ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:98) 
> ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at org.apache.cassandra.io.util.Memory.getLong(Memory.java:249) 
> ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.io.compress.CompressionMetadata.getChunksForSections(CompressionMetadata.java:253)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.messages.OutgoingFileMessage.<init>(OutgoingFileMessage.java:70)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.StreamTransferTask.addTransferFile(StreamTransferTask.java:58)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.StreamSession.addTransferFiles(StreamSession.java:378)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:272)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:521) 
> ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:453)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at 
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:256)
>  ~[apache-cassandra-2.1.11.jar:2.1.11]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> INFO  [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,510 
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Session with /134.171.18.228 is complete
> WARN  [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,511 
> StreamResultFuture.java:207 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Stream failed
> {noformat}
> On the joining node, this failure is not reported at all:
> {noformat}
> INFO  [main] 2015-11-11 16:21:30,000 StorageService.java:1120 - JOINING: 
> Starting to bootstrap...
> INFO  [main] 2015-11-11 16:21:30,336 StreamResultFuture.java:86 - [Stream 
> #43deef20-8890-11e5-a633-c16ac7efddbc] Executing streaming plan for Bootstrap
> INFO  [StreamConnectionEstablisher:2] 2015-11-11 16:21:30,337 
> StreamSession.java:220 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Starting streaming to /134.171.18.229
> INFO  [StreamConnectionEstablisher:1] 2015-11-11 16:21:30,337 
> StreamSession.java:220 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Starting streaming to /134.171.18.231
> INFO  [StreamConnectionEstablisher:1] 2015-11-11 16:21:30,348 
> StreamCoordinator.java:209 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc, 
> ID#0] Beginning stream session with /134.171.18.231
> INFO  [StreamConnectionEstablisher:2] 2015-11-11 16:21:30,348 
> StreamCoordinator.java:209 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc, 
> ID#0] Beginning stream session with /134.171.18.229
> INFO  [STREAM-IN-/134.171.18.231] 2015-11-11 16:21:41,508 
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Session with /134.171.18.231 is complete
> INFO  [STREAM-IN-/134.171.18.229] 2015-11-11 16:21:42,639 
> StreamResultFuture.java:166 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc 
> ID#0] Prepare completed. Receiving 66 files(120500434815 bytes), sending 0 
> files(0 bytes)
> {noformat}
> However, ~9 hours later, the bootstrapping fails, with logs:
> {noformat}
> INFO  [StreamReceiveTask:8] 2015-11-12 01:25:26,837 
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Session with /134.171.18.229 is complete
> WARN  [StreamReceiveTask:8] 2015-11-12 01:25:26,839 
> StreamResultFuture.java:207 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc] 
> Stream failed
> ERROR [main] 2015-11-12 01:25:26,843 CassandraDaemon.java:579 - Exception 
> encountered during startup
> java.lang.RuntimeException: Error during boostrap: Stream failed
>         at 
> org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86) 
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1148)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:938)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:734)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:611)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:387) 
> [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562)
>  [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651) 
> [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:415)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:688)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:651)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:139)
>  ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_40]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_40]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  ~[na:1.8.0_40]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  ~[na:1.8.0_40]
>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> WARN  [StorageServiceShutdownHook] 2015-11-12 01:25:26,855 Gossiper.java:1453 
> - No local state or state is in silent shutdown, not announcing shutdown
> {noformat}
> I hacked {{StreamState#hasFailedSession()}} to log all failed sessions, and 
> it showed exactly one to the node that had the "AssertionError: Illegal 
> bounds" before. 
> It would be great to get some help with the following questions:
> * What could cause the "AssertionError: Illegal bounds"? I'm now running a 
> cleanup just in case and will re-run later.
> * Is the streaming failure with one of the existing cluster nodes considered 
> so unimportant (because other nodes are supposed to cover this) that it 
> should not be logged on the bootstrapping node? In that case the failure 
> should not let the bootstrapping fail, I suppose.
> * Is it normal that bootstrapping can take 9 hours, or am I doing something 
> fundamentally wrong?
> * OpsCenter 5.2.2 shows my new node for the whole time under "Unknown 
> Datacenter", even though "nodetool status" shows it as joining the existing 
> datacenter. Is that an indication of a problem in my setup, or just the way 
> that OpsCenter displays joining nodes?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to