[
https://issues.apache.org/jira/browse/CASSANDRA-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aleksey Yeschenko updated CASSANDRA-10693:
------------------------------------------
Issue Type: Improvement (was: Bug)
> Bootstrapping needs improved error handling for disk failure on existing node
> -----------------------------------------------------------------------------
>
> Key: CASSANDRA-10693
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10693
> Project: Cassandra
> Issue Type: Improvement
> Components: Streaming and Messaging
> Environment: Cassandra 2.1.11
> Cluster of 2 or 3 xeon servers (8 core, 24 GB RAM each)
> Reporter: Heiko Sommer
>
> (This is the original problem description. Later the title of this ticket was
> changed and new findings were added as a comment.)
> I'm playing with a test cluster, trying to add a 3rd server while writing to
> the cluster at max rate (executeAsync with pushing back). Single table for
> time series data, with RF=2, STCS.
> The bootstrap failure that I repeatedly get may be the same as in
> CASSANDRA-9440 (or also CASSANDRA-6565).
> Even if I could join the 3rd node using temporarily modified configuation
> ({{streaming_socket_timeout_in_ms}}, {{phi_convict_threshold}} etc), I think
> that at least improving the logs for this failure case could be worthwhile.
> When starting the new node (134.171.18.228), the underlying problem seems to
> happen very early in the bootstrapping, on one of the existing cluster nodes
> (134.171.18.231):
> {noformat}
> INFO [STREAM-INIT-/134.171.18.228:45331] 2015-11-11 16:21:30,344
> StreamResultFuture.java:109 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc
> ID#0] Creating new streaming plan for Bootstrap
> INFO [STREAM-INIT-/134.171.18.228:45331] 2015-11-11 16:21:30,344
> StreamResultFuture.java:116 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc,
> ID#0] Received streaming plan for Bootstrap
> INFO [STREAM-INIT-/134.171.18.228:45333] 2015-11-11 16:21:30,350
> StreamResultFuture.java:116 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc,
> ID#0] Received streaming plan for Bootstrap
> ERROR [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,505
> StreamSession.java:505 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Streaming error occurred
> java.lang.AssertionError: Illegal bounds [2270144..2270152); size: 2270144
> at org.apache.cassandra.io.util.Memory.checkBounds(Memory.java:328)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:98)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at org.apache.cassandra.io.util.Memory.getLong(Memory.java:249)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.io.compress.CompressionMetadata.getChunksForSections(CompressionMetadata.java:253)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.messages.OutgoingFileMessage.<init>(OutgoingFileMessage.java:70)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.StreamTransferTask.addTransferFile(StreamTransferTask.java:58)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.StreamSession.addTransferFiles(StreamSession.java:378)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:272)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:521)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:453)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:256)
> ~[apache-cassandra-2.1.11.jar:2.1.11]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> INFO [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,510
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Session with /134.171.18.228 is complete
> WARN [STREAM-IN-/134.171.18.228] 2015-11-11 16:21:41,511
> StreamResultFuture.java:207 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Stream failed
> {noformat}
> On the joining node, this failure is not reported at all:
> {noformat}
> INFO [main] 2015-11-11 16:21:30,000 StorageService.java:1120 - JOINING:
> Starting to bootstrap...
> INFO [main] 2015-11-11 16:21:30,336 StreamResultFuture.java:86 - [Stream
> #43deef20-8890-11e5-a633-c16ac7efddbc] Executing streaming plan for Bootstrap
> INFO [StreamConnectionEstablisher:2] 2015-11-11 16:21:30,337
> StreamSession.java:220 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Starting streaming to /134.171.18.229
> INFO [StreamConnectionEstablisher:1] 2015-11-11 16:21:30,337
> StreamSession.java:220 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Starting streaming to /134.171.18.231
> INFO [StreamConnectionEstablisher:1] 2015-11-11 16:21:30,348
> StreamCoordinator.java:209 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc,
> ID#0] Beginning stream session with /134.171.18.231
> INFO [StreamConnectionEstablisher:2] 2015-11-11 16:21:30,348
> StreamCoordinator.java:209 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc,
> ID#0] Beginning stream session with /134.171.18.229
> INFO [STREAM-IN-/134.171.18.231] 2015-11-11 16:21:41,508
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Session with /134.171.18.231 is complete
> INFO [STREAM-IN-/134.171.18.229] 2015-11-11 16:21:42,639
> StreamResultFuture.java:166 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc
> ID#0] Prepare completed. Receiving 66 files(120500434815 bytes), sending 0
> files(0 bytes)
> {noformat}
> However, ~9 hours later, the bootstrapping fails, with logs:
> {noformat}
> INFO [StreamReceiveTask:8] 2015-11-12 01:25:26,837
> StreamResultFuture.java:180 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Session with /134.171.18.229 is complete
> WARN [StreamReceiveTask:8] 2015-11-12 01:25:26,839
> StreamResultFuture.java:207 - [Stream #43deef20-8890-11e5-a633-c16ac7efddbc]
> Stream failed
> ERROR [main] 2015-11-12 01:25:26,843 CassandraDaemon.java:579 - Exception
> encountered during startup
> java.lang.RuntimeException: Error during boostrap: Stream failed
> at
> org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:86)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1148)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:938)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:734)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:611)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:387)
> [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:562)
> [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:651)
> [apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
> at
> org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:208)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:184)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:415)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:688)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:651)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:139)
> ~[apache-cassandra-2.1.11-SNAPSHOT.jar:2.1.11-SNAPSHOT]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_40]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[na:1.8.0_40]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_40]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_40]
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_40]
> WARN [StorageServiceShutdownHook] 2015-11-12 01:25:26,855 Gossiper.java:1453
> - No local state or state is in silent shutdown, not announcing shutdown
> {noformat}
> I hacked {{StreamState#hasFailedSession()}} to log all failed sessions, and
> it showed exactly one to the node that had the "AssertionError: Illegal
> bounds" before.
> It would be great to get some help with the following questions:
> * What could cause the "AssertionError: Illegal bounds"? I'm now running a
> cleanup just in case and will re-run later.
> * Is the streaming failure with one of the existing cluster nodes considered
> so unimportant (because other nodes are supposed to cover this) that it
> should not be logged on the bootstrapping node? In that case the failure
> should not let the bootstrapping fail, I suppose.
> * Is it normal that bootstrapping can take 9 hours, or am I doing something
> fundamentally wrong?
> * OpsCenter 5.2.2 shows my new node for the whole time under "Unknown
> Datacenter", even though "nodetool status" shows it as joining the existing
> datacenter. Is that an indication of a problem in my setup, or just the way
> that OpsCenter displays joining nodes?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)