[
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034971#comment-16034971
]
Matt Byrd commented on CASSANDRA-11748:
---------------------------------------
So I believe the crux of this problem is:
On startup, if our schema differs from that of a node on the same messaging
service version as ourselves, we pull the schema from that node when we mark it
as UP. Hence, with a large cluster and a large schema, we pull many copies of
the serialised schema onto the heap at once, causing heap pressure and
eventually OOMs.
To make matters worse, the resulting GC pauses seem to cause the other nodes to
be marked DOWN and then UP again, pulling the schema once more.
As a result the instance OOMs on startup; with a large enough schema and
cluster this is probably deterministic.
This can happen when a node has been down for a while and has missed a schema
change, or if the given upgrade path results in a schema version change which
somehow isn't reflected quickly enough locally, so maybeScheduleSchemaPull runs
and decides to pull it remotely.
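To illustrate, the pull-on-UP behaviour boils down to something like the
following. This is only a hypothetical, simplified sketch, not the actual
MigrationManager code; the class and helper names here are made up:
{code:java}
import java.net.InetAddress;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: shows the shape of "pull schema from every node we
// mark UP whose schema version differs from ours", not the real internals.
public class SchemaPullOnUpSketch
{
    private final UUID localSchemaVersion;
    private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();

    public SchemaPullOnUpSketch(UUID localSchemaVersion)
    {
        this.localSchemaVersion = localSchemaVersion;
    }

    // Called when gossip marks an endpoint as UP.
    public void onAlive(InetAddress endpoint, UUID theirSchemaVersion, boolean sameMessagingVersion)
    {
        // Pull only if the remote schema differs and the node is on our messaging version.
        if (sameMessagingVersion && !theirSchemaVersion.equals(localSchemaVersion))
        {
            // One pull per endpoint marked UP: with hundreds of endpoints all on
            // the same (newer) schema version, hundreds of serialised schema
            // copies can land on the heap at roughly the same time.
            migrationStage.submit(() -> pullSchemaFrom(endpoint));
        }
    }

    private void pullSchemaFrom(InetAddress endpoint)
    {
        // Placeholder for sending a migration request and merging the response.
    }
}
{code}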
When you start up and see hundreds of nodes, all with the same schema version
that you need, it probably doesn't make much sense to pull it from every single
one of them. If instead we just limit the number of schema migration tasks in
flight, we can limit or stop this behaviour from occurring.
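The shape of the fix is roughly the following (a minimal sketch assuming a
simple global cap, with made-up names and a made-up limit; not the actual
patch):
{code:java}
import java.net.InetAddress;
import java.util.concurrent.Semaphore;

// Illustrative sketch of "limit the number of schema pulls in flight".
public class BoundedSchemaPuller
{
    // Hypothetical cap on concurrent migration tasks; the real value is up to the patch.
    private static final int MAX_INFLIGHT_MIGRATION_TASKS = 4;
    private final Semaphore inflight = new Semaphore(MAX_INFLIGHT_MIGRATION_TASKS);

    // Returns true if a pull was actually started for this endpoint.
    public boolean maybePull(InetAddress endpoint)
    {
        if (!inflight.tryAcquire())
            return false; // enough pulls already in flight; skip this endpoint

        sendMigrationRequest(endpoint, this::release, this::release);
        return true;
    }

    private void release()
    {
        inflight.release();
    }

    private void sendMigrationRequest(InetAddress endpoint, Runnable onSuccess, Runnable onFailure)
    {
        // Placeholder: send the schema pull request; the real code would invoke
        // exactly one of the callbacks when the response (or failure) arrives.
        onSuccess.run();
    }
}
{code}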
I've got a patch which does just this and fixes a dtest reproduction I've
written.
I had some other variants that, for example, limited the number of in-flight
tasks per schema version, but it seemed that a straightforward limit was
sufficient.
Admittedly I'm not certain that the upgrade problem still exists, but starting
a node without the latest schema should still cause this problem.
There is an expiry on the limit to avoid getting stuck in a state where the
in-flight counter isn't decremented properly (which I found during testing can
occur whenever a message fails to even be sent via messaging, in which case
neither the failure nor the success callback is ever called).
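Roughly, the expiry bookkeeping looks something like this (again an
illustrative sketch with made-up names and timeout, not the patch itself):
{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Sketch of the expiry idea: record when each in-flight pull started, and if a
// callback never fires (e.g. the message was never actually sent), reclaim the
// slot after a timeout instead of leaking it forever.
public class ExpiringInflightTracker
{
    private static final long EXPIRY_NANOS = TimeUnit.MINUTES.toNanos(1); // illustrative timeout
    private final int maxInflight;
    private final Map<InetAddress, Long> inflightStartNanos = new ConcurrentHashMap<>();

    public ExpiringInflightTracker(int maxInflight)
    {
        this.maxInflight = maxInflight;
    }

    // Try to reserve a slot for a pull from this endpoint.
    public synchronized boolean tryStart(InetAddress endpoint)
    {
        expireStale();
        if (inflightStartNanos.size() >= maxInflight)
            return false;
        inflightStartNanos.put(endpoint, System.nanoTime());
        return true;
    }

    // Called from both the success and the failure callback.
    public void finish(InetAddress endpoint)
    {
        inflightStartNanos.remove(endpoint);
    }

    // Drop entries whose callbacks apparently never fired.
    private void expireStale()
    {
        long now = System.nanoTime();
        inflightStartNanos.values().removeIf(start -> now - start > EXPIRY_NANOS);
    }
}
{code}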
I'll attach some links shortly.
> Schema version mismatch may lead to Cassandra OOM at bootstrap during a
> rolling upgrade process
> -----------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-11748
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
> Project: Cassandra
> Issue Type: Bug
> Environment: Rolling upgrade process from 1.2.19 to 2.0.17.
> CentOS 6.6
> Occurred on different C* nodes in deployments of different scale (2G ~ 5G)
> Reporter: Michael Fong
> Assignee: Matt Byrd
> Priority: Critical
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran
> into OOM during bootstrap in a rolling upgrade process from 1.2.19 to 2.0.17.
> Here is the simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version
> agreement - via nodetool describecluster.
> 2. Restart a Cassandra node.
> 3. After restart, there is a chance that the restarted node has a different
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and
> any of the nodes could run into OOM.
> The following is the system.log output that occurred in one of our 2-node
> cluster test beds:
> ----------------------------------
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326
> MigrationManager.java (line 328) Gossiping my schema version
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122
> MigrationManager.java (line 328) Gossiping my schema version
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2,
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328)
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task to the other node over 100+
> times:
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414)
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> ----------------------------------
> On the other hand, Node 1 keeps updating its gossip information, then
> receives and submits migration tasks:
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496
> MigrationRequestVerbHandler.java (line 41) Received migration request from
> /192.168.88.34.
> ... (over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line
> 127) submitting migration task for /192.168.88.34
> ..... (over 50+ times)
> As a side note, we have 200+ column families defined in the Cassandra
> database, which may be related to this amount of RPC traffic.
> P.S. 2: The over-requested schema migration tasks will eventually have
> InternalResponseStage performing the schema merge operation. Since this
> operation requires a compaction for each merge, it is much slower to consume;
> thus the backlog of incoming schema migration content objects consumes all of
> the heap space and ultimately ends up in OOM!