[ 
https://issues.apache.org/jira/browse/CASSANDRA-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129137#comment-16129137
 ] 

Matt Byrd commented on CASSANDRA-11748:
---------------------------------------

Hi [~mcfongtw],
Hopefully I'm interpreting your comments correctly; I believe they are further 
analysis of the particular problem rather than suggestions for improvement?
Firstly, I agree that having an unbounded number of concurrent migration tasks 
is the root of the problem (along with the other pre-conditions of having a 
suitably large schema and somehow missing a schema update, either by being down 
or by being on a different major version from the one where the change took 
place):
{quote}
1. Have migration checks and requests fired asynchronously, and finally stack up 
all the messages at the receiver end and merge the schema one-by-one at
{code}
Schema.instance.mergeSchemaAndAnnounceVersion()
{code}
{quote}
So I'd rather address that than try to de-dupe the schema mutations at the 
receiver end (which might help reduce how much is retained on the heap but 
ultimately doesn't get at the heart of the problem).

{quote}
2. Send the receiver the complete copy of schema, instead of delta copy of 
schema out of diff between two nodes.
{quote}
Sending the whole copy of the schema came into play here: 
https://issues.apache.org/jira/browse/CASSANDRA-1391.
I believe reverting this behaviour is probably out of scope of any 3.0 update, 
but perhaps for a future patch we can negotiate the delta rather than sending 
the whole schema. This would be a good improvement, but I don't think it's 
strictly necessary for solving this particular problem.

{quote}
3. Last but not least, the most mysterious problem that leads to OOM, and which 
we could not figure out back then, is that there are hundreds of migration 
tasks all fired nearly simultaneously, within 2 s. The number of RPCs does not 
match the number of nodes in the cluster, but is close to the number of seconds 
taken for the node to reboot.
{quote}
It's possible there is something else going on here in addition, although one 
thing that I've observed (as mentioned above) is that due to all the heap 
pressure from the large mutations sent concurrently, the node itself can pause 
for several seconds and hence both be marked as DOWN by the remote nodes and 
mark those remote nodes DOWN itself, followed by marking them UP again and doing 
another schema pull as a result. This spiral often results in many more 
migration tasks than are necessary, before either OOMing out or finally 
applying the required schema change.
If you still have your logs you could check roughly how many "now UP" messages 
for other endpoints occurred on a problematic instance and compare that to the 
number of migration tasks.

At any rate, I believe either rate limiting the migration tasks (globally or 
per schema version) or indeed coming up with an alternative mechanism which 
serialises the schema pulls should address the problem.
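
To make the per-schema-version limiting option a bit more concrete, here's a 
rough sketch of the sort of thing I have in mind (class and method names are 
purely illustrative, not the actual MigrationManager code):
{code}
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: cap the number of in-flight schema pulls per remote
// schema version, so a flood of gossip notifications can't queue an unbounded
// number of migration tasks.
public class MigrationRateLimiter
{
    private static final int MAX_INFLIGHT_PULLS_PER_VERSION = 3;
    private static final ConcurrentMap<UUID, AtomicInteger> INFLIGHT = new ConcurrentHashMap<>();

    /** Returns true if we're still under the per-version limit and may submit a pull. */
    public static boolean tryAcquire(UUID theirVersion)
    {
        AtomicInteger count = INFLIGHT.computeIfAbsent(theirVersion, v -> new AtomicInteger());
        if (count.incrementAndGet() > MAX_INFLIGHT_PULLS_PER_VERSION)
        {
            count.decrementAndGet(); // over the limit, roll back and skip this pull
            return false;
        }
        return true;
    }

    /** Called when a migration task completes (or times out) to release its slot. */
    public static void release(UUID theirVersion)
    {
        AtomicInteger count = INFLIGHT.get(theirVersion);
        if (count != null)
            count.decrementAndGet();
    }
}
{code}
In practice the counters would also need to expire, so a lost response doesn't 
permanently block pulls for a given version.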

I'll take a look at a proposition by [~iamaleksey] to record the information 
about schema versions in a map and move the actual triggering of pull requests 
onto a frequently run periodic task, which reads this map and decides an 
appropriate course of action to resolve the schema difference. (This way we can 
collect all this information arriving asynchronously and, for example, 
de-duplicate repeated calls for the same schema version from different 
endpoints.)
I think the main advantage of such an approach (as opposed to limiting the 
number of migration tasks by schema version) is that it removes the possibility 
of ending up with a stale schema due to the limiting; however, it's worth noting 
that doing the limit per schema version and expiring the limits already goes a 
long way to reducing this possibility. I'll try and dig up that version of the 
patch for reference/comparison.
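
For reference, here's a very rough sketch of that map-plus-periodic-task idea 
(all names below are placeholders rather than the real Schema/MigrationManager 
API, and the 1s period is arbitrary):
{code}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: gossip notifications just record endpoint -> version,
// and a periodic task decides which pulls to actually submit.
public class SchemaPullCoordinator
{
    // Latest schema version reported (via gossip) for each endpoint.
    private final Map<InetAddress, UUID> reportedVersions = new ConcurrentHashMap<>();

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start()
    {
        scheduler.scheduleWithFixedDelay(this::maybePullSchemas, 1, 1, TimeUnit.SECONDS);
    }

    /** Called from the gossip path instead of submitting a migration task directly. */
    public void reportVersion(InetAddress endpoint, UUID version)
    {
        reportedVersions.put(endpoint, version);
    }

    private void maybePullSchemas()
    {
        UUID localVersion = getLocalSchemaVersion();
        // Group endpoints by the differing version they advertise, which de-duplicates
        // repeated notifications of the same version from different endpoints.
        Map<UUID, Set<InetAddress>> byVersion = new HashMap<>();
        for (Map.Entry<InetAddress, UUID> e : reportedVersions.entrySet())
            if (!e.getValue().equals(localVersion))
                byVersion.computeIfAbsent(e.getValue(), v -> new HashSet<>()).add(e.getKey());

        // One pull per distinct remote version, from a single endpoint advertising it.
        for (Map.Entry<UUID, Set<InetAddress>> e : byVersion.entrySet())
            submitSchemaPull(e.getValue().iterator().next(), e.getKey());
    }

    // Placeholders for the real schema version lookup and migration request.
    private UUID getLocalSchemaVersion() { return new UUID(0, 0); }
    private void submitSchemaPull(InetAddress endpoint, UUID version) { /* send migration request */ }
}
{code}
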
[~iamaleksey], [~spod] Please let me know if there is anything in particular 
about the way you want this to behave or if you feel I've misrepresented the 
idea in any way. One further thing that did occur to me is that trying to 
balance avoiding superfluous schema pulls against ensuring we converge as 
quickly as possible might necessitate some degree of parallelism. For example, 
if we pick a node to pull schema from and it's partitioned off, so we don't 
hear back for a while (or ever), we probably want to be proactively scheduling 
a pull from elsewhere to avoid waiting too long for a timeout. I'm sure there 
will be some other details to work out too, but I think the general approach 
makes sense.
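
To illustrate that last point, something along these lines (again placeholder 
names, and the 10s fallback delay is arbitrary):
{code}
import java.net.InetAddress;
import java.util.Iterator;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only: pull from one endpoint, but don't wait for the full
// RPC timeout on a possibly partitioned node before trying another.
public class FallbackSchemaPull
{
    private static final long FALLBACK_DELAY_SECONDS = 10;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void pullWithFallback(UUID version, List<InetAddress> candidates)
    {
        Iterator<InetAddress> it = candidates.iterator();
        if (!it.hasNext())
            return;
        AtomicBoolean responded = new AtomicBoolean(false);
        submitSchemaPull(it.next(), version, responded);
        // If no response has arrived after the fallback delay, proactively try the
        // next candidate rather than waiting for the first request to time out.
        scheduler.schedule(() -> {
            if (!responded.get() && it.hasNext())
                submitSchemaPull(it.next(), version, responded);
        }, FALLBACK_DELAY_SECONDS, TimeUnit.SECONDS);
    }

    private void submitSchemaPull(InetAddress endpoint, UUID version, AtomicBoolean responded)
    {
        // Placeholder: send the migration request; the response handler would call
        // responded.set(true) and merge the received schema.
    }
}
{code}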

> Schema version mismatch may lead to Cassandra OOM at bootstrap during a 
> rolling upgrade process
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11748
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11748
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Rolling upgrade process from 1.2.19 to 2.0.17. 
> CentOS 6.6
> Occurred on different C* nodes across deployments of different scales (2G ~ 5G)
>            Reporter: Michael Fong
>            Assignee: Matt Byrd
>            Priority: Critical
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> We have observed multiple times that a multi-node C* (v2.0.17) cluster ran 
> into OOM at bootstrap during a rolling upgrade process from 1.2.19 to 2.0.17. 
> Here is the simple outline of our rolling upgrade process:
> 1. Update schema on a node, and wait until all nodes are in schema version 
> agreement - via nodetool describecluster
> 2. Restart a Cassandra node
> 3. After restart, there is a chance that the restarted node has a different 
> schema version.
> 4. All nodes in the cluster start to rapidly exchange schema information, and 
> any node could run into OOM. 
> The following is the system.log that occurred in one of our 2-node cluster 
> test beds:
> ----------------------------------
> Before rebooting node 2:
> Node 1: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,326 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> Node 2: DEBUG [MigrationStage:1] 2016-04-19 11:09:42,122 
> MigrationManager.java (line 328) Gossiping my schema version 
> 4cb463f8-5376-3baf-8e88-a5cc6a94f58f
> After rebooting node 2, 
> Node 2: DEBUG [main] 2016-04-19 11:18:18,016 MigrationManager.java (line 328) 
> Gossiping my schema version f5270873-ba1f-39c7-ab2e-a86db868b09b
> Node 2 keeps submitting the migration task 100+ times to the other 
> node.
> INFO [GossipStage:1] 2016-04-19 11:18:18,261 Gossiper.java (line 1011) Node 
> /192.168.88.33 has restarted, now UP
> INFO [GossipStage:1] 2016-04-19 11:18:18,262 TokenMetadata.java (line 414) 
> Updating topology for /192.168.88.33
> ...
> DEBUG [GossipStage:1] 2016-04-19 11:18:18,265 MigrationManager.java (line 
> 102) Submitting migration task for /192.168.88.33
> ... ( over 100+ times)
> ----------------------------------
> On the other hand, Node 1 keeps updating its gossip information, followed by 
> receiving and submitting migration tasks afterwards: 
> INFO [RequestResponseStage:3] 2016-04-19 11:18:18,333 Gossiper.java (line 
> 978) InetAddress /192.168.88.34 is now UP
> ...
> DEBUG [MigrationStage:1] 2016-04-19 11:18:18,496 
> MigrationRequestVerbHandler.java (line 41) Received migration request from 
> /192.168.88.34.
> …… ( over 100+ times)
> DEBUG [OptionalTasks:1] 2016-04-19 11:19:18,337 MigrationManager.java (line 
> 127) submitting migration task for /192.168.88.34
> .....  (over 50+ times)
> On a side note, we have 200+ column families defined in the Cassandra 
> database, which may be related to this amount of RPC traffic.
> P.S.2 The over-requested schema migration tasks will eventually have 
> InternalResponseStage performing schema merge operations. Since each merge 
> requires a compaction, the merges are consumed much more slowly than the 
> requests arrive; thus, the back-pressure of incoming schema migration content 
> objects consumes all of the heap space and ultimately ends up in OOM!


