[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070533#comment-14070533 ]
graham sanderson commented on CASSANDRA-7546:
---------------------------------------------

Well, that makes sense; I hadn't checked whether there was a limit on mutator threads - we didn't change it. This probably explains the hard upper bound in my synthetic test (which, incidentally, does not do the transformation).

I agree with you on SnapTreeMap: once I saw that the "essentially free" clone operation has to acquire a lock (or at least wait for no mutations), I surmised there were probably dragons there that might cause all kinds of nastiness, whether it be pain on concurrent updates to a horribly unbalanced tree, or dragging huge amounts of garbage along due to overly lazy copy-on-write (again, I didn't look too closely). BTree looks much better (and probably does less rebalancing, since it has wider nodes, I think), though as discussed it doesn't prevent the underlying race.

So, I'll see if I have time to work on this later today, but the plan for 2.0.x (just checking) is:

a) move the transformation.apply out of the spin loop and do it once (see the first sketch below the quoted description)

b) add a one-way flip flag per AtomicSortedColumns instance, flipped when an accumulated cost reaches a certain value. I was going to calculate the delta in each mutator thread (probably adding a log-like measure per failing CAS, e.g. derived from Integer.numberOfLeadingZeros(tree.size())), though looking at SnapTreeMap again it seems tree.size() is not a safe method to call in the presence of mutations, so I guess Holder can just track the tree size itself (see the second sketch below)

c) given this is possibly a temporary solution, is it worth exposing the "cut-off" value, even undocumented, so it could be overridden in cassandra.yaml? Note the default should be such that most AtomicSortedColumns instances never hit the cut-off, since they are not both heavily contended and large (that combination indicating contended inserts, not updates)

> AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: graham sanderson
>            Assignee: graham sanderson
>         Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt
>
>
> In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition.
> Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
> Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545
> It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case.
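To make (a) concrete, here is a minimal sketch of hoisting the transformation out of the CAS retry loop. Column, Holder, cloneMe and addColumn are simplified stand-ins for the real 2.0.x types (and java.util.function.Function stands in for the Function the method actually takes); the only point is that apply() runs once per column rather than once per failed CAS:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Simplified model of (a): hoist transformation.apply out of the retry loop
// so each column is transformed (and allocated) once, not once per failed CAS.
final class SpinLoopSketch<Column>
{
    private final AtomicReference<Holder<Column>> ref = new AtomicReference<>(new Holder<>());

    long addAllWithSizeDelta(Iterable<Column> input, Function<Column, Column> transformation)
    {
        // (a) transform once, up front, instead of inside the spin loop
        List<Column> transformed = new ArrayList<>();
        for (Column c : input)
            transformed.add(transformation.apply(c));

        while (true)
        {
            Holder<Column> current = ref.get();
            Holder<Column> modified = current.cloneMe();
            long sizeDelta = 0;
            for (Column c : transformed) // no transformation allocation here
                sizeDelta += modified.addColumn(c);
            if (ref.compareAndSet(current, modified))
                return sizeDelta;
            // a failed CAS still costs a clone and re-insertion, but it no
            // longer re-runs the transformation as well
        }
    }

    // placeholder for the snap-tree holder; only the shape matters here
    static final class Holder<C>
    {
        Holder<C> cloneMe() { return new Holder<>(); }
        long addColumn(C c) { return 1; }
    }
}
{code}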
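And a rough sketch of the one-way flip in (b), with the cut-off from (c) read from a system property as a stand-in for a cassandra.yaml option. The field names, the property name, the exact cost formula and the default of 1024 are all made up for illustration:

{code}
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Sketch of (b): each failed CAS accrues a log-like cost based on the tree
// size (tracked by Holder, since SnapTreeMap.size() is not safe to call
// under concurrent mutation); once the accumulated cost crosses the cut-off,
// the instance permanently switches to a locking path.
final class FlipFlagSketch
{
    // hypothetical knob standing in for an (undocumented) cassandra.yaml value
    private static final int COST_CUTOFF =
            Integer.getInteger("cassandra.test.contention_cutoff", 1024);

    private static final AtomicIntegerFieldUpdater<FlipFlagSketch> costUpdater =
            AtomicIntegerFieldUpdater.newUpdater(FlipFlagSketch.class, "wastedCost");

    private volatile int wastedCost;  // accumulated spin cost
    private volatile boolean flipped; // one-way: never reset to false

    boolean isFlipped()
    {
        return flipped;
    }

    // called by a mutator thread after each failed compareAndSet
    void onFailedCas(int treeSize)
    {
        // 32 - numberOfLeadingZeros(n) ~ log2(n): bigger trees waste more
        // copying per failed CAS, so they accrue cost faster
        int delta = 32 - Integer.numberOfLeadingZeros(Math.max(treeSize, 1));
        if (costUpdater.addAndGet(this, delta) >= COST_CUTOFF)
            flipped = true; // from here on, take the locking path
    }
}
{code}

The flag is deliberately one-way: an instance that has proven expensive to CAS (heavily contended and large) never returns to the allocating spin path, while the vast majority of instances never accumulate enough cost to flip.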