[ https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072081#comment-14072081 ]
graham sanderson commented on CASSANDRA-7546:
---------------------------------------------

bq. It doesn't look to me like we re-copy the ranges (only the arrays we store them in)

Oops, yeah, you are correct.

{quote}
I would rather we didn't increase the amount of memory we use. In 2.1 I'm stricter about this, because in 2.0 we can mitigate it by replacing AtomicReference with a volatile and an AtomicReferenceFieldUpdater. But whatever we do in 2.1 has to be free memory-wise. This means we have 1 integer or 1 reference to play with in the outer class (not the holder), as we can get this for free. We don't need to maintain a size in 2.1 though, so this is easy. We can track the actual amount of memory allocated (since we already do this).
{quote}

I'm all for not wasting memory; after all, that is what this patch is about. I'm not sure exactly what "2.1 has to be _free_ memory-wise" means... however, I assume the end result is that you don't want either the Atomic***Columns or the Holder object to grow at all (i.e. by another 8 bytes), and I'm assuming you're calculating space based on the compressed-oops object layout (so we may have a chance to fill in a spare 32-bit value somewhere; I'll have to check the two classes in the 2.0 and 2.1 cases). Note the reason I'm confused about "free" is that the Object[] arrays for the btree are on-heap things and we allocate quite a lot of them. Perhaps by "free" you mean no increase in memory usage versus today for this change.

bq. get the current time in ms (but from nanoTime since we need monotonicity);

Also slightly confused; nanoTime on its own is not usable as a timestamp, but nanoTime minus some static base nanoTime is for all practical purposes, so I assume you mean this. Based on that, I guess we can use Integer.MIN_VALUE as a "no one has wasted work yet" flag.
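The 2.0 mitigation mentioned in the quote above can be sketched roughly as follows. This is an illustration, not the actual Cassandra code: the class and field names (PartitionState, Holder, ref) are made up, but the technique is the standard one of keeping the state in a plain volatile field and CAS-ing it through a shared static AtomicReferenceFieldUpdater, which saves the per-instance AtomicReference wrapper object:

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// Illustrative only: state lives in a plain volatile field and is CAS'd
// through one shared static updater, so no per-partition AtomicReference
// object is allocated.
class PartitionState {
    // Immutable snapshot of the state; replaced wholesale on each update.
    static final class Holder {
        final int size;
        Holder(int size) { this.size = size; }
    }

    // One reference word in this object; no wrapper allocation.
    private volatile Holder ref = new Holder(0);

    // One static updater shared by all instances of PartitionState.
    private static final AtomicReferenceFieldUpdater<PartitionState, Holder> REF_UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(PartitionState.class, Holder.class, "ref");

    boolean casHolder(Holder expected, Holder update) {
        return REF_UPDATER.compareAndSet(this, expected, update);
    }

    Holder holder() { return ref; }
}
```

The trade-off is purely footprint: the updater gives the same CAS semantics as AtomicReference while the field costs only one word per instance.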
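The nanoTime-minus-base idea can be sketched like this (names are illustrative, not from the actual patch; note an int of milliseconds wraps after roughly 24 days, so a real implementation would have to tolerate wraparound):

```java
// Illustrative sketch: nanoTime has an arbitrary origin, so subtract a base
// captured at class load and convert to milliseconds. The result starts at 0,
// so Integer.MIN_VALUE is safe to reserve as the "no one has wasted work yet"
// flag. An int of ms wraps after ~24.8 days; a real implementation would need
// to account for that.
class WastedWorkClock {
    static final int NO_WASTED_WORK = Integer.MIN_VALUE;
    private static final long BASE_NANOS = System.nanoTime();

    // Milliseconds elapsed since the base; non-negative within the wrap
    // window, so it never collides with the NO_WASTED_WORK sentinel.
    static int nowMillis() {
        return (int) ((System.nanoTime() - BASE_NANOS) / 1_000_000L);
    }
}
```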
bq. In 2.0 we multiply the number of updates we had made by lg2(N) (N = current tree size), and multiply this by 100 (approximate size of snaptree nodes) + ~200 per clone by number of updates

Do you mean individual column attempts? Which clones are you talking about? I have currently moved them outside the loop, which allowed for pre-sharing and for shrinking the locked work later, but this extra int[] is not free (unless we are only talking about retained space versus temporary). I guess we should probably always round up to 1K... that would still be 100,000 CAS fails a second, which is certainly bad.

Anyway, I'll double check the allocation costs in 2.0.x, use an atomic field updater, and make a 2.0.x patch (and see how it behaves).

Now "max rate" sounds more like something that should be exposable via config (though since it is an implementation detail that will go away eventually, it doesn't make sense to make it a per-CF thing)... I'll run my test again to see what a good value seems to be. But yeah, if something ever wastes 100M/s, I think we can mark it as "special".

Note, the one other question I have is how big a single Atomic***Instance can get - i.e. is it even possible to allocate 100MB in one, or do they turn over too fast?

> AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: graham sanderson
>            Assignee: graham sanderson
>         Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt
>
>
> In order to preserve atomicity, this code attempts to read, clone/update, then CAS the state of the partition.
> Under heavy contention for updating a single partition this can cause some fairly staggering memory growth (the more cores on your machine, the worse it gets).
> Whilst many usage patterns don't do highly concurrent updates to the same partition, hinting today does, and in this case wild (order(s) of magnitude more than expected) memory allocation rates can be seen (especially when the updates being hinted are small updates to different partitions, which can happen very fast on their own) - see CASSANDRA-7545
> It would be best to eliminate/reduce/limit the spinning memory allocation whilst not slowing down the very common un-contended case.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
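The read/clone/CAS spin loop described in the ticket can be sketched minimally as follows. This is an illustration only, using an AtomicReference over an int[] rather than the real SnapTree-backed holder in AtomicSortedColumns; the point is that every failed CAS discards a whole freshly allocated clone, which is why contention blows up the allocation rate:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of the contended pattern: each attempt clones the
// current state, applies the update, then tries to CAS the new state in.
// A failed CAS wastes the entire clone, so allocation scales with the
// number of failed attempts, not the number of successful updates.
class SpinLoopSketch {
    private final AtomicReference<int[]> state = new AtomicReference<>(new int[0]);

    // Returns the number of attempts taken (1 in the un-contended case).
    int append(int value) {
        int attempts = 0;
        while (true) {
            attempts++;
            int[] current = state.get();
            // Clone + update: one allocation per attempt, wasted on CAS failure.
            int[] updated = Arrays.copyOf(current, current.length + 1);
            updated[current.length] = value;
            if (state.compareAndSet(current, updated))
                return attempts;
        }
    }

    int size() { return state.get().length; }
}
```

Single-threaded, append always succeeds on the first attempt; the pathology the ticket describes appears only when many threads hammer the same reference, each repeatedly paying the clone cost for a CAS that loses the race.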