[ 
https://issues.apache.org/jira/browse/CASSANDRA-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072081#comment-14072081
 ] 

graham sanderson commented on CASSANDRA-7546:
---------------------------------------------

bq. It doesn't look to me like we re-copy the ranges (only the arrays we store 
them in)

Oops, yeah you are correct

{quote}
I would rather we didn't increase the amount of memory we use. In 2.1 I'm 
stricter about this, because in 2.0 we can mitigate it by replacing 
AtomicReference with a volatile and an AtomicReferenceFieldUpdater. But 
whatever we do in 2.1 has to be free memory-wise. This means we have 1 integer 
or 1 reference to play with in the outer class (not the holder), as we can get 
this for free. We don't need to maintain a size in 2.1 though, so this is easy. 
We can track the actual amount of memory allocated (since we already do this).
{quote}

I'm all for not wasting memory; after all, that is what this patch is about. I'm 
not sure exactly what "whatever we do in 2.1 has to be _free_ memory-wise" 
means... however I assume the end result is that you don't want either the 
Atomic***Columns or the Holder object to grow at all (i.e. by another 8 bytes), 
and I'm assuming you're calculating space based on the compressed-oops object 
layout (so we may have a chance to fill in a spare 32-bit value somewhere; I'll 
have to check the two classes in the 2.0 and 2.1 cases). Note the reason I'm 
confused about "free" is that the Object[] for the btree are on-heap things and 
we allocate quite a lot of them. Perhaps by "free" you mean no increase in 
memory usage vs today for this change.
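
To check the two classes for a spare 32-bit slot under compressed oops, 
something like JOL (org.openjdk:jol-core) could print the field layout - just a 
sketch of how I'd check, not part of any patch:

{code:java}
import org.openjdk.jol.info.ClassLayout;

// Prints the field/padding layout of AtomicSortedColumns under the current VM
// settings (run with and without -XX:-UseCompressedOops to compare).
public class LayoutCheck
{
    public static void main(String[] args)
    {
        System.out.println(ClassLayout.parseClass(
                org.apache.cassandra.db.AtomicSortedColumns.class).toPrintable());
    }
}
{code}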

bq. get the current time in ms (but from nanoTime since we need monotonicity);

Also slightly confused; nanoTime is not monotonic, but nanoTime minus some 
static base nanoTime is for all practical purposes, so I assume that's what you 
mean. Based on that I guess we can use Integer.MIN_VALUE as a "no one has wasted 
work yet" flag.
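
For concreteness, here is the clock shape I'm assuming (class and method names 
are mine, not from any patch):

{code:java}
// Milliseconds derived from nanoTime relative to a static base, so the value is
// monotonic for practical purposes within the JVM; Integer.MIN_VALUE is reserved
// as the "no one has wasted work yet" sentinel. Illustrative only.
public final class ApproxMonotonicMillis
{
    private static final long BASE_NANOS = System.nanoTime();

    // sentinel: no thread has recorded wasted work yet
    public static final int NONE = Integer.MIN_VALUE;

    // elapsed ms since class init; an int is plenty (~24 days before overflow)
    public static int elapsedMillis()
    {
        return (int) ((System.nanoTime() - BASE_NANOS) / 1000000L);
    }
}
{code}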

bq. In 2.0 we multiply the number of updates we had made by lg2(N) (N = current 
tree size), and multiply this by 100 (approximate size of snaptree nodes) + ~200 
per clone

By "number of updates" do you mean individual column attempts?
Which clones are you talking about? I have currently moved them outside the 
loop, which allows for pre-sharing and for shrinking the locked work later, but 
this extra int[] is not free (unless we are only talking about retained space vs 
temporary).
I guess we should probably always round up to 1K... that would still be 100,000 
CAS fails a second, which is certainly bad.
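
For my own sanity, here is the waste estimate as I currently read it (the ~100 
and ~200 byte constants are from your comment; how they combine, and the 1K 
floor, are my assumptions):

{code:java}
// Rough sketch of the 2.0 wasted-allocation estimate as I read it; illustrative only.
static long estimateWastedBytes(int failedUpdates, int currentTreeSize, int clones)
{
    int lg2N = 32 - Integer.numberOfLeadingZeros(Math.max(1, currentTreeSize));
    long nodeBytes  = (long) failedUpdates * lg2N * 100; // ~100 bytes per snaptree node copied
    long cloneBytes = (long) clones * 200;               // ~200 bytes per clone
    return Math.max(nodeBytes + cloneBytes, 1024);       // always charge at least ~1K per retry
}
{code}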

Anyway, I'll double check the allocation costs in 2.0.x, use an atomic field 
updater, and make a 2.0.x patch (and see how it behaves).
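
The shape I have in mind for 2.0.x is roughly the following (names illustrative, 
not the actual patch): drop the per-instance AtomicReference and CAS the Holder 
through a static AtomicReferenceFieldUpdater over a volatile field, which saves 
the AtomicReference object per instance:

{code:java}
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// Sketch only: a volatile Holder field plus a single static updater shared by
// all instances, replacing a per-instance AtomicReference<Holder>.
class AtomicSortedColumnsSketch
{
    static final class Holder { /* sorted column state lives here */ }

    private volatile Holder ref = new Holder();

    private static final AtomicReferenceFieldUpdater<AtomicSortedColumnsSketch, Holder> refUpdater =
            AtomicReferenceFieldUpdater.newUpdater(AtomicSortedColumnsSketch.class, Holder.class, "ref");

    boolean casHolder(Holder expected, Holder updated)
    {
        return refUpdater.compareAndSet(this, expected, updated);
    }
}
{code}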

Now "max rate" sounds more like something that should be exposable via config 
(though since it is an implementation detail that will go away eventually, it 
doesn't make sense to make it a per CF thing)... I'll run my test again to see 
what a good value seems to be. But yeah if something wastes 100M/s ever, I 
think we can call mark it as "special".

Note, the one other question I have is how big a single Atomic***Columns 
instance can get - i.e. is it even possible to allocate 100MB in one, or do they 
turn over too fast?
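
For anyone following along, the contended path we keep referring to has roughly 
this shape (a simplified sketch, not the actual AtomicSortedColumns code):

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Read the current holder, build an updated copy, then CAS; every failed CAS
// throws away the freshly allocated copy and retries, which is where the
// allocation under contention comes from.
class SpinLoopShape<T>
{
    private final AtomicReference<T> ref;

    SpinLoopShape(T initial) { ref = new AtomicReference<>(initial); }

    interface Updater<T> { T apply(T current); } // allocates a new copy each call

    void update(Updater<T> updater)
    {
        while (true)
        {
            T current = ref.get();
            T updated = updater.apply(current);   // allocation happens here
            if (ref.compareAndSet(current, updated))
                return;                           // success
            // failure: 'updated' becomes garbage and we spin again
        }
    }
}
{code}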


> AtomicSortedColumns.addAllWithSizeDelta has a spin loop that allocates memory
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7546
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: graham sanderson
>            Assignee: graham sanderson
>         Attachments: 7546.20.txt, 7546.20_2.txt, 7546.20_3.txt, 
> 7546.20_alt.txt, suggestion1.txt, suggestion1_21.txt
>
>
> In order to preserve atomicity, this code attempts to read, clone/update, 
> then CAS the state of the partition.
> Under heavy contention for updating a single partition this can cause some 
> fairly staggering memory growth (the more cores on your machine, the worse it 
> gets).
> Whilst many usage patterns don't do highly concurrent updates to the same 
> partition, hinting today does, and in this case wild (order(s) of magnitude 
> more than expected) memory allocation rates can be seen (especially when the 
> updates being hinted are small updates to different partitions, which can 
> happen very fast on their own) - see CASSANDRA-7545
> It would be best to eliminate/reduce/limit the spinning memory allocation 
> whilst not slowing down the very common un-contended case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
