[jira] [Commented] (CASSANDRA-8897) Remove FileCacheService, instead pooling the buffers

Benedict (JIRA) Tue, 26 May 2015 15:28:24 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559999#comment-14559999
 ]


Benedict commented on CASSANDRA-8897:
-------------------------------------

bq. The cas on the buffer attachment is to protect against two threads 
simultaneously freeing the same buffer. Without it we have two unit tests that 
trigger the assertion in free(): testMultipleThreadsReleaseDifferentBuffer() 
and testMultipleThreadsReleaseSameBuffer(). Shall I remove the unit tests and 
accept we have a race or restore the cas?

If we double-free, it's a bug, and should be an assertion failure, not a silent 
success. If we want to guard against this, the owner should have an idempotent 
behaviour on its close method, IMO.

bq. Also the chunks are not recycled right now because the owner is not null 
but I can add this.

My expectation was that you would rewrite this whole method, with the tighter 
bound on time spent executing. I must admit at the time (I believe I left a 
comment as such) I thought they were already not being recycled in your 
implementation, but it looks like this was due to another change I made to your 
code, so I apologise for that confusion. However the recycling you were doing 
was racy, which leads onto...

bq. Your latest suggestions have reverted this:

Are you referring to the strategy of leaving the chunk in the queue and 
allocating from it if we have no room up front, but there are segments left? 
There were a few problems with this approach:

# The order of visitation on allocate meant we could leave a lot of memory 
unused
# The cost of removal was linear, and relatively expensive being a CLQ
# The number of elements we could end up with on the queue was unbounded, 
worsening this
# There are a number of possible race conditions that could have lead to a 
corrupted pool state, which is I believe the main reason I removed it (and then 
forgot :)). For instance, a thread could be executing maybeRecycle, have 
performed the check to confirm it should recycle, and then suspend. The 
allocating thread could then alliocate a buffer, immediately recycle it (all 
before the other thread wakes up), and then also execute maybeRecycle; both 
would then successfully add it back to the global pool, and then two different 
threads can have it simultaneously as their main allocation block. I haven't 
thought about if this would lead to any really problematic scenarios, but it is 
certainly undesirable.

If we have a bounded number of chunks we're allocating from, we will still be 
able to exploit this if any of these chunks have their buffers freed while 
we're still using them. We can also in a follow up ticket introduce a special 
pool of buffers we are not actively allocating from, but which we can do if we 
run out of room in the global pool and our usual avenue, but I think we've 
probably done enough for this ticket now. 

bq. The burn test reports this sort of errors with if (index == 64) in get():

I'll have a closer look at this shortly. The fact that it's occurring in the 
burn test suggests it may not be the problem, but that it may be making another 
problem visible

> Remove FileCacheService, instead pooling the buffers
> ----------------------------------------------------
>
>                 Key: CASSANDRA-8897
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8897
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: 9240_test_results.txt, 
> snapshot-1431582436640-cpu-backtraces.png, 
> snapshot-1431582436640-cpu-calltree-compression-8897.nps, 
> snapshot-1431582436640-cpu-calltree-compression-trunk.nps
>
>
> After CASSANDRA-8893, a RAR will be a very lightweight object and will not 
> need caching, so we can eliminate this cache entirely. Instead we should have 
> a pool of buffers that are page-aligned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8897) Remove FileCacheService, instead pooling the buffers

Reply via email to