[
https://issues.apache.org/jira/browse/CASSANDRA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559999#comment-14559999
]
Benedict commented on CASSANDRA-8897:
-------------------------------------
bq. The cas on the buffer attachment is to protect against two threads
simultaneously freeing the same buffer. Without it we have two unit tests that
trigger the assertion in free(): testMultipleThreadsReleaseDifferentBuffer()
and testMultipleThreadsReleaseSameBuffer(). Shall I remove the unit tests and
accept we have a race or restore the cas?
If we double-free, it's a bug, and should be an assertion failure, not a silent
success. If we want to guard against this, the owner should have an idempotent
behaviour on its close method, IMO.
bq. Also the chunks are not recycled right now because the owner is not null
but I can add this.
My expectation was that you would rewrite this whole method, with the tighter
bound on time spent executing. I must admit at the time (I believe I left a
comment as such) I thought they were already not being recycled in your
implementation, but it looks like this was due to another change I made to your
code, so I apologise for that confusion. However the recycling you were doing
was racy, which leads onto...
bq. Your latest suggestions have reverted this:
Are you referring to the strategy of leaving the chunk in the queue and
allocating from it if we have no room up front, but there are segments left?
There were a few problems with this approach:
# The order of visitation on allocate meant we could leave a lot of memory
unused
# The cost of removal was linear, and relatively expensive being a CLQ
# The number of elements we could end up with on the queue was unbounded,
worsening this
# There are a number of possible race conditions that could have lead to a
corrupted pool state, which is I believe the main reason I removed it (and then
forgot :)). For instance, a thread could be executing maybeRecycle, have
performed the check to confirm it should recycle, and then suspend. The
allocating thread could then alliocate a buffer, immediately recycle it (all
before the other thread wakes up), and then also execute maybeRecycle; both
would then successfully add it back to the global pool, and then two different
threads can have it simultaneously as their main allocation block. I haven't
thought about if this would lead to any really problematic scenarios, but it is
certainly undesirable.
If we have a bounded number of chunks we're allocating from, we will still be
able to exploit this if any of these chunks have their buffers freed while
we're still using them. We can also in a follow up ticket introduce a special
pool of buffers we are not actively allocating from, but which we can do if we
run out of room in the global pool and our usual avenue, but I think we've
probably done enough for this ticket now.
bq. The burn test reports this sort of errors with if (index == 64) in get():
I'll have a closer look at this shortly. The fact that it's occurring in the
burn test suggests it may not be the problem, but that it may be making another
problem visible
> Remove FileCacheService, instead pooling the buffers
> ----------------------------------------------------
>
> Key: CASSANDRA-8897
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8897
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Stefania
> Fix For: 3.x
>
> Attachments: 9240_test_results.txt,
> snapshot-1431582436640-cpu-backtraces.png,
> snapshot-1431582436640-cpu-calltree-compression-8897.nps,
> snapshot-1431582436640-cpu-calltree-compression-trunk.nps
>
>
> After CASSANDRA-8893, a RAR will be a very lightweight object and will not
> need caching, so we can eliminate this cache entirely. Instead we should have
> a pool of buffers that are page-aligned.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)