[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544187#comment-17544187
 ] 

Piotr Kolaczkowski commented on CASSANDRA-16681:
------------------------------------------------

[~benedict] it is all described in the commit message. It answers exactly your 
questions.

Quoting the relevant part of the commit message for your convenience:

1. LocalPool is meant to be managed by a single thread.
Unfortunately there was a code path that allowed
a thread enter the context of the LocalPool owned by another
thread by attempting to recycle a chunk using its
recycler reference. This led to a situation where unsynchronized
data structures were modified from multiple threads.
The patch ensures LocalPool state is modified by the owning thread
only.

The following sequence triggered the bug:
a) The app asks the BufferPool for a tiny chunk on Thread A.
We get the chunk from the thread local LocalPool A, we set
its recycler to the parent LocalPool and owner to its tinyPool.
b) Eventually the tiny chunk gets evicted. Now its owner is set
to null, but the chunk is not free yet, so nothing more happens.
c) In the meantime buffer(s) of the tiny chunk get(s)
transferred to Thread B on the client side.
d) Eventually the client releases the last buffer of the tiny chunk
by calling BufferPool#put, but it does that on Thread B.
e) BufferPool grabs a reference to the thread local LocalPool B and
calls put on it. So far all correct. We are returning the buffer
to the LocalPool B, owned by Thread B.
f) First we call free to mark appropriate freeSlots of the chunk.
It is the last buffer, so all bits turn to ones (freeSlots == -1).
The chunk has been evicted, so its owner == null.
We're not the owner, so we don't need to remove the chunk from
the micro queue (no need because it is not there) and we can
give it back to the pool we got the chunk from.
So we call chunk.tryRecycle.
g) chunk.tryRecycle realizes freeSlots == -1, CAS succeeds,
now freeSlots == 0 and the chunk needs to be recycled.
h) chunk.recycle gets called and calls chunk.recycler.recycle()
underneath. But the chunk was allocated by thread A,
so chunk.recycler points to LocalPool A owned by Thread A.
Here we enter the context of the wrong thread!
i) chunk.recycler.recycle() gets the parent chunk and its slab
and calls put again. Notice we're calling put the second time now.
We're calling it on a LocalPool A, but we are still on Thread B.
Danger!
j) LocalPool A is the owner of chunk's slab we got the buffer
from in step b). So owner == this.
k) Now, we're additionally unlucky, and it turns out this was
the last buffer of the normal chunk as well.
So we have free = 0 and owner == this, the first condition after
free is true, so we remove the chunk from the micro queue.
The micro queue is managed by Thread A, and we're
doing this on Thread B. Here we crash.

 

BTW: You will not see this path in cassandra-3.x branches, because the 
`recycler` thing and tinypool was added on cassandra-4.0. So the bug in the 
BufferPool applies to 4.0 only.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-16681
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Assignee: Piotr Kolaczkowski
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x, 4.x
>
>         Attachments: 0001-Fix-memoryInUse-counter-in-BufferPool.patch, 
> 0002-Multiple-fixes-in-BufferPool-and-LongBufferPoolTest.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to