[jira] [Comment Edited] (CASSANDRA-19564) MemtablePostFlush deadlock leads to stuck nodes and crashes

Benedict Elliott Smith (Jira) Fri, 22 Nov 2024 05:46:19 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900376#comment-17900376
 ]


Benedict Elliott Smith edited comment on CASSANDRA-19564 at 11/22/24 1:44 PM:
------------------------------------------------------------------------------

The change looks reasonable to me, but I won't have time to shepherd this 
change through. It should also be noted that this could have some detrimental 
GC/heap impact for some workloads, if the pages for indexing are large. I don't 
have a particular opinion about whether this should be examined before merging.

 

Having taken another quick look, I think the likelihood of detrimental impact 
is minimal. The only data we copy to the heap are native/offheap memtable 
entries, and all other data are already considered safe. We just might hold on 
to memtable slabs slightly longer, and sstable data are already allocated to 
heap buffers. So we just materialise the rows all at once, which should be 
fine if the page size is reasonable.

I do slightly question whether it's fine that we leak direct references to 
slab buffers when reading from Memtables, but... we do. Whether that is 
acceptable for this patch, given that it may block while holding onto them and 
so cause heap pressure, is a separate question on which I don't have a strong 
opinion. We could explicitly copy the buffers to new heap buffers to avoid 
this issue.
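
To illustrate the last point, here is a minimal sketch of copying a 
slab-backed buffer into a fresh heap buffer. It uses only java.nio; the class 
HeapCopy and the helper copyToHeap are hypothetical names for illustration and 
are not part of Cassandra or of the patch under review.

```java
import java.nio.ByteBuffer;

public class HeapCopy {
    // Hypothetical helper (not a Cassandra API): returns a heap-backed copy
    // of the readable bytes of src, leaving src's position and limit alone.
    static ByteBuffer copyToHeap(ByteBuffer src) {
        ByteBuffer copy = ByteBuffer.allocate(src.remaining());
        copy.put(src.duplicate()); // duplicate() shares content, not position
        copy.flip();               // make the copied bytes readable
        return copy;
    }

    public static void main(String[] args) {
        // A direct buffer stands in for a region of an off-heap slab.
        ByteBuffer slab = ByteBuffer.allocateDirect(16);
        slab.putLong(42L).putLong(7L).flip();

        ByteBuffer row = copyToHeap(slab);
        System.out.println(row.isDirect() + " " + row.getLong(0) + " " + slab.remaining());
    }
}
```

The copy no longer refers to the slab, so under this assumption the slab could 
be released while the copied row is still held, at the cost of an extra heap 
allocation per buffer.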

I hope that we will retire slab allocators for most workloads before too long 
anyway, so I don't feel strongly about this particular edge case of our 
behaviour.


was (Author: benedict):
The change looks reasonable to me, but I won't have time to shepherd this 
change through. It should also be noted that this could have some detrimental 
GC/heap impact for some workloads, if the pages for indexing are large. I don't 
have a particular opinion about whether this should be examined before merging.

> MemtablePostFlush deadlock leads to stuck nodes and crashes
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-19564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction, Local/Memtable
>            Reporter: Jon Haddad
>            Assignee: Runtian Liu
>            Priority: Urgent
>             Fix For: 4.1.x
>
>         Attachments: image-2024-04-16-11-55-54-750.png, 
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png, 
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png, 
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png, 
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up 
> due to what I believe is a deadlock in memtable flushing. Here's what I know 
> so far.  I've stitched together what happened based on conversations, logs, 
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19. 
> {noformat}
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 
> AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', 
> ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 
> 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO  [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 
> - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB 
> (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, the MemtablePostFlush completed task count stops 
> incrementing, active stays at 1 and pending starts to rise.
> {noformat}
> MemtablePostFlush   1    1   3446   0   0
> {noformat}
>  
> The flame graph reveals that PostFlush.call is stuck.  I don't have the line 
> number, but I know we're stuck in 
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual 
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are Parked waiting on 
> {{OpOrder.Barrier.await}}. A wall clock profile of 30s reveals all time 
> is spent here.  Presumably we're waiting on the single-threaded Post Flush.
> !image-2024-04-16-12-29-15-386.png!
> *Memtable allocations start to block*
> Eventually it looks like the NativeAllocator stops successfully allocating 
> memory. I assume it's waiting on memory to be freed, but since memtable 
> flushes are blocked, we wait indefinitely.
> Looking at a wall clock flame graph, all writer threads have reached the 
> allocation failure path of {{MemtableAllocator.allocate()}}.  I believe we're 
> waiting on {{signal.awaitThrowUncheckedOnInterrupt()}}
> {noformat}
>  MutationStage    48    828425      980253369      0    0{noformat}
> !image-2024-04-16-11-55-54-750.png!
>  
> *Compaction Stops*
> Since we write to the compaction history table, and that requires memtables, 
> compactions are now blocked as well.
>  
> !image-2024-04-16-13-53-24-455.png!
>  
> The node is now doing basically nothing and must be restarted.
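
The stall pattern in the description above (one active task that never 
completes, with pending work piling up behind it) can be reproduced with a 
plain single-threaded executor. This is a simplified model under stated 
assumptions, not Cassandra code: the class PostFlushStall, the method stall 
and the latch that is never released are all hypothetical stand-ins.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PostFlushStall {
    // Hypothetical model: returns {active, pending} for a single-threaded
    // executor whose first task blocks on something that never happens.
    static int[] stall(int queuedTasks) {
        ThreadPoolExecutor postFlush = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            postFlush.execute(() -> {
                started.countDown();
                try {
                    neverReleased.await(); // stands in for the stuck call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await(); // wait until the blocking task is running
            for (int i = 0; i < queuedTasks; i++) {
                postFlush.execute(() -> { }); // queued behind the stuck task
            }
            return new int[] { postFlush.getActiveCount(), postFlush.getQueue().size() };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            postFlush.shutdownNow(); // interrupt the stuck task so the JVM can exit
        }
    }

    public static void main(String[] args) {
        int[] counts = stall(3);
        System.out.println("active=" + counts[0] + " pending=" + counts[1]);
    }
}
```

Anything that waits on work queued behind the blocked task, as the flush 
writers in the description appear to, would wait indefinitely in this model 
too.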



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
