[
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898081#comment-17898081
]
Jaydeepkumar Chovatia edited comment on CASSANDRA-19564 at 11/14/24 12:48 AM:
------------------------------------------------------------------------------
[~benedict] [~gauravapiscean]
Copying the Memtable is going to be very tricky and, in my opinion, might lead to more corner cases down the road. How about a third option: don't wait indefinitely in {_}MemtableAllocator.java{_}? The core problem is that the thread in _MemtableAllocator.java_ waits indefinitely to acquire memory. Instead of letting it block forever, we can introduce a timeout and, if the wait is unsuccessful, let that operation fail.
That way, when the _Memtable_ is almost full, the _MutationStage_ and _Compaction_ threads will eventually time out and unblock *MemtableReclaimMemory*. Once *MemtableReclaimMemory* is unblocked, it will free up space, and subsequent _MutationStage_ and _Compaction_ operations will succeed.
In short, bail out of the compaction and mutation tasks after some interval to break the deadlock. The change to _MemtableAllocator.java_ would look as follows:
{code:java}
--- a/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java
+++ b/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java
@@ -28,6 +28,10 @@ import com.codahale.metrics.Timer;
 import org.apache.cassandra.utils.concurrent.OpOrder;
 import org.apache.cassandra.utils.concurrent.WaitQueue;
 
+import static java.util.concurrent.TimeUnit.MILLISECONDS;
+import static java.util.concurrent.TimeUnit.SECONDS;
+import static org.apache.cassandra.utils.Clock.Global.nanoTime;
+
 public abstract class MemtableAllocator
 {
     private static final Logger logger = LoggerFactory.getLogger(MemtableAllocator.class);
@@ -192,7 +196,7 @@ public abstract class MemtableAllocator
                 return;
             }
             else
-                signal.awaitThrowUncheckedOnInterrupt();
+                signal.awaitUntilThrowUncheckedOnInterrupt(nanoTime() + SECONDS.toNanos(5)); // TODO: introduce a new timeout configuration or use the write timeout
         }
     }
{code}
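To make the intent concrete, below is a rough, self-contained sketch (not the actual Cassandra code; the class name, the {{Semaphore}} standing in for the memtable memory pool, and the exception are made up for illustration) of how a bounded wait lets a blocked allocation fail instead of blocking forever:
{code:java}
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only -- not Cassandra's MemtableAllocator. A Semaphore stands in
// for the memtable memory pool, and the bounded tryAcquire stands in for the proposed
// awaitUntilThrowUncheckedOnInterrupt(deadline) call.
public class BoundedAllocationSketch
{
    private final Semaphore memoryPool;

    public BoundedAllocationSketch(int permits)
    {
        this.memoryPool = new Semaphore(permits);
    }

    // Try to reserve `size` units of memory, but give up after the timeout instead of waiting
    // forever. Failing the caller (a MutationStage or Compaction task in the real system)
    // releases the thread so memory reclamation can make progress.
    public void allocate(int size, long timeout, TimeUnit unit) throws InterruptedException
    {
        if (!memoryPool.tryAcquire(size, timeout, unit))
            throw new IllegalStateException("Memtable allocation timed out after " + timeout + " " + unit);
    }

    public static void main(String[] args) throws InterruptedException
    {
        BoundedAllocationSketch allocator = new BoundedAllocationSketch(10);
        allocator.allocate(4, 5, TimeUnit.SECONDS);  // succeeds: pool has room
        allocator.allocate(8, 5, TimeUnit.SECONDS);  // fails after ~5s: only 6 units remain
    }
}
{code}
The real change would, of course, need to surface the expired wait as a write timeout (or a new exception type) so the failed mutation or compaction task is handled cleanly rather than killing the thread.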
wdyt?
> MemtablePostFlush deadlock leads to stuck nodes and crashes
> -----------------------------------------------------------
>
> Key: CASSANDRA-19564
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction, Local/Memtable
> Reporter: Jon Haddad
> Priority: Urgent
> Fix For: 4.1.x
>
> Attachments: image-2024-04-16-11-55-54-750.png,
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png,
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png,
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png,
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up
> due to what I believe is a deadlock in memtable flushing. Here's what I know
> so far. I've stitched together what happened based on conversations, logs,
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19.
> {noformat}
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks', ColumnFamily='version') to free up room. Used total: 0.24/0.33, live: 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012 - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, the MemtablePostFlush completed-task count stops incrementing, active stays at 1, and pending starts to rise.
> {noformat}
> MemtablePostFlush 1 1 3446 0 0
> {noformat}
>
> The flame graph reveals that PostFlush.call is stuck. I don't have the line
> number, but I know we're stuck in
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are parked waiting on {{OpOrder.Barrier.await}}. A 30-second wall-clock profile reveals that all time is spent here. Presumably we're waiting on the single-threaded PostFlush.
> !image-2024-04-16-12-29-15-386.png!
> *Memtable allocations start to block*
> Eventually it looks like the NativeAllocator stops successfully allocating
> memory. I assume it's waiting on memory to be freed, but since memtable
> flushes are blocked, we wait indefinitely.
> Looking at a wall-clock flame graph, all writer threads have reached the allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're waiting on {{signal.awaitThrowUncheckedOnInterrupt()}}.
> {noformat}
> MutationStage 48 828425 980253369 0 0{noformat}
> !image-2024-04-16-11-55-54-750.png!
>
> *Compaction Stops*
> Since we write to the compaction history table, and that requires memtables,
> compactions are now blocked as well.
>
> !image-2024-04-16-13-53-24-455.png!
>
> The node is now doing basically nothing and must be restarted.