[
https://issues.apache.org/jira/browse/CASSANDRA-19564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17898081#comment-17898081
]
Jaydeepkumar Chovatia edited comment on CASSANDRA-19564 at 11/14/24 12:47 AM:
------------------------------------------------------------------------------
[~benedict] [~gauravapiscean]
Copying the Memtable is going to be very tricky and might lead us to more
corner cases down the road, in my opinion. How about a third option: do not
wait indefinitely in _MemtableAllocator.java_? The core problem is that the
thread in _MemtableAllocator.java_ waits indefinitely to acquire memory.
Instead of letting it wait forever, we can introduce a timeout and, if the
allocation is still unsuccessful, fail that operation.
That way, when the _Memtable_ is almost full, the _MutationStage_ and _Compaction_
threads will eventually time out and unblock *MemtableReclaimMemory*. Once
_MemtableReclaimMemory_ is unblocked, it will free up more space, and subsequent
_MutationStage_ and _Compaction_ operations will succeed.
In short, bail out of the compaction and mutation tasks after some interval to
break the deadlock. If we go this route, the change to _MemtableAllocator.java_
would look roughly as follows:
{code:java}
--- a/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java
+++ b/src/java/org/apache/cassandra/utils/memory/MemtableAllocator.java
@@ -28,6 +28,10 @@ import com.codahale.metrics.Timer;
import org.apache.cassandra.utils.concurrent.OpOrder;
import org.apache.cassandra.utils.concurrent.WaitQueue;
+import static java.util.concurrent.TimeUnit.MILLISECONDS;
+import static java.util.concurrent.TimeUnit.SECONDS;
+import static org.apache.cassandra.utils.Clock.Global.nanoTime;
+
public abstract class MemtableAllocator
{
private static final Logger logger = LoggerFactory.getLogger(MemtableAllocator.class);
@@ -192,7 +196,7 @@ public abstract class MemtableAllocator
return;
}
else
- signal.awaitThrowUncheckedOnInterrupt();
+ signal.awaitUntilThrowUncheckedOnInterrupt(nanoTime() + SECONDS.toNanos(5)); // TODO: introduce a new timeout configuration or use the write timeout
}
}
{code}
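For completeness: the timed wait above only unparks the thread; the retry loop around it would also need to give up once the deadline passes so the mutation/compaction actually fails rather than looping back into another wait. Below is a self-contained sketch of that follow-through. {{MemoryPool}}, its methods, and the exception choice are hypothetical stand-ins of mine, not Cassandra's real classes; the real patch would surface this as an appropriate timeout exception.
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the "wait with a deadline, then fail" pattern the diff above aims for.
 * MemoryPool and its methods are hypothetical stand-ins, not Cassandra's real classes.
 */
public final class TimedAllocationSketch
{
    static final long ALLOCATION_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(5); // mirrors the 5s in the diff

    /** Hypothetical memtable memory pool with a fixed capacity. */
    static final class MemoryPool
    {
        private final AtomicLong free;

        MemoryPool(long capacityBytes) { free = new AtomicLong(capacityBytes); }

        /** Try to reserve bytes; returns false when the pool is exhausted. */
        boolean tryAllocate(long bytes)
        {
            long cur;
            do
            {
                cur = free.get();
                if (cur < bytes)
                    return false;
            }
            while (!free.compareAndSet(cur, cur - bytes));
            return true;
        }

        /** Return bytes to the pool and wake waiters (what a completed flush would do). */
        synchronized void release(long bytes)
        {
            free.addAndGet(bytes);
            notifyAll();
        }

        /** Park until space may be available or the deadline passes. */
        synchronized void awaitSpace(long deadlineNanos) throws InterruptedException
        {
            long waitMillis = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
            if (waitMillis > 0)
                wait(waitMillis);
        }
    }

    /**
     * Retry while the deadline has not passed; once it has, throw instead of
     * parking forever, so the reclaim thread (and the flush behind it) can make progress.
     */
    static void allocate(MemoryPool pool, long bytes) throws InterruptedException
    {
        long deadline = System.nanoTime() + ALLOCATION_TIMEOUT_NANOS;
        while (!pool.tryAllocate(bytes))
        {
            if (System.nanoTime() >= deadline)
                throw new IllegalStateException("Timed out waiting for memtable space; failing the operation");
            pool.awaitSpace(deadline);
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        MemoryPool pool = new MemoryPool(1 << 20);   // 1 MiB pool
        allocate(pool, 512 * 1024);                  // succeeds immediately
        try
        {
            allocate(pool, 768 * 1024);              // cannot fit; times out after ~5s and throws
        }
        catch (IllegalStateException e)
        {
            System.out.println(e.getMessage());
        }
    }
}
{code}
Making the interval configurable (per the TODO in the diff) rather than hard-coding 5 seconds would let operators tune how aggressively writes are shed once the pool is exhausted.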
wdyt?
> MemtablePostFlush deadlock leads to stuck nodes and crashes
> -----------------------------------------------------------
>
> Key: CASSANDRA-19564
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19564
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction, Local/Memtable
> Reporter: Jon Haddad
> Priority: Urgent
> Fix For: 4.1.x
>
> Attachments: image-2024-04-16-11-55-54-750.png,
> image-2024-04-16-12-29-15-386.png, image-2024-04-16-13-43-11-064.png,
> image-2024-04-16-13-53-24-455.png, image-2024-04-17-18-46-29-474.png,
> image-2024-04-17-19-13-06-769.png, image-2024-04-17-19-14-34-344.png,
> screenshot-1.png
>
>
> I've run into an issue on a 4.1.4 cluster where an entire node has locked up
> due to what I believe is a deadlock in memtable flushing. Here's what I know
> so far. I've stitched together what happened based on conversations, logs,
> and some flame graphs.
> *Log reports memtable flushing*
> The last successful flush happens at 12:19.
> {noformat}
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634
> AbstractAllocatorMemtable.java:286 - Flushing largest CFS(Keyspace='ks',
> ColumnFamily='version') to free up room. Used total: 0.24/0.33, live:
> 0.16/0.20, flushing: 0.09/0.13, this: 0.13/0.15
> INFO [NativePoolCleaner] 2024-04-16 12:19:53,634 ColumnFamilyStore.java:1012
> - Enqueuing flush of ks.version, Reason: MEMTABLE_LIMIT, Usage: 660.521MiB
> (13%) on-heap, 790.606MiB (15%) off-heap
> {noformat}
> *MemtablePostFlush appears to be blocked*
> At this point, MemtablePostFlush completed tasks stops incrementing, active
> stays at 1 and pending starts to rise.
> {noformat}
> MemtablePostFlush 1 1 3446 0 0
> {noformat}
>
> The flame graph reveals that PostFlush.call is stuck. I don't have the line
> number, but I know we're stuck in
> {{org.apache.cassandra.db.ColumnFamilyStore.PostFlush#call}} given the visual
> below:
> *!image-2024-04-16-13-43-11-064.png!*
> *Memtable flushing is now blocked.*
> All MemtableFlushWriter threads are parked waiting on
> {{OpOrder.Barrier.await}}. A wall clock profile of 30s reveals all time is
> spent here. Presumably we're waiting on the single-threaded Post Flush.
> !image-2024-04-16-12-29-15-386.png!
> *Memtable allocations start to block*
> Eventually it looks like the NativeAllocator stops successfully allocating
> memory. I assume it's waiting on memory to be freed, but since memtable
> flushes are blocked, we wait indefinitely.
> Looking at a wall clock flame graph, all writer threads have reached the
> allocation failure path of {{MemtableAllocator.allocate()}}. I believe we're
> waiting on {{signal.awaitThrowUncheckedOnInterrupt()}}.
> {noformat}
> MutationStage 48 828425 980253369 0 0{noformat}
> !image-2024-04-16-11-55-54-750.png!
>
> *Compaction Stops*
> Since we write to the compaction history table, and that requires memtables,
> compactions are now blocked as well.
>
> !image-2024-04-16-13-53-24-455.png!
>
> The node is now doing basically nothing and must be restarted.