[
https://issues.apache.org/jira/browse/CASSANDRA-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053750#comment-18053750
]
rstest commented on CASSANDRA-21128:
------------------------------------
Send a simple fix for this issue!
> CompactionManager.submitMaximal crash due to node restart race condition
> ------------------------------------------------------------------------
>
> Key: CASSANDRA-21128
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21128
> Project: Apache Cassandra
> Issue Type: Bug
> Reporter: rstest
> Priority: Normal
> Time Spent: 10m
> Remaining Estimate: 0h
>
> `CompactionManager.submitMaximal()` does not handle the case where
> `getMaximalTasks()` returns `null`, leading to a `NullPointerException` when
> calling `tasks.isEmpty()`.
>
> This would happen during a node restart:
> 1. An async compaction task may be executing `submitMaximal()`
> 2. The restart causes conditions where there are uninterruptible compactions
> (e.g., compactions that couldn't be properly cancelled during shutdown)
> 3. `runWithCompactionsDisabled()` finds these uninterruptible tasks and
> returns `null`
> 4. `submitMaximal()` doesn't check for `null`, resulting in NPE
>
> {code:java}
> Timeline (NPE occurs):
> ─────────────────────────────────────────────────────────────────────────────
> [Async Task] submitMaximal() ──> getMaximalTasks() ──>
> runWithCompactionsDisabled()
> │
> [Main Thread] upgradesstables
> ──────────────────────────────────────────────────────
> │
> [Restart] ─────────────────────────────────────────── RESTART TRIGGERED
> │
> Shutdown interrupts
> compactions, but
> some
> remain
> "uninterruptible"
> │
>
> runWithCompactionsDisabled()
> returns NULL
> │
> tasks.isEmpty() →
> NPE! {code}
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.NullPointerException:
> Cannot invoke "org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()"
> because "tasks" is null
> at
> org.apache.cassandra.utils.concurrent.AbstractFuture.getWhenDone(AbstractFuture.java:239)
> at
> org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:246)
> at
> ...UpgradeSSTablesTest_RestartInjected.upgradeSSTablesInterruptsOngoingCompaction(line:118)
> ...
> Caused by: java.lang.NullPointerException: Cannot invoke
> "org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()" because
> "tasks" is null
> at
> org.apache.cassandra.db.compaction.CompactionManager.submitMaximal(CompactionManager.java:1001)
> at
> ...UpgradeSSTablesTest_RestartInjected.lambda$...upgradeSSTablesInterruptsOngoingCompaction$...(line:103)
> {code}
> Buggy part:
> {code:java}
> public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final
> long gcBefore, boolean splitOutput, OperationType operationType)
> {
> // here we compute the task off the compaction executor, so having that
> present doesn't
> // confuse runWithCompactionsDisabled -- i.e., we don't want to deadlock
> ourselves, waiting
> // for ourselves to finish/acknowledge cancellation before continuing.
> CompactionTasks tasks =
> cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput,
> operationType);
> if (tasks.isEmpty()) // NPE here when tasks is null
> return Collections.emptyList();
> // ...
> } {code}
> h3. Why `getMaximalTasks()` Can Return Null
> The `getMaximalTasks()` method in `CompactionStrategyManager` uses
> `runWithCompactionsDisabled()` internally:
> {code:java}
> // CompactionStrategyManager.java:1076-1100
> public CompactionTasks getMaximalTasks(final long gcBefore, final boolean
> splitOutput, OperationType operationType)
> {
> maybeReloadDiskBoundaries();
> return cfs.runWithCompactionsDisabled(() -> {
> // ... create tasks
> return CompactionTasks.create(tasks);
> }, operationType, false, false);
> } {code}
> The `runWithCompactionsDisabled()` method in `ColumnFamilyStore` can return
> `null` in two scenarios:
> 1. When there are uninterruptible higher-priority compactions
> (ColumnFamilyStore.java:2850-2858):
> {code:java}
> if (!uninterruptibleTasks.isEmpty())
> {
> logger.info("Unable to cancel in-progress compactions, since they're
> running with higher or same priority: {}...");
> return null; // Returns null here!
> } {code}
> 2. When compaction cannot be interrupted due to timeout
> (ColumnFamilyStore.java:2868-2873):
> {code:java}
> if (cfs.getTracker().getCompacting().stream().anyMatch(sstablesPredicate))
> {
> logger.warn("Unable to cancel in-progress compactions for {}...");
> return null; // Returns null here too!
> } {code}
> h3.
> h3. Proposed Fix
> I see there are indeed null checker for other usage. For example, in
> `PendingRepairHolder.getMaximalTasks()`:
>
> {code:java}
> // PendingRepairHolder.java:130-132
> Collection<AbstractCompactionTask> task = manager.getMaximalTasks(gcBefore,
> splitOutput);
> if (task != null) // Proper null check!
> tasks.addAll(task); {code}
> So the fix would just simply add a null check before calling isEmpty():
>
>
> {code:java}
> public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final
> long gcBefore, boolean splitOutput, OperationType operationType)
> {
> CompactionTasks tasks =
> cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput,
> operationType);
> if (tasks == null || tasks.isEmpty()) // Added null check
> return Collections.emptyList(); // ... rest of the method
> } {code}
> I'm happy to provide the PR for this issue.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]