ConfX created CASSANDRA-21128:
---------------------------------
Summary: CompactionManager.submitMaximal crash due to node restart
race condition
Key: CASSANDRA-21128
URL: https://issues.apache.org/jira/browse/CASSANDRA-21128
Project: Apache Cassandra
Issue Type: Bug
Reporter: ConfX
`CompactionManager.submitMaximal()` does not handle the case where
`getMaximalTasks()` returns `null`, leading to a `NullPointerException` when
calling `tasks.isEmpty()`.
This can happen during a node restart:
1. An async compaction task may be executing `submitMaximal()`.
2. The restart leaves some compactions uninterruptible
(e.g., compactions that could not be properly cancelled during shutdown).
3. `runWithCompactionsDisabled()` finds these uninterruptible tasks and returns
`null`.
4. `submitMaximal()` does not check for `null`, resulting in an NPE.
{code:java}
Timeline A (NPE occurs):
─────────────────────────────────────────────────────────────────────────────
[Async Task]  submitMaximal() ──> getMaximalTasks() ──> runWithCompactionsDisabled()
                  │
[Main Thread] upgradesstables ───────────────────────────────────────────────
                  │
[Restart]     ─────────────────────────────────────────── RESTART TRIGGERED
                  │
              Shutdown interrupts compactions, but some remain "uninterruptible"
                  │
              runWithCompactionsDisabled() returns NULL
                  │
              tasks.isEmpty() → NPE!
{code}
{code:java}
java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()" because "tasks" is null
    at org.apache.cassandra.utils.concurrent.AbstractFuture.getWhenDone(AbstractFuture.java:239)
    at org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:246)
    at ...UpgradeSSTablesTest_RestartInjected.upgradeSSTablesInterruptsOngoingCompaction(line:118)
    ...
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()" because "tasks" is null
    at org.apache.cassandra.db.compaction.CompactionManager.submitMaximal(CompactionManager.java:1001)
    at ...UpgradeSSTablesTest_RestartInjected.lambda$...upgradeSSTablesInterruptsOngoingCompaction$...(line:103)
{code}
Buggy part:
{code:java}
public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final long gcBefore, boolean splitOutput, OperationType operationType)
{
    // here we compute the task off the compaction executor, so having that present doesn't
    // confuse runWithCompactionsDisabled -- i.e., we don't want to deadlock ourselves, waiting
    // for ourselves to finish/acknowledge cancellation before continuing.
    CompactionTasks tasks = cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput, operationType);
    if (tasks.isEmpty()) // NPE here when tasks is null
        return Collections.emptyList();
    // ...
}
{code}
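To illustrate why the test sees an `ExecutionException` wrapping the NPE, here is a minimal, self-contained sketch. It does not use any Cassandra classes: `NullTasksRepro`, `getMaximalTasksStub()`, and `submitMaximalUnguarded()` are hypothetical stand-ins that only mimic the unguarded shape of the code above, under the assumption that the task lookup returned `null`.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class NullTasksRepro
{
    // Hypothetical stand-in for getMaximalTasks() when compactions could not be cancelled.
    static List<Runnable> getMaximalTasksStub()
    {
        return null;
    }

    // Mimics the unguarded shape of submitMaximal(): the result is dereferenced directly.
    static Integer submitMaximalUnguarded()
    {
        List<Runnable> tasks = getMaximalTasksStub();
        if (tasks.isEmpty()) // throws NullPointerException because tasks is null
            return 0;
        return tasks.size();
    }

    // Runs the unguarded call on another thread and returns the cause the caller observes
    // when it calls Future.get().
    static Throwable causeSeenByCaller()
    {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try
        {
            Callable<Integer> call = NullTasksRepro::submitMaximalUnguarded;
            Future<Integer> future = executor.submit(call);
            future.get();
            return null; // not reached in this sketch
        }
        catch (ExecutionException e)
        {
            return e.getCause(); // the NPE thrown on the worker thread
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        finally
        {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args)
    {
        System.out.println(causeSeenByCaller().getClass().getName());
    }
}
```

Running `main` prints `java.lang.NullPointerException`, which matches the `Caused by:` section of the trace above: the NPE is thrown on the async thread and only surfaces on the calling thread through `Future.get()`.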
h3. Why `getMaximalTasks()` Can Return Null
The `getMaximalTasks()` method in `CompactionStrategyManager` uses
`runWithCompactionsDisabled()` internally:
{code:java}
// CompactionStrategyManager.java:1076-1100
public CompactionTasks getMaximalTasks(final long gcBefore, final boolean splitOutput, OperationType operationType)
{
    maybeReloadDiskBoundaries();
    return cfs.runWithCompactionsDisabled(() -> {
        // ... create tasks
        return CompactionTasks.create(tasks);
    }, operationType, false, false);
}
{code}
The `runWithCompactionsDisabled()` method in `ColumnFamilyStore` can return
`null` in two scenarios:
1. When there are uninterruptible higher-priority compactions
(ColumnFamilyStore.java:2850-2858):
{code:java}
if (!uninterruptibleTasks.isEmpty())
{
    logger.info("Unable to cancel in-progress compactions, since they're running with higher or same priority: {}...");
    return null; // Returns null here!
}
{code}
2. When compaction cannot be interrupted due to timeout
(ColumnFamilyStore.java:2868-2873):
{code:java}
if (cfs.getTracker().getCompacting().stream().anyMatch(sstablesPredicate))
{
    logger.warn("Unable to cancel in-progress compactions for {}...");
    return null; // Returns null here too!
}
{code}
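The two null-returning paths can be summarised with a small model. This is only a sketch of the contract as described above, not the real `ColumnFamilyStore` code: the class name, the simplified signature, and the `uninterruptibleTasks` / `stillCompactingAfterWait` parameters are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.Callable;

class CompactionsDisabledSketch
{
    // Hypothetical, simplified model: run the callable only if every in-progress
    // compaction could be cancelled; otherwise return null to the caller.
    static <V> V runWithCompactionsDisabled(Callable<V> callable,
                                            List<String> uninterruptibleTasks,
                                            boolean stillCompactingAfterWait)
    {
        // Scenario 1: some compactions run with higher or same priority and cannot be cancelled.
        if (!uninterruptibleTasks.isEmpty())
            return null;

        // Scenario 2: cancellation was requested but compactions are still running after the wait.
        if (stillCompactingAfterWait)
            return null;

        try
        {
            return callable.call();
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }
}
```

In this model, any caller that passes the result straight to a method call (as `submitMaximal()` does with `tasks.isEmpty()`) has to treat `null` as a possible outcome rather than an error inside the callable.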
h3. Proposed Fix
Other call sites already guard against a `null` result. For example, in
`PendingRepairHolder.getMaximalTasks()`:
{code:java}
// PendingRepairHolder.java:130-132
Collection<AbstractCompactionTask> task = manager.getMaximalTasks(gcBefore, splitOutput);
if (task != null) // Proper null check!
    tasks.addAll(task);
{code}
So the fix is simply to add a null check before calling `isEmpty()`:
{code:java}
public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final long gcBefore, boolean splitOutput, OperationType operationType)
{
    CompactionTasks tasks = cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput, operationType);
    if (tasks == null || tasks.isEmpty()) // Added null check
        return Collections.emptyList();
    // ... rest of the method
}
{code}
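The behaviour of the guard can be checked in isolation with a self-contained sketch. `GuardedSubmitSketch` and its `submitMaximal(List<String>)` are hypothetical stand-ins (plain strings replace `CompactionTasks` and `Future<?>`); the only thing the sketch demonstrates is that a `null` input and an empty input both yield an empty result instead of an NPE.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class GuardedSubmitSketch
{
    // Hypothetical stand-in for submitMaximal(): takes the (possibly null) task
    // collection directly and returns one placeholder "future" per task.
    static List<String> submitMaximal(List<String> tasks)
    {
        // Same guard as the proposed fix: null is treated like "nothing to do".
        if (tasks == null || tasks.isEmpty())
            return Collections.emptyList();

        List<String> futures = new ArrayList<>();
        for (String task : tasks)
            futures.add("submitted:" + task);
        return futures;
    }
}
```

With this guard, a caller cannot distinguish "no tasks were needed" from "compactions could not be cancelled"; whether that distinction should also be logged or reported is a separate question from the NPE itself.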
I'm happy to provide the PR for this issue.