ConfX created CASSANDRA-21128:
---------------------------------

             Summary: CompactionManager.submitMaximal crash due to node restart 
race condition
                 Key: CASSANDRA-21128
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21128
             Project: Apache Cassandra
          Issue Type: Bug
            Reporter: ConfX


`CompactionManager.submitMaximal()` does not handle the case where 
`getMaximalTasks()` returns `null`, leading to a `NullPointerException` when 
calling `tasks.isEmpty()`.
 
This would happen during a node restart:
1. An async compaction task may be executing `submitMaximal()`
2. The restart causes conditions where there are uninterruptible compactions 
(e.g., compactions that couldn't be properly cancelled during shutdown)
3. `runWithCompactionsDisabled()` finds these uninterruptible tasks and returns 
`null`
4. `submitMaximal()` doesn't check for `null`, resulting in NPE
 
{code:java}
Timeline A (NPE occurs):
─────────────────────────────────────────────────────────────────────────────
[Async Task]     submitMaximal() ──> getMaximalTasks() ──> 
runWithCompactionsDisabled()
                                                                    │
[Main Thread]    upgradesstables 
──────────────────────────────────────────────────────
                                                                    │
[Restart]        ─────────────────────────────────────────── RESTART TRIGGERED
                                                                    │
                                                           Shutdown interrupts
                                                           compactions, but some
                                                           remain 
"uninterruptible"
                                                                    │
                                                           
runWithCompactionsDisabled()
                                                           returns NULL
                                                                    │
                                                           tasks.isEmpty() → 
NPE! {code}
{code:java}
java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot 
invoke "org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()" because 
"tasks" is null
    at 
org.apache.cassandra.utils.concurrent.AbstractFuture.getWhenDone(AbstractFuture.java:239)
    at 
org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:246)
    at 
...UpgradeSSTablesTest_RestartInjected.upgradeSSTablesInterruptsOngoingCompaction(line:118)
    ...
Caused by: java.lang.NullPointerException: Cannot invoke 
"org.apache.cassandra.db.compaction.CompactionTasks.isEmpty()" because "tasks" 
is null
    at 
org.apache.cassandra.db.compaction.CompactionManager.submitMaximal(CompactionManager.java:1001)
    at 
...UpgradeSSTablesTest_RestartInjected.lambda$...upgradeSSTablesInterruptsOngoingCompaction$...(line:103)
 {code}
Buggy part:
{code:java}
public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final 
long gcBefore, boolean splitOutput, OperationType operationType)
{
    // here we compute the task off the compaction executor, so having that 
present doesn't
    // confuse runWithCompactionsDisabled -- i.e., we don't want to deadlock 
ourselves, waiting
    // for ourselves to finish/acknowledge cancellation before continuing.
    CompactionTasks tasks = 
cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput, 
operationType);    
    if (tasks.isEmpty())  // NPE here when tasks is null
        return Collections.emptyList();
    // ...
} {code}
h3. Why `getMaximalTasks()` Can Return Null
The `getMaximalTasks()` method in `CompactionStrategyManager` uses 
`runWithCompactionsDisabled()` internally:
{code:java}
// CompactionStrategyManager.java:1076-1100
public CompactionTasks getMaximalTasks(final long gcBefore, final boolean 
splitOutput, OperationType operationType)
{
    maybeReloadDiskBoundaries();
    return cfs.runWithCompactionsDisabled(() -> {
        // ... create tasks
        return CompactionTasks.create(tasks);
    }, operationType, false, false);
} {code}
The `runWithCompactionsDisabled()` method in `ColumnFamilyStore` can return 
`null` in two scenarios:
1. When there are uninterruptible higher-priority compactions 
(ColumnFamilyStore.java:2850-2858):
{code:java}
if (!uninterruptibleTasks.isEmpty())
{
    logger.info("Unable to cancel in-progress compactions, since they're 
running with higher or same priority: {}...");
    return null;  // Returns null here!
} {code}
2. When compaction cannot be interrupted due to timeout 
(ColumnFamilyStore.java:2868-2873):
{code:java}
if (cfs.getTracker().getCompacting().stream().anyMatch(sstablesPredicate))
{
    logger.warn("Unable to cancel in-progress compactions for {}...");
    return null;  // Returns null here too!
} {code}
h3.  
h3. Proposed Fix

I see there are indeed null checker for other usage. For example, in 
`PendingRepairHolder.getMaximalTasks()`:

 
{code:java}
// PendingRepairHolder.java:130-132
Collection<AbstractCompactionTask> task = manager.getMaximalTasks(gcBefore, 
splitOutput);
if (task != null)  // Proper null check!
    tasks.addAll(task); {code}
So the fix would just simply add a null check before calling isEmpty():
 

 
{code:java}
public List<Future<?>> submitMaximal(final ColumnFamilyStore cfStore, final 
long gcBefore, boolean splitOutput, OperationType operationType)
{
    CompactionTasks tasks = 
cfStore.getCompactionStrategyManager().getMaximalTasks(gcBefore, splitOutput, 
operationType);    
    if (tasks == null || tasks.isEmpty())  // Added null check
        return Collections.emptyList();    // ... rest of the method
} {code}
I'm happy to provide the PR for this issue.
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to