[
https://issues.apache.org/jira/browse/CASSANDRA-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gil Ganz updated CASSANDRA-21217:
---------------------------------
Description:
I am running decommission on a 4.1.5 cluster, and the decommission fails to
complete due to an error regarding hints. It runs for quite some time, streaming
what appears to be all data, but then fails due to this error (which happened
quite early in the decommission process).
I got this in 3 separate cases (this is an env that is spread across the world,
so network hiccups are common).
I was able to work around this by setting transfer_hints_on_decommission: false,
but I think the code that handles hints in the decommission path should not
fail on a missing file. It can just log a warning, and not require me to give up
transferring hints altogether.
INFO [NonPeriodicTasks:1] 2026-03-12 18:13:47,148 StreamResultFuture.java:252 - [Stream #24728df0-1e20-11f1-be78-9bd75fd01983] All sessions completed
ERROR [RMI TCP Connection(3338753)-127.0.0.1] 2026-03-12 18:13:47,149 StorageService.java:5017 - Error while decommissioning node
java.lang.RuntimeException: java.nio.file.NoSuchFileException: /var/lib/cassandra/data/disk1/hints/5da9d583-259e-425f-a0ad-18b7e744dabc-1744041563101-2.hints
    at org.apache.cassandra.io.util.ChannelProxy.openChannel(ChannelProxy.java:54)
    at org.apache.cassandra.io.util.ChannelProxy.<init>(ChannelProxy.java:65)
    at org.apache.cassandra.hints.ChecksummedDataInput.open(ChecksummedDataInput.java:76)
    at org.apache.cassandra.hints.HintsReader.open(HintsReader.java:78)
    at org.apache.cassandra.hints.HintsDispatcher.create(HintsDispatcher.java:79)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.deliver(HintsDispatchExecutor.java:290)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:277)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(HintsDispatchExecutor.java:255)
    at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.run(HintsDispatchExecutor.java:234)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3603)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
    at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.transfer(HintsDispatchExecutor.java:196)
    at org.apache.cassandra.hints.HintsDispatchExecutor$TransferHintsTask.run(HintsDispatchExecutor.java:169)
Proposed fix
1. Catch missing file in {{deliver()}} ({{HintsDispatchExecutor.java}} ~line 287)
Wrap the {{HintsDispatcher.create()}} call to catch {{RuntimeException}}
caused by {{NoSuchFileException}}. If the file is gone, the hints were
already successfully dispatched, so delete the descriptor from the store and
return {{true}}.
{code:java}
private boolean deliver(HintsDescriptor descriptor, InetAddressAndPort address)
{
    File file = descriptor.file(hintsDirectory);
    if (!file.exists())
    {
        logger.info("Hints file {} was already dispatched, skipping", file);
        store.cleanUp(descriptor);
        return true;
    }
    // ... existing code
}
{code}
Note: a simple {{file.exists()}} pre-check narrows the window but does not
eliminate the TOCTOU race. The {{try-catch}} around
{{HintsDispatcher.create()}} is still needed as a backstop:
{code:java}
try (HintsDispatcher dispatcher = HintsDispatcher.create(file, rateLimiter, address, descriptor.hostId, shouldAbort))
{
    // ... existing dispatch logic
}
catch (RuntimeException e)
{
    if (Throwables.getRootCause(e) instanceof NoSuchFileException)
    {
        logger.info("Hints file {} disappeared during dispatch, treating as already dispatched", file);
        store.cleanUp(descriptor);
        return true;
    }
    throw e;
}
{code}
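The backstop pattern can be tried outside Cassandra with a minimal, self-contained sketch. Everything here is hypothetical and only for illustration: {{MissingFileBackstop}}, {{open()}} and {{rootCause()}} are stand-ins, with {{open()}} playing the role of {{HintsDispatcher.create()}} (it wraps the {{IOException}} in a {{RuntimeException}}) and {{rootCause()}} replacing Guava's {{Throwables.getRootCause()}} so that the example needs only the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MissingFileBackstop
{
    /** Walks the cause chain to its end (JDK-only stand-in for Guava's Throwables.getRootCause). */
    static Throwable rootCause(Throwable t)
    {
        Throwable root = t;
        while (root.getCause() != null && root.getCause() != root)
            root = root.getCause();
        return root;
    }

    /** Hypothetical stand-in for HintsDispatcher.create(): opens the file and wraps IOException unchecked. */
    static FileChannel open(Path file)
    {
        try
        {
            return FileChannel.open(file, StandardOpenOption.READ);
        }
        catch (IOException e)
        {
            throw new RuntimeException(e);
        }
    }

    /** Returns true when the file was processed or had already disappeared. */
    static boolean deliver(Path file)
    {
        try (FileChannel channel = open(file))
        {
            // ... the real code would read and dispatch the file contents here
            return true;
        }
        catch (RuntimeException e)
        {
            // Only a missing file is treated as "already done"; anything else is rethrown.
            if (rootCause(e) instanceof NoSuchFileException)
                return true;
            throw e;
        }
        catch (IOException e)
        {
            // FileChannel.close() may throw when the try-with-resources block exits.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path file = Files.createTempFile("example", ".hints");
        System.out.println(deliver(file)); // file present
        Files.delete(file);
        System.out.println(deliver(file)); // file gone, handled by the backstop
    }
}
```

Checking the root cause rather than the exception type keeps the catch narrow: other I/O failures still propagate instead of being silently treated as a completed dispatch.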
2. Prevent the race in {{transferHints()}} ({{HintsService.java}} ~line 440)
After {{completeDispatchBlockingly()}} at line 444, call {{pauseDispatch()}}
again before starting the transfer at line 446. This prevents
{{HintsDispatchTrigger}} from scheduling new normal dispatch tasks that race
with the transfer:
{code:java}
// current code at line 441:
resumeDispatch();
catalog.stores().forEach(dispatchExecutor::completeDispatchBlockingly);
// add:
pauseDispatch();
return dispatchExecutor.transfer(catalog, hostIdSupplier);
{code}
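The intent of the extra {{pauseDispatch()}} call can be modelled with a small toy class. This is a sketch under stated assumptions, not Cassandra code: {{PauseBeforeTransfer}}, {{triggerTick()}} and {{scheduledCount()}} are hypothetical names, and the model assumes the periodic trigger consults a pause flag before scheduling anything. It does not model dispatch tasks that are already in flight.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class PauseBeforeTransfer
{
    private final AtomicBoolean paused = new AtomicBoolean(false);
    private final AtomicInteger scheduled = new AtomicInteger();

    void pauseDispatch()  { paused.set(true); }
    void resumeDispatch() { paused.set(false); }

    /** Models one tick of the periodic trigger: schedules a dispatch task only while not paused. */
    void triggerTick()
    {
        if (paused.get())
            return;
        scheduled.incrementAndGet();
    }

    int scheduledCount() { return scheduled.get(); }

    public static void main(String[] args)
    {
        PauseBeforeTransfer service = new PauseBeforeTransfer();
        service.resumeDispatch();
        service.triggerTick();   // normal dispatch is scheduled while resumed
        service.pauseDispatch(); // the proposed extra call before the transfer
        service.triggerTick();   // ignored: no new task can race with the transfer
        System.out.println(service.scheduledCount()); // prints 1
    }
}
```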
> Race condition between hints and decommision
> --------------------------------------------
>
> Key: CASSANDRA-21217
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21217
> Project: Apache Cassandra
> Issue Type: Bug
> Reporter: Gil Ganz
> Priority: Normal
>