[jira] [Commented] (CASSANDRA-17049) Fix rare NPE caused by batchlog replay / node decommission races

Aleksey Yeschenko (Jira) Fri, 12 Nov 2021 05:54:21 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442754#comment-17442754
 ]


Aleksey Yeschenko commented on CASSANDRA-17049:
-----------------------------------------------

This is a very rare one. An example stack trace of an NPE:

{code}
ERROR 2021-10-19T08:30:16,692 [HintsWriteExecutor:1] org.apache.cassandra.service.CassandraDaemon:599 - Exception in thread Thread[HintsWriteExecutor:1,5,main]
java.lang.NullPointerException: null
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[?:?]
        at org.apache.cassandra.hints.HintsCatalog.get(HintsCatalog.java:102) ~[cie-cassandra-4.0.0.35.jar:4.0.0.35]
        at com.google.common.collect.Iterables$5.lambda$forEach$0(Iterables.java:704) ~[guava-27.0-jre.jar:?]
        at com.google.common.collect.Iterables$5.lambda$forEach$0(Iterables.java:704) ~[guava-27.0-jre.jar:?]
        at java.lang.Iterable.forEach(Iterable.java:75) ~[?:?]
        at com.google.common.collect.Iterables$5.forEach(Iterables.java:704) ~[guava-27.0-jre.jar:?]
        at com.google.common.collect.Iterables$5.forEach(Iterables.java:704) ~[guava-27.0-jre.jar:?]
        at org.apache.cassandra.hints.HintsWriteExecutor$PartiallyFlushBufferPoolTask.run(HintsWriteExecutor.java:188) ~[cie-cassandra-4.0.0.35.jar:4.0.0.35]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.58.Final.jar:4.1.58.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
{code}
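
The top frame of the trace is {{ConcurrentHashMap.get()}}, which rejects {{null}} keys by throwing rather than returning {{null}}. A minimal standalone sketch of that behaviour follows; the class and method names are made up for illustration and are not Cassandra code:

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class, not part of Cassandra.
class NullKeyLookupDemo {
    /** Returns true if looking up the given key throws a NullPointerException. */
    static boolean lookupThrowsNpe(ConcurrentHashMap<UUID, String> map, UUID key) {
        try {
            map.get(key);
            return false;
        } catch (NullPointerException e) {
            // ConcurrentHashMap does not permit null keys, so get(null) throws.
            return true;
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<UUID, String> stores = new ConcurrentHashMap<>();
        stores.put(UUID.randomUUID(), "hints store");

        // An unknown but non-null key is harmless: get() simply returns null.
        System.out.println(lookupThrowsNpe(stores, UUID.randomUUID())); // prints "false"

        // A null key is what a missing host id turns into further down the line.
        System.out.println(lookupThrowsNpe(stores, null)); // prints "true"
    }
}
```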


> Fix rare NPE caused by batchlog replay / node decommission races
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-17049
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17049
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Batch Log, Consistency/Hints
>            Reporter: Aleksey Yeschenko
>            Assignee: Aleksey Yeschenko
>            Priority: Low
>             Fix For: 3.0.x, 3.11.x, 4.0.x, 4.x
>
>
> The batchlog replay process collects the addresses of the hosts that have 
> been hinted to, so that it can flush their hints to disk before confirming 
> deletion of the replayed batches. If a node is decommissioned during replay, 
> however, {{StorageService.getHostIdForEndpoint()}} returns {{null}} for its 
> address when the hints are flushed at the very end of replay. As a result, 
> {{HintsCatalog::get()}} is invoked with a {{null}} host id argument, which 
> causes an NPE.
> The simple fix is to null-check the host ids returned for addresses, and to 
> collect hinted host ids instead of hinted addresses.
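
The shape of the fix described above could be sketched roughly as follows. This is not the actual patch: {{HintedHostCollector}}, {{recordHint()}} and the plain {{Map}} standing in for the address-to-host-id lookup are all hypothetical names chosen for illustration. The idea is to resolve the host id at the moment a hint is written, skip addresses that no longer resolve, and keep only host ids for the final flush.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch, not the Cassandra implementation.
class HintedHostCollector {
    // Stand-in for the address -> host id lookup; get() returns null for
    // addresses that are unknown (for example, already decommissioned).
    // The map is shared, not copied, so later removals are visible here.
    private final Map<String, UUID> hostIdsByAddress;

    // Host ids are collected instead of addresses, so the final flush never
    // has to resolve an address that may have left the ring in the meantime.
    private final Set<UUID> hintedHostIds = new HashSet<>();

    HintedHostCollector(Map<String, UUID> hostIdsByAddress) {
        this.hostIdsByAddress = hostIdsByAddress;
    }

    /** Records that a hint was written for the given address, if it still resolves. */
    void recordHint(String address) {
        UUID hostId = hostIdsByAddress.get(address);
        if (hostId != null) {
            hintedHostIds.add(hostId);
        }
    }

    /** Host ids whose hints should be flushed; never contains null. */
    Set<UUID> hintedHostIds() {
        return Collections.unmodifiableSet(hintedHostIds);
    }

    public static void main(String[] args) {
        Map<String, UUID> ring = new HashMap<>();
        ring.put("10.0.0.1", UUID.randomUUID());
        ring.put("10.0.0.2", UUID.randomUUID());

        HintedHostCollector collector = new HintedHostCollector(ring);
        collector.recordHint("10.0.0.1");
        collector.recordHint("10.0.0.2");

        // Simulate a decommission in the middle of replay.
        ring.remove("10.0.0.2");
        // A hint for an address that no longer resolves is skipped.
        collector.recordHint("10.0.0.3");

        // Both ids were captured before the decommission; none is null.
        System.out.println(collector.hintedHostIds().size()); // prints "2"
    }
}
```

Resolving early means a node that leaves after being hinted still has a usable host id at flush time, while the null check covers a node that has already left by the time the hint would be recorded.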



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
