[
https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997633#comment-14997633
]
Paulo Motta commented on CASSANDRA-10485:
-----------------------------------------
I just found a bug in the previous version: a node can be removed from
TMD ({{TokenMetadata}}) just before the new pending ranges are set, so the
problem would persist.
After thinking this through with a fresh mind, the solution is rather simple; I
think I was over-complicating it. Pending ranges are basically composed of
"normal" endpoints, moving endpoints and bootstrapping endpoints. Moving
endpoints are also "normal" endpoints. So what we actually want to check
before submitting a hint is whether the node is a normal endpoint or a
bootstrapping endpoint. If the node is neither a normal/moving endpoint nor a
bootstrapping endpoint, we don't want to submit hints to it, simple as that.
So I added a new method, {{TokenMetadata.isMemberOrJoining}}, to check that
before submitting a hint, thus avoiding a null host ID on hint submission.
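To illustrate the check described above, here is a minimal stand-alone sketch. It is not the code from the branch and not Cassandra's actual {{TokenMetadata}}: the class {{RingMembershipSketch}}, its two sets, the add/remove methods and {{shouldSubmitHint}} are all hypothetical names introduced for illustration. Only the method name {{isMemberOrJoining}} comes from this comment, and its body here is an assumption about what such a check could look like.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of ring membership. NOT Cassandra's TokenMetadata:
// every name here except isMemberOrJoining is hypothetical.
final class RingMembershipSketch {
    // "Normal" endpoints; moving endpoints are assumed to stay in this set.
    private final Set<String> normalEndpoints = new HashSet<>();
    // Endpoints that are still joining the ring.
    private final Set<String> bootstrappingEndpoints = new HashSet<>();

    synchronized void addNormalEndpoint(String endpoint) {
        // A node that finishes bootstrapping becomes a normal endpoint.
        bootstrappingEndpoints.remove(endpoint);
        normalEndpoints.add(endpoint);
    }

    synchronized void addBootstrappingEndpoint(String endpoint) {
        bootstrappingEndpoints.add(endpoint);
    }

    // Models a node leaving the ring, e.g. after being replaced
    // or after a failed bootstrap.
    synchronized void removeEndpoint(String endpoint) {
        normalEndpoints.remove(endpoint);
        bootstrappingEndpoints.remove(endpoint);
    }

    // True only if the endpoint is normal (including moving) or bootstrapping.
    synchronized boolean isMemberOrJoining(String endpoint) {
        return normalEndpoints.contains(endpoint)
            || bootstrappingEndpoints.contains(endpoint);
    }

    // Hypothetical guard a caller could evaluate before writing a hint.
    boolean shouldSubmitHint(String endpoint) {
        return isMemberOrJoining(endpoint);
    }

    public static void main(String[] args) {
        RingMembershipSketch ring = new RingMembershipSketch();
        ring.addNormalEndpoint("10.0.0.1");
        ring.addBootstrappingEndpoint("10.0.0.2");

        System.out.println(ring.shouldSubmitHint("10.0.0.1")); // true
        System.out.println(ring.shouldSubmitHint("10.0.0.2")); // true

        // Once the node is removed from the ring, hints are skipped.
        ring.removeEndpoint("10.0.0.1");
        System.out.println(ring.shouldSubmitHint("10.0.0.1")); // false
    }
}
```

In this sketch the membership test and the removal are synchronized on the same object, so a caller never observes a half-removed endpoint. Whether the real implementation needs or uses that kind of locking is not something this comment states.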
The two reports of this bug, on CASSANDRA-6335 and CASSANDRA-10233, occurred
when a node was replaced and when bootstrapping failed. A replaced node was a
"normal" endpoint, but once it is replaced it is removed from the ring,
so we shouldn't submit a hint to it. When a new node is down after a failed
bootstrap, it is also removed from the ring, so we shouldn't submit a hint to it.
Actually, with CASSANDRA-8838 there's a possibility of resuming a failed
bootstrap, so we should keep the bootstrapping node in the ring for a
quarantine period instead of removing it right away, but we should handle this
in a separate ticket.
Submitted a new branch with the proposed solution. Sorry for the confusion on
this.
||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-final]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485-final]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485-final]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485-final]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-final-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-final-dtest/lastCompletedBuild/testReport/]|
> Missing host ID on hinted handoff write
> ---------------------------------------
>
> Key: CASSANDRA-10485
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
> Project: Cassandra
> Issue Type: Bug
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN [SharedPool-Worker-1] 2015-10-08 13:15:33,882
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
> at
> org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978)
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at
> org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950)
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at
> org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235)
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_60]
> at
> org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> [apache-cassandra-2.1.3.jar:2.1.3]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I run nodetool status, the problematic node has an ID:
> {noformat}
> UN 10.10.10.12 1.3 TB 1 ?
> 4d5c8fd2-a909-4f09-a23c-4cd6040f338a rack3
> {noformat}