[ 
https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997633#comment-14997633
 ] 

Paulo Motta commented on CASSANDRA-10485:
-----------------------------------------

I just found a bug in the previous version: a node can be removed from TMD
(TokenMetadata) just before the new pending ranges are set, so the problem
would persist.

After thinking this through with a fresh mind, the solution is rather simple; I
think I was over-complicating it. Pending ranges are basically composed of
"normal" endpoints, moving endpoints and bootstrapping endpoints, and moving
endpoints are also "normal" endpoints. So what we actually want to check
before submitting a hint is whether the node is a normal endpoint or a
bootstrapping endpoint. If it is neither, we don't want to submit hints to it,
simple as that. So I added a new method,
{{TokenMetadata.isMemberOrJoining}}, to check that before submitting a hint,
thus avoiding a null host ID on hint submission.
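
To make the check concrete, here is a minimal sketch of the idea, not the
actual patch. {{RingSketch}}, its fields and {{shouldSubmitHint}} are
hypothetical names for illustration only; the real {{TokenMetadata}} tracks
tokens, host IDs and pending ranges and uses its own locking. The sketch
assumes moving endpoints stay in the "normal" set, as described above.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical, simplified stand-in for ring metadata.
 * This is NOT Cassandra's TokenMetadata; it only models membership.
 */
class RingSketch {
    // Endpoints that own tokens; moving endpoints are assumed to stay here.
    private final Set<String> normalEndpoints = new HashSet<>();
    // Endpoints that are still joining the ring.
    private final Set<String> bootstrappingEndpoints = new HashSet<>();

    synchronized void addNormal(String endpoint) {
        // A node that finished joining is no longer bootstrapping.
        bootstrappingEndpoints.remove(endpoint);
        normalEndpoints.add(endpoint);
    }

    synchronized void addBootstrapping(String endpoint) {
        bootstrappingEndpoints.add(endpoint);
    }

    /** Models a replaced node or a failed bootstrap: the node leaves the ring. */
    synchronized void remove(String endpoint) {
        normalEndpoints.remove(endpoint);
        bootstrappingEndpoints.remove(endpoint);
    }

    /** True if the endpoint is a normal (or moving) member, or is bootstrapping. */
    synchronized boolean isMemberOrJoining(String endpoint) {
        return normalEndpoints.contains(endpoint)
            || bootstrappingEndpoints.contains(endpoint);
    }

    /** Hypothetical guard: only write a hint for a node that is still in the ring. */
    boolean shouldSubmitHint(String endpoint) {
        return isMemberOrJoining(endpoint);
    }
}
```

With this guard, a hint for a node that has already been removed from the ring
is skipped up front instead of failing later on a missing host ID.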

The two reports of this bug, on CASSANDRA-6335 and CASSANDRA-10233, occurred
when a node was replaced or when bootstrapping failed. A replaced node was a
"normal" endpoint, but after the replacement it was removed from the ring, so
we shouldn't submit a hint to it. Likewise, when a new node is down after a
failed bootstrap, it is removed from the ring, so we shouldn't submit a hint to
it. Actually, with CASSANDRA-8838 there's a possibility of resuming a failed
bootstrap, so we should keep the bootstrapping node in the ring for a
quarantine period rather than removing it right away, but we should handle
this in a separate ticket.

I submitted a new branch with the proposed solution. Sorry for the confusion
on this.
||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-final]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485-final]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485-final]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485-final]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-final-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-final-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-final-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-final-dtest/lastCompletedBuild/testReport/]|

> Missing host ID on hinted handoff write
> ---------------------------------------
>
>                 Key: CASSANDRA-10485
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN  [SharedPool-Worker-1] 2015-10-08 13:15:33,882 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
>         at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>         at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I run nodetool status, the problematic node does have a host ID:
> {noformat}
> UN  10.10.10.12  1.3 TB     1       ?       4d5c8fd2-a909-4f09-a23c-4cd6040f338a  rack3
> {noformat}



