[
https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994779#comment-14994779
]
Paulo Motta commented on CASSANDRA-10485:
-----------------------------------------
I implemented an alternative approach that is a bit cleaner and more
deterministic. The basic idea is to add a new method,
{{TokenMetadata.isMemberOrPending()}}, and only submit hints to endpoints that
are ring members or pending members. This avoids fetching null host IDs
for removed pending endpoints while the new pending ranges are being calculated.
To support the {{TokenMetadata.isMemberOrPending()}} method,
{{TokenMetadata}} maintains a new {{livePendingEndpoints}} set, which is
populated every time new pending ranges are set. When endpoints are removed
from {{TokenMetadata}} via the {{removeEndpoint}} method, they're also removed
from the {{livePendingEndpoints}} set, so {{TokenMetadata.isMemberOrPending()}}
returns false if the endpoint is evicted from the ring. Since both
{{removeEndpoint}} and {{setPendingRanges}} update this set, they share a write
lock. {{TokenMetadata.isMemberOrPending()}} takes a read lock, like
other methods such as {{isMember()}} and {{getHostId()}}.
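The locking scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: {{RingMembershipSketch}}, {{addMember}} and {{setPendingEndpoints}} are hypothetical names, endpoints are plain strings, and replacing the whole pending set on every update is an assumption.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for TokenMetadata. Endpoints are plain
// strings here; the real class tracks tokens, host IDs and much more.
class RingMembershipSketch
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> members = new HashSet<>();
    private final Set<String> livePendingEndpoints = new HashSet<>();

    void addMember(String endpoint)
    {
        lock.writeLock().lock();
        try { members.add(endpoint); }
        finally { lock.writeLock().unlock(); }
    }

    // Called whenever new pending ranges are set. Assumption: the set is
    // replaced wholesale rather than merged.
    void setPendingEndpoints(Collection<String> pending)
    {
        lock.writeLock().lock();
        try
        {
            livePendingEndpoints.clear();
            livePendingEndpoints.addAll(pending);
        }
        finally { lock.writeLock().unlock(); }
    }

    // Removal clears the endpoint from both sets under the same write lock,
    // so a removed pending endpoint is never reported as pending.
    void removeEndpoint(String endpoint)
    {
        lock.writeLock().lock();
        try
        {
            members.remove(endpoint);
            livePendingEndpoints.remove(endpoint);
        }
        finally { lock.writeLock().unlock(); }
    }

    // Readers only need the read lock.
    boolean isMemberOrPending(String endpoint)
    {
        lock.readLock().lock();
        try { return members.contains(endpoint) || livePendingEndpoints.contains(endpoint); }
        finally { lock.readLock().unlock(); }
    }
}
```

In this sketch, a caller that is about to write a hint would check {{isMemberOrPending(endpoint)}} first and skip the hint when it returns false.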
Merging the solution from 2.1 to 2.2/3.0 was a bit tricky because the pending
ranges calculation had been moved from {{PendingRangeCalculatorService}} into
{{TokenMetadata}}, where it runs within a read lock, so I had to separate the
calculation (within a read lock) from the assignment of
{{pendingRanges}} via the {{setPendingRanges}} method, which uses a write lock.
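The split between calculation and assignment might look roughly like the sketch below. It assumes a {{java.util.concurrent.locks.ReentrantReadWriteLock}}, which does not support upgrading a held read lock to a write lock; whether that is the exact motivation in the patch is an assumption. {{PendingRangesSketch}}, {{addBootstrapping}} and {{pendingEndpointsFor}} are hypothetical names, and the "calculation" is a placeholder rather than the real range logic.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: compute under the read lock, publish under the write lock.
class PendingRangesSketch
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> bootstrapping = new HashSet<>();
    // keyspace -> endpoints with pending ranges (placeholder for real range data)
    private Map<String, Set<String>> pendingRanges = new HashMap<>();

    void addBootstrapping(String endpoint)
    {
        lock.writeLock().lock();
        try { bootstrapping.add(endpoint); }
        finally { lock.writeLock().unlock(); }
    }

    // Step 1: read-only calculation. Returns a fresh map and mutates nothing,
    // so the read lock is sufficient.
    Map<String, Set<String>> calculatePendingRanges(Collection<String> keyspaces)
    {
        lock.readLock().lock();
        try
        {
            Map<String, Set<String>> calculated = new HashMap<>();
            for (String keyspace : keyspaces)
                calculated.put(keyspace, new HashSet<>(bootstrapping));
            return calculated;
        }
        finally { lock.readLock().unlock(); }
    }

    // Step 2: assignment. Runs after the read lock has been released, because
    // acquiring the write lock while holding the read lock would block forever.
    void setPendingRanges(Map<String, Set<String>> calculated)
    {
        lock.writeLock().lock();
        try { pendingRanges = calculated; }
        finally { lock.writeLock().unlock(); }
    }

    Set<String> pendingEndpointsFor(String keyspace)
    {
        lock.readLock().lock();
        try
        {
            Set<String> pending = pendingRanges.get(keyspace);
            return pending == null ? new HashSet<>() : new HashSet<>(pending);
        }
        finally { lock.readLock().unlock(); }
    }
}
```

A caller would invoke {{calculatePendingRanges}} first and then pass the result to {{setPendingRanges}}. The state may change between the two calls, which is one trade-off of this kind of split.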
On 3.0, the hints submission part is slightly different (even simpler) due to
the new hints implementation.
It's still not ideal, but I think it is better than the previous approach. I will
link this ticket to CASSANDRA-6061 so we can take it into
account when refactoring {{TokenMetadata}}.
Below are the new branches and test results:
||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485-v3]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485-v3]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485-v3]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-v3-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-v3-dtest/lastCompletedBuild/testReport/]|
> Missing host ID on hinted handoff write
> ---------------------------------------
>
> Key: CASSANDRA-10485
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
> Project: Cassandra
> Issue Type: Bug
> Reporter: Paulo Motta
> Assignee: Paulo Motta
> Fix For: 2.1.x, 2.2.x, 3.0.x
>
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN [SharedPool-Worker-1] 2015-10-08 13:15:33,882 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
>     at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>     at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>     at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I run nodetool status, the problematic node has an ID:
> {noformat}
> UN 10.10.10.12 1.3 TB 1 ? 4d5c8fd2-a909-4f09-a23c-4cd6040f338a rack3
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)