Hello Paul,

Thank you for your reply.
The version is 2.2.6.

I received the logs today and can confirm that three streams failed after a
timeout. We will try to resume the bootstrap as you recommended.
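
For reference, this is roughly what I plan to run on the joining node, per
your advice. The commands are standard nodetool/sysctl invocations; the
timeout and keepalive values shown are only illustrative placeholders, not
our actual configuration:

    # Resume the interrupted bootstrap on the joining node (Cassandra 2.2+)
    nodetool bootstrap resume

    # Watch stream progress while it runs
    nodetool netstats

    # Settings we will double-check first:
    # in cassandra.yaml (example value, 24 hours)
    #   streaming_socket_timeout_in_ms: 86400000
    # kernel TCP keepalive settings
    sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes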

I didn't use -Dreplace_address for two reasons:

   1. Someone had already tried to reset the node in some way. Since that
   person is on vacation, nobody really knows exactly what he did. I suspect
   he simply trashed the data directory and launched the node again, without
   -Dreplace_address and without removing the node first. I was unsure
   whether the tokens were still valid, so I preferred to remove the node
   and go back to a clean situation.
   2. Since the replacing node and the node being replaced have the same
   endpoint address (it is a fresh install of the same node), I was not sure
   that replace_address would not get confused.

Since I had time and was not sure that replacing the node would work in my
situation, I chose the slow but safe way. Maybe I could have used it.
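
For completeness, this is how I understand the replace procedure would have
been invoked in our case. The IP address is only a placeholder, and it
assumes the data, commitlog and saved caches directories are emptied before
the restart:

    # In cassandra-env.sh on the node being rebuilt, before starting it.
    # 10.0.0.12 stands for the node's own address (placeholder), since the
    # replacing node keeps the same IP as the node it replaces.
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"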

-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-15 20:51 GMT+02:00 Paulo Motta <pauloricard...@gmail.com>:

> What version are you on? This seems like a typical case where there was a
> problem with streaming (hanging, etc.), do you have access to the logs?
> Maybe look for streaming errors? Typically streaming errors are related to
> timeouts, so you should review your Cassandra
> streaming_socket_timeout_in_ms and kernel tcp_keepalive settings.
>
> If you're on 2.2+ you can resume a failed bootstrap with nodetool
> bootstrap resume. There were also some streaming hang problems fixed
> recently, so I'd advise you to upgrade to the latest release in your
> particular series for a more robust version.
>
> Is there any reason why you didn't use the replace procedure
> (-Dreplace_address) to replace the node with the same tokens? This would be
> a bit faster than the remove + bootstrap procedure.
>
> 2016-08-15 15:37 GMT-03:00 Jérôme Mainaud <jer...@mainaud.com>:
>
>> Hello,
>>
>> A client of mine is having problems when adding a node to the cluster.
>> After 4 days, the node is still in joining mode, it doesn't have the same
>> level of load as the others, and there seems to be no streaming to or from
>> the new node.
>>
>> This node has a history.
>>
>>    1. At the beginning, it was a seed in the cluster.
>>    2. Ops detected that the client had problems with it.
>>    3. They tried to reset it but failed. In the process they launched
>>    several repair and rebuild operations on the node.
>>    4. Then they asked me to help them.
>>    5. We stopped the node,
>>    6. removed it from the list of seeds (more precisely, it was replaced
>>    by another node),
>>    7. removed it from the cluster (I chose not to use decommission
>>    since the node's data was compromised),
>>    8. deleted all files from the data, commitlog and saved caches
>>    directories.
>>    9. After the leaving process ended, it was started as a fresh new
>>    node and began autobootstrap.
>>
>>
>> As I don't have direct access to the cluster, I don't have a lot of
>> information, but I will have more tomorrow (logs and the results of some
>> commands), and I can ask people for any required information.
>>
>> Does anyone have any idea of what could have happened and what I should
>> investigate first? What would you do to unblock the situation?
>>
>> Context: The cluster consists of two DCs, each with 15 nodes. The average
>> load is around 3 TB per node. The joining node froze a little after 2 TB.
>>
>> Thank you for your help.
>> Cheers,
>>
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>
