Hello Paulo,

Thank you for your reply. The version is 2.2.6.
I received the logs today and can confirm that three streams failed after a timeout. We will try to resume the bootstrap as you recommended.

I didn't use -Dreplace_address for two reasons:

1. Someone tried to reset the node in some way. Since that person is on vacation, nobody really knows what he did. I suppose he just trashed the data directory and started the node again, without -Dreplace_address and without removing the node first. As I was unsure whether the tokens were still valid, I preferred to remove the node and go back to a clean situation.

2. Since the replaced node and the new node have the same endpoint address (this is a fresh install of the same node), I was not sure replace_address would not get confused.

Since I had time and was not sure that replacing the node would work in my situation, I chose the slow, safe way. Maybe I could have used it.

--
Jérôme Mainaud
jer...@mainaud.com

2016-08-15 20:51 GMT+02:00 Paulo Motta <pauloricard...@gmail.com>:

> What version are you on? This seems like a typical case where there was a
> problem with streaming (hanging, etc.). Do you have access to the logs?
> Maybe look for streaming errors? Typically streaming errors are related to
> timeouts, so you should review your Cassandra
> streaming_socket_timeout_in_ms and kernel tcp_keepalive settings.
>
> If you're on 2.2+ you can resume a failed bootstrap with nodetool
> bootstrap resume. There were also some streaming hanging problems fixed
> recently, so I'd advise you to upgrade to the latest version of your
> particular series for a more robust version.
>
> Is there any reason why you didn't use the replace procedure
> (-Dreplace_address) to replace the node with the same tokens? This would be
> a bit faster than the remove + bootstrap procedure.
>
> 2016-08-15 15:37 GMT-03:00 Jérôme Mainaud <jer...@mainaud.com>:
>
>> Hello,
>>
>> A client of mine has problems when adding a node to the cluster.
>> After 4 days, the node is still in joining mode; it doesn't have the same
>> level of load as the others, and there seems to be no streaming from or to
>> the new node.
>>
>> This node has a history.
>>
>> 1. At the beginning, it was a seed in the cluster.
>> 2. Ops detected that clients had problems with it.
>> 3. They tried to reset it but failed. In the process they launched
>> several repair and rebuild processes on the node.
>> 4. Then they asked me to help them.
>> 5. We stopped the node,
>> 6. removed it from the list of seeds (more precisely, it was replaced
>> by another node),
>> 7. removed it from the cluster (I chose not to use decommission
>> since the node's data was compromised),
>> 8. deleted all files from the data, commitlog and saved-cache directories,
>> 9. and, after the leaving process ended, started it as a fresh new
>> node; it began auto-bootstrap.
>>
>> As I don't have direct access to the cluster I don't have a lot of
>> information, but I will have some tomorrow (logs and the results of some
>> commands). And I can ask people for any required information.
>>
>> Does someone have any idea of what could have happened and what I should
>> investigate first?
>> What would you do to unblock the situation?
>>
>> Context: The cluster consists of two DCs, each with 15 nodes. The average
>> load is around 3 TB per node. The joining node froze a little after 2 TB.
>>
>> Thank you for your help.
>> Cheers,
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
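For readers following along, the resume-and-check steps discussed above translate roughly into the shell commands below. This is only a sketch of the generic procedure, not the exact commands run on this cluster; the commands are guarded so they are harmless on a machine without Cassandra installed.

```shell
# On the joining node: resume the interrupted bootstrap
# (nodetool bootstrap resume exists from Cassandra 2.2 on).
if command -v nodetool >/dev/null 2>&1; then
  nodetool bootstrap resume   # restarts only the streams that failed
  nodetool netstats           # shows per-session streaming progress
fi

# Streams that hang (rather than fail cleanly) usually point at
# timeout/keepalive settings: check streaming_socket_timeout_in_ms in
# cassandra.yaml on every node, and the kernel keepalive interval
# (Linux defaults to 7200 s, which is long for multi-hour streams):
sysctl net.ipv4.tcp_keepalive_time 2>/dev/null || true
```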
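For reference, the replace procedure Paulo mentions is driven by a JVM system property rather than a nodetool command. A minimal sketch, assuming the dead node's address is 192.0.2.10 (an illustrative IP, not from the thread):

```shell
# Hypothetical sketch: start the replacement node with empty data,
# commitlog and saved-cache directories and the replace flag set,
# e.g. by appending to JVM_OPTS in cassandra-env.sh. The node then
# claims the dead node's tokens and streams their data directly,
# skipping the remove + bootstrap cycle.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.0.2.10"
echo "$JVM_OPTS"
```

Replacing a node that keeps its own IP address is supported; the flag tells gossip which tokens to claim. Later versions also offer -Dcassandra.replace_address_first_boot, which is ignored once the node has bootstrapped successfully, so it cannot be accidentally re-applied on restart.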