[
https://issues.apache.org/jira/browse/KUDU-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823558#comment-15823558
]
zhangsong commented on KUDU-1579:
---------------------------------
After experiencing several node failure cases (using kudu-tserver revision
b906affcdee3ec814c9e96d35fea715fdbb4c330-dirty), I found these two facts.
1. When multiple kudu-tserver nodes (say 5) crash at around the same time
(not exactly simultaneously), some tablets end up failed. The reasons for the
failed tablets should be those described in KUDU-1449. Also, in the kudu-master
UI I can see a lot of addServer/removeServer tasks hanging, with no sign that
they will recover automatically.
2. When facing multiple node crashes, I stopped kudu-master until the whole
cluster was stable (no more node crashes), then restarted kudu-master. After
recovering all the crashed kudu-tserver nodes, no failed tablets were found.
So in my case, it seems kudu-master should freeze for some time when multiple
nodes crash at around the same time (e.g. within some period of time). "Freeze"
here means it stops servicing addServer/removeServer RPCs.
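
A rough sketch of that freeze rule, for illustration only. This is not Kudu
code: SafeModeGate, max_failures and window_s are hypothetical names, and the
threshold and window values are placeholders. It only shows the idea of
counting tserver failures inside a sliding time window and refusing
addServer/removeServer work while the count is at or above the threshold.

```python
from collections import deque


class SafeModeGate:
    """Hypothetical gate deciding whether replica config changes may proceed."""

    def __init__(self, max_failures=3, window_s=300.0):
        # Both values are assumed placeholders, not Kudu settings.
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # timestamps (seconds) of recent failures

    def record_failure(self, now):
        """Note that a tserver was declared dead at time `now`."""
        self._failures.append(now)

    def _prune(self, now):
        # Forget failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()

    def in_safe_mode(self, now):
        """True while too many failures were seen within the window."""
        self._prune(now)
        return len(self._failures) >= self.max_failures

    def may_change_config(self, now):
        """Whether an addServer/removeServer task may be started at `now`."""
        return not self.in_safe_mode(now)
```

With these placeholder values, three failures within 300 seconds freeze config
changes, and they resume once the oldest failure ages out of the window. A
real implementation would also need to decide what happens to tasks that were
already in flight when the freeze started.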
Just some thoughts for today; I may complete this later.
> into "safe mode" when large number of node crash
> --------------------------------------------------
>
> Key: KUDU-1579
> URL: https://issues.apache.org/jira/browse/KUDU-1579
> Project: Kudu
> Issue Type: New Feature
> Reporter: zhangsong
>
> Currently, re-replication happens whenever a node crashes.
> However, when a large number of nodes crash, this leads to a replication
> storm, which causes a mess and data loss.
> Replication should be prudent, and the cluster should go into a "safe mode"
> in the node crash case described above.