LebronAl edited a comment on issue #3954: URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517
> You may think it is easier to switch to a new frame than to improve the current one, but it is totally not the case. Implementing a distributed version involves much more than calling some methods in a library. Interface adaption, schema conversion, exception handling, cluster organization, and data distribution... there are so many to do.

+1

The goal for all of us is to make the cluster module more stable. Some people feel it is better to use a mature Raft library because Raft is hard to implement correctly, but I have also observed [Kafka](https://github.com/apache/kafka/tree/trunk/raft) writing its own Raft instead of using an existing Raft library. Other people don't feel the need to make big changes right now because they don't seem to have any problems at the moment, yet a great Raft library such as etcd's has removed consensus as a bottleneck for the entire cloud-native ecosystem.

Being pragmatic, we have to admit that even if we decided to use another Raft library, the switch could not be done in a month or two. Also, most of the bugs we've fixed so far have nothing to do with consensus. Therefore, moving precipitously would carry a great deal of risk.

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Research: Follow my list of 5 concerns to see how other libraries handle them. Understanding these things will help us understand the Raft algorithm more deeply. Whether or not the Raft library is replaced, this is good for cluster IoTDB, because the people who develop it will know more about Raft.
- Refactoring: Since we've recently started refactoring the cluster code, I think we could refactor the Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change the Raft library, this is basically the first step of changing it, since we must introduce some abstraction before the current Raft algorithm can be swapped out. For those who don't want to change the Raft library, doing so improves code readability and makes it easier to add more complex tests. After modularization, we can collect some performance comparisons and pros and cons before we discuss whether the Raft library needs to be replaced. I don't think we would be as divided then as we are now.
- Testing: After nearly a year of maintaining the cluster module, I believe that most of the bugs fixed so far have nothing to do with consensus and would show up even if we replaced the Raft library. Of course, I'm not saying that our current Raft implementation has no bugs; that we have found few is probably due to a lack of testing and a lack of production cases. Therefore, I suggest that we fully test the cluster module from now on and decide the next step according to the test results. In addition, I am currently investigating and designing a cluster [chaos-test framework](https://gitlab.summer-ospp.ac.cn/summer2021/210070607). If everything goes well, I will have a chaos testing framework that is easy to deploy by the end of September, and we can also use it to test the cluster's stability. You are welcome to join me.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@iotdb.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org