LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517


   > You may think it is easier to switch to a new frame than to improve the 
current one, but it is totally not the case. Implementing a distributed version 
involves much more than calling some methods in a library. Interface adaption, 
schema conversion, exception handling, cluster organization, and data 
distribution... there are so many to do.
   
   +1 
   
   The goal for all of us is to make the cluster module more stable. Some 
people feel it is better to use a mature Raft library because Raft is hard to 
implement correctly, but I have also observed 
[Kafka](https://github.com/apache/kafka/tree/trunk/raft) writing its own Raft 
instead of using an existing Raft library. Other people don't feel the need to 
make big changes right now because there don't seem to be any problems at the 
moment, but a great Raft library like etcd has removed the consensus 
bottleneck for the entire cloud-native ecosystem.
   
   Being pragmatic, we have to admit that even if we decided to use another 
Raft library, the switch would not be possible within a month or two. Also, 
most of the bugs we've fixed so far have nothing to do with consensus. 
Therefore, a hasty decision to switch carries a great deal of risk.
   
   So I suggest we go in three directions in parallel: research + refactoring + 
testing.
   
   - Research: Follow my list of 5 concerns to see how other libraries handle 
them. Understanding these things will help us understand the Raft algorithm 
more deeply. Whether or not the Raft library is replaced, this is good for 
cluster IoTDB because the people who develop it will know more about Raft.
   
   - Refactoring: Since we've recently started refactoring the cluster code, I 
think we could refactor the Raft code as well. Ideally, it should be a single 
module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For 
anyone who wants to change the Raft library, this is essentially part of the 
process of changing it, because we must introduce some abstraction before we 
can swap out the current Raft implementation. For those who don't want to 
change the Raft library, doing so improves code readability and makes it 
easier to add more complex tests. After modularization, we can list 
performance comparisons and pros and cons before we discuss whether the Raft 
library needs to be replaced. By then, I don't think our opinions will diverge 
as much as they do now.
   
   - Testing: After nearly a year of maintaining the cluster module, I believe 
that most of the bugs fixed so far have nothing to do with consensus and would 
show up even after replacing the Raft library. Of course, I'm not saying that 
our current Raft implementation has no bugs; that we have found few so far is 
probably due to a lack of testing and a lack of production cases. Therefore, I 
suggest that we fully test the cluster module from now on and decide the next 
step based on the test results. In addition, I am currently investigating and 
designing a cluster [chaos-test 
framework](https://gitlab.summer-ospp.ac.cn/summer2021/210070607). If 
everything goes well, I will have a chaos testing framework that is easy to 
deploy by the end of September, and we can also use this framework to test the 
cluster's stability. You are welcome to join me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


