chengjianyun commented on issue #3954: URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920815354
Summarizing the different opinions and reorganizing the content to make things clear.

## Current situation of cluster availability

- https://github.com/apache/iotdb/pull/3939 The bug was found by reviewing code.
- https://github.com/apache/iotdb/pull/3930 The bug was found from an error message in a failed-restart case. The error message had actually been ignored for a long time because the server starts normally in most cases.
- https://github.com/apache/iotdb/pull/3848 This is a defensive change made after reviewing code.
- https://github.com/apache/iotdb/pull/3832 The bug was found in a case where the server couldn't start.

We found and fixed these bugs within the last month. Some were found by reviewing code and look obvious; the others were reported in issues unrelated to Raft. I have reason to believe there are many more bugs we don't know about. Some of them may cause data loss. Because IoTDB is mostly used in big-data scenarios, it is hard to notice when a small part of the data is lost. That's the worst situation: we don't know when it will happen or what it will lead to.

## Goal of discussion

Find a way to improve the availability of the IoTDB cluster while keeping its outstanding performance, within a short time (around 2 months).

## Assumption

These assumptions are the basis of the discussion. We don't need to debate them any further.

1. The chosen 3rd party Raft library has been widely used in industry, is of high quality, and its algorithm is correct in known cases. Because it has been so widely used, I think we can treat `known cases` as `all cases` we could ever have.
2. A single application scenario may give rise to many scenarios in Raft.

## How to achieve the goal

### Import 3rd party Raft framework

#### Plan

1. Abstract proper interfaces for Raft and gradually make it a module.
2. At the same time, investigate some 3rd party libraries and see whether one of them fits.
3. Import the chosen library to replace our Raft module if needed.

Of course, things are not as easy as they sound, but the path is clear.
After this, the tests can focus on application scenarios, which are far fewer than Raft scenarios.

#### Advantages

- Doable
- Raft correctness is guaranteed
- Simpler tests, no UT for Raft

#### Disadvantages

- May lose some performance
- We lose control of Raft; updates take a relatively long time
- Can't customize Raft to improve performance

### Improvement based on current implementation

#### Plan

??

#### Advantages

- Could customize Raft to optimize performance.
- Raft updates could be done in a short time.

#### Disadvantages

- Correctness of Raft is hard to guarantee.
- It's hard to make the Raft tests complete.

## My thought

I don't support improving the current implementation because:

- I don't know whether the current tests for Raft are complete
- I don't know how far we are from making the Raft test cases complete

TiKV implemented all of the Raft tests from etcd (after it had been stable for years) in Rust. If we could do something similar in a short time, I think it's OK to improve on the current implementation ourselves. Also, optimizing Raft at this stage is not a good idea; please, let's make it correct first.

Please add anything I missed!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
