Hello community,

It's been a while since the discussion on the Celeborn chaos testing framework began. The main process of Celeborn chaos testing includes:
1. Define a test plan that describes the types of events, the order in which they are triggered, and their duration. Event types include node anomalies, disk anomalies, IO anomalies, CPU overload, etc.
2. The client submits the plan to the scheduler.
3. The scheduler sends operations to each node's runner according to the plan.
4. The runner executes the operations and reports the current status of its node.
5. Before triggering an operation, the scheduler deduces the result of the event. If the event would leave the cluster unable to meet the minimum runnable environment for RSS, it is rejected.

Do you have any thoughts or questions about this chaos testing framework? Feedback is welcome to further ensure the reliability of Celeborn through chaos testing.

Regards,
Nicholas Jiang

At 2024-07-03 05:20:57, "Nicholas Jiang" <[email protected]> wrote:
>Hi all,
>
>I would like to start a discussion on CIP-10: Introduce Celeborn Chaos Testing
>Framework[1].
>
>A chaos testing framework is designed to simulate unpredictable and adverse
>conditions in distributed systems to validate their robustness and resilience.
>This proposal aims to simulate various anomalies and test the stability of
>Celeborn in distributed environments via chaos testing.
>
>Looking forward to everyone's feedback and suggestions. Thank you!
>
>[1]
>https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
>
>Regards,
>Nicholas Jiang
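
P.S. To make the five steps at the top of this message a bit more concrete, here is a rough Python sketch of the plan / scheduler / pre-check flow. It is only an illustration, not the CIP-10 design: the names `EventType`, `Event`, `Scheduler` and `min_workers` are hypothetical, the "minimum runnable environment" is assumed to be a simple minimum count of healthy workers, only node anomalies are assumed to take a worker out of service, and event durations and recovery are ignored.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    # Event types named in the message; the enum itself is hypothetical.
    NODE_ANOMALY = "node"
    DISK_ANOMALY = "disk"
    IO_ANOMALY = "io"
    CPU_OVERLOAD = "cpu"


@dataclass(frozen=True)
class Event:
    event_type: EventType
    target: str       # hypothetical worker identifier
    duration_s: int   # carried in the plan, not simulated in this sketch


@dataclass
class Scheduler:
    healthy_workers: set
    min_workers: int  # assumed stand-in for the "minimum runnable environment"
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    def submit(self, plan):
        """Walk the plan in order, rejecting events that fail the pre-check."""
        for event in plan:
            if self._would_break_minimum(event):
                self.rejected.append(event)
            else:
                self._dispatch(event)
                self.accepted.append(event)

    def _would_break_minimum(self, event):
        # Step 5: deduce the outcome before triggering the event.
        # Assumption: only a node anomaly removes a worker from service.
        if event.event_type is not EventType.NODE_ANOMALY:
            return False
        remaining = self.healthy_workers - {event.target}
        return len(remaining) < self.min_workers

    def _dispatch(self, event):
        # Steps 3-4: a real scheduler would send the operation to the
        # node's runner and wait for its status report; here we only
        # update the scheduler's own view of the cluster.
        if event.event_type is EventType.NODE_ANOMALY:
            self.healthy_workers.discard(event.target)


plan = [
    Event(EventType.NODE_ANOMALY, "w1", 60),
    Event(EventType.CPU_OVERLOAD, "w2", 30),
    Event(EventType.NODE_ANOMALY, "w2", 60),
]
scheduler = Scheduler(healthy_workers={"w1", "w2", "w3"}, min_workers=2)
scheduler.submit(plan)
print([e.target for e in scheduler.accepted])
print([e.target for e in scheduler.rejected])
```

With three workers and an assumed minimum of two, the first node anomaly and the CPU overload are accepted, while the second node anomaly is rejected because it would leave only one healthy worker.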
