Hello community,

It has been a while since the discussion on the Celeborn chaos testing 
framework started. As a recap, the main process of Celeborn chaos testing is:

1. Defining a test plan to describe the types of events, the order in which 
events are triggered, and their duration. Event types include node anomalies, 
disk anomalies, IO anomalies, CPU overload, etc.
2. The client submits the plan to the scheduler.
3. The scheduler sends operations to each node's runner according to the plan 
description.
4. The runner is responsible for executing the operations and reporting the 
current status of the node.
5. Before triggering an operation, the scheduler predicts the outcome of the 
event. If the event would leave the cluster unable to meet the minimum 
runnable environment for RSS, the event is rejected.
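
To make steps 1, 3 and 5 more concrete, here is a rough sketch of a plan and 
of the scheduler's pre-check. It is illustrative Python only, not the CIP-10 
implementation: every name (EventType, Event, Scheduler, min_healthy_workers) 
is hypothetical, and it assumes that only a node anomaly takes a worker out 
of service and that the "minimum runnable environment" can be reduced to a 
minimum number of healthy workers. Recovery after an event's duration ends is 
not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    # Event types named in the plan description above.
    NODE_ANOMALY = "node"
    DISK_ANOMALY = "disk"
    IO_ANOMALY = "io"
    CPU_OVERLOAD = "cpu"


@dataclass(frozen=True)
class Event:
    event_type: EventType
    target: str       # hypothetical worker id
    duration_s: int   # how long the anomaly should last


class Scheduler:
    """Toy scheduler: checks each event before dispatching it to a runner."""

    def __init__(self, workers, min_healthy_workers):
        self.healthy = set(workers)
        # Assumption: the minimum runnable environment is a worker count.
        self.min_healthy_workers = min_healthy_workers

    def would_break_cluster(self, event):
        # Assumption: only a node anomaly removes a worker from service.
        if event.event_type is not EventType.NODE_ANOMALY:
            return False
        remaining = self.healthy - {event.target}
        return len(remaining) < self.min_healthy_workers

    def run_plan(self, plan):
        # Events are processed in plan order; each gets a status.
        results = []
        for event in plan:
            if self.would_break_cluster(event):
                results.append((event, "rejected"))
                continue
            if event.event_type is EventType.NODE_ANOMALY:
                self.healthy.discard(event.target)
            results.append((event, "dispatched"))
        return results


# A three-worker cluster that needs at least two healthy workers.
plan = [
    Event(EventType.NODE_ANOMALY, "worker-1", 60),
    Event(EventType.DISK_ANOMALY, "worker-2", 30),
    Event(EventType.NODE_ANOMALY, "worker-2", 60),
]
scheduler = Scheduler(["worker-1", "worker-2", "worker-3"], 2)
results = scheduler.run_plan(plan)
statuses = [status for _, status in results]
```

In this sketch the third event is rejected, because taking down worker-2 
after worker-1 would leave a single healthy worker.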

Do you have any thoughts or questions about this chaos testing framework? 
Feedback is welcome and will help us further ensure the reliability of 
Celeborn through chaos testing.

Regards,
Nicholas Jiang

At 2024-07-03 05:20:57, "Nicholas Jiang" <[email protected]> wrote:
>Hi all,
>
>I would like to start a discussion on CIP-10: Introduce Celeborn Chaos Testing 
>Framework[1].
>
>A chaos testing framework is designed to simulate unpredictable and adverse 
>conditions in distributed systems to validate their robustness and resilience. 
>This proposal aims to simulate various anomalies and test the stability of 
>Celeborn in distributed environments via chaos testing.
>
>Looking forward to everyone's feedback and suggestions. Thank you!
>
>[1] 
>https://cwiki.apache.org/confluence/display/CELEBORN/CIP-10+Introduce+Celeborn+Chaos+Testing+Framework
>
>Regards,
>Nicholas Jiang
