milenkovicm opened a new pull request, #1789:
URL: https://github.com/apache/datafusion-ballista/pull/1789

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   Distributed query engines like Ballista are complex systems where failures 
can occur at many points — network partitions, executor crashes, resource 
exhaustion. Without a way to deliberately inject failures during testing, it is 
difficult to verify that the system recovers correctly and produces consistent 
results under partial failure conditions. This PR introduces a chaos monkey 
mechanism to enable controlled, reproducible robustness testing of Ballista's 
distributed execution pipeline.
   
   Chaos testing is a well-established technique (popularized by Netflix's 
Chaos Monkey) for proactively exposing weaknesses in distributed systems by 
randomly inducing failures in a controlled environment. By injecting failures 
at the physical plan level, we can verify retry logic, fault tolerance, and 
error propagation across executor nodes without relying on external fault 
injection tools.
   
   ## What changes are included in this PR?
   
   Adds `ChaosExec` (`ballista/core/src/execution_plans/chaos_exec.rs`) — a 
physical execution plan wrapper node that randomly fails with a configurable 
probability. It is transparent to the plan: it preserves the schema and output 
partitioning of its input node.
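
As a rough illustration of the wrapper idea, here is a minimal sketch that uses only the standard library. The `Plan` trait, `Scan`, `ChaosWrapper`, `InjectedFailure`, and `XorShift64` are hypothetical stand-ins invented for this example. They are not the types in `chaos_exec.rs`, and the real node would implement DataFusion's `ExecutionPlan` instead. The sketch only shows the two properties described above: metadata is delegated to the input unchanged, and execution fails with a configurable probability.

```rust
use std::cell::Cell;

/// Error raised when a failure is injected (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct InjectedFailure(pub String);

/// Simplified stand-in for a physical plan node; not DataFusion's `ExecutionPlan`.
pub trait Plan {
    fn schema(&self) -> Vec<String>;
    fn partition_count(&self) -> usize;
    fn execute(&self, partition: usize) -> Result<Vec<i64>, InjectedFailure>;
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
pub struct XorShift64(Cell<u64>);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(Cell::new(seed.max(1)))
    }

    /// Returns a pseudo-random value in [0.0, 1.0).
    pub fn next_f64(&self) -> f64 {
        let mut x = self.0.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0.set(x);
        // Keep the top 53 bits so the value is exactly representable as f64.
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Trivial leaf node that returns the same rows for every partition.
pub struct Scan {
    pub rows: Vec<i64>,
    pub partitions: usize,
}

impl Plan for Scan {
    fn schema(&self) -> Vec<String> {
        vec!["value".to_string()]
    }
    fn partition_count(&self) -> usize {
        self.partitions
    }
    fn execute(&self, _partition: usize) -> Result<Vec<i64>, InjectedFailure> {
        Ok(self.rows.clone())
    }
}

/// Wrapper that fails with the given probability and otherwise delegates.
pub struct ChaosWrapper<P: Plan> {
    input: P,
    probability: f64,
    rng: XorShift64,
}

impl<P: Plan> ChaosWrapper<P> {
    pub fn new(input: P, probability: f64, seed: u64) -> Self {
        ChaosWrapper { input, probability, rng: XorShift64::new(seed) }
    }
}

impl<P: Plan> Plan for ChaosWrapper<P> {
    // Schema and partitioning are passed through untouched.
    fn schema(&self) -> Vec<String> {
        self.input.schema()
    }
    fn partition_count(&self) -> usize {
        self.input.partition_count()
    }
    fn execute(&self, partition: usize) -> Result<Vec<i64>, InjectedFailure> {
        // A draw in [0.0, 1.0) is never below 0.0 and always below 1.0.
        if self.rng.next_f64() < self.probability {
            return Err(InjectedFailure(format!(
                "injected failure in partition {}",
                partition
            )));
        }
        self.input.execute(partition)
    }
}
```

With a probability of `0.0` the wrapper in this sketch never fails, and with `1.0` it always fails, which makes the two extremes convenient for deterministic tests.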
   
   Adds `ChaosExecRule` 
(`ballista/scheduler/src/state/aqe/optimizer_rule/chaos_exec.rs`), an AQE 
optimizer rule that randomly selects one node in the physical plan and wraps it 
with `ChaosExec`. Each optimize call injects exactly one `ChaosExec` node.
   
   Adds a protobuf definition and serde support for `ChaosExec` so it can be 
serialized and sent to executor nodes.
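
The "pick one node and wrap it" behaviour of `ChaosExecRule` can be sketched on a toy plan tree. The `Node` enum, `count_nodes`, `wrap_nth`, and `inject_chaos` below are hypothetical names for this example only, and the pre-order numbering is an assumption. The actual rule operates on DataFusion physical plans and may select nodes differently. The `pick` closure stands in for the random choice so the sketch stays deterministic.

```rust
/// Toy plan tree (hypothetical; the real rule works on DataFusion plans).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Leaf(String),
    Op { name: String, children: Vec<Node> },
    Chaos { probability: f64, input: Box<Node> },
}

/// Counts every node in the tree, including existing `Chaos` wrappers.
pub fn count_nodes(node: &Node) -> usize {
    match node {
        Node::Leaf(_) => 1,
        Node::Op { children, .. } => 1 + children.iter().map(count_nodes).sum::<usize>(),
        Node::Chaos { input, .. } => 1 + count_nodes(input),
    }
}

/// Wraps the node at pre-order position `target` in a `Chaos` node.
pub fn wrap_nth(node: Node, target: usize, probability: f64) -> Node {
    let mut next = 0;
    wrap_inner(node, target, probability, &mut next)
}

fn wrap_inner(node: Node, target: usize, probability: f64, next: &mut usize) -> Node {
    // Assign this node its pre-order index before visiting children.
    let index = *next;
    *next += 1;
    let rebuilt = match node {
        Node::Op { name, children } => {
            let mut wrapped = Vec::with_capacity(children.len());
            for child in children {
                wrapped.push(wrap_inner(child, target, probability, next));
            }
            Node::Op { name, children: wrapped }
        }
        Node::Chaos { probability: p, input } => Node::Chaos {
            probability: p,
            input: Box::new(wrap_inner(*input, target, probability, next)),
        },
        leaf => leaf,
    };
    if index == target {
        Node::Chaos { probability, input: Box::new(rebuilt) }
    } else {
        rebuilt
    }
}

/// Injects exactly one `Chaos` node; `pick` maps the node count to an index.
pub fn inject_chaos(plan: Node, probability: f64, pick: impl FnOnce(usize) -> usize) -> Node {
    let total = count_nodes(&plan); // always >= 1
    let target = pick(total) % total;
    wrap_nth(plan, target, probability)
}
```

Because each call wraps a single pre-order index, the node count grows by exactly one per call, which mirrors the "exactly one injection" guarantee described above.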
   
   Adds two new BallistaConfig keys:
   
   - `ballista.testing.chaos_execution.enabled` - (bool, default false)
   - `ballista.testing.chaos_execution.probability` - (f64 in [0.0, 1.0], 
default 0.25)
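
To make the defaults and the valid range concrete, here is a small standard-library sketch that reads the two keys from a plain string map. `ChaosSettings` and `chaos_settings` are hypothetical helpers for illustration. They are not part of `BallistaConfig`, and whether Ballista rejects out-of-range values in this way is an assumption.

```rust
use std::collections::HashMap;

pub const CHAOS_ENABLED: &str = "ballista.testing.chaos_execution.enabled";
pub const CHAOS_PROBABILITY: &str = "ballista.testing.chaos_execution.probability";

/// Parsed chaos settings (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
pub struct ChaosSettings {
    pub enabled: bool,
    pub probability: f64,
}

/// Reads both keys, applying the defaults listed above when a key is absent.
pub fn chaos_settings(options: &HashMap<String, String>) -> Result<ChaosSettings, String> {
    let enabled = match options.get(CHAOS_ENABLED) {
        Some(raw) => raw
            .parse::<bool>()
            .map_err(|e| format!("{}: {}", CHAOS_ENABLED, e))?,
        None => false, // default: disabled
    };
    let probability = match options.get(CHAOS_PROBABILITY) {
        Some(raw) => raw
            .parse::<f64>()
            .map_err(|e| format!("{}: {}", CHAOS_PROBABILITY, e))?,
        None => 0.25, // default probability
    };
    // Assumed validation: reject values outside [0.0, 1.0], including NaN.
    if !(0.0..=1.0).contains(&probability) {
        return Err(format!(
            "{} must be in [0.0, 1.0], got {}",
            CHAOS_PROBABILITY, probability
        ));
    }
    Ok(ChaosSettings { enabled, probability })
}
```

An empty map yields the documented defaults, so chaos injection stays off unless the `enabled` key is set explicitly.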
   
   ## Are there any user-facing changes?
   
   Two new opt-in configuration keys are introduced; chaos injection is disabled by default:
   
   - `ballista.testing.chaos_execution.enabled`: must be explicitly set to 
`true` to activate chaos injection.
   - `ballista.testing.chaos_execution.probability`: controls the per-node 
failure probability when chaos injection is enabled.
   
   These keys are scoped under `ballista.testing.*` to signal they are for 
testing environments only and should not be enabled in production workloads. No 
existing behavior is affected when the feature is disabled.
   
   ## Context 
   
   Depends on #1752, which should be merged first.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

