milenkovicm opened a new pull request, #1789: URL: https://github.com/apache/datafusion-ballista/pull/1789
## Which issue does this PR close?

Closes #.

## Rationale for this change

Distributed query engines like Ballista are complex systems where failures can occur at many points: network partitions, executor crashes, resource exhaustion. Without a way to deliberately inject failures during testing, it is difficult to verify that the system recovers correctly and produces consistent results under partial failure conditions. This PR introduces a chaos monkey mechanism to enable controlled, reproducible robustness testing of Ballista's distributed execution pipeline.

Chaos testing is a well-established technique (popularized by Netflix's Chaos Monkey) for proactively exposing weaknesses in distributed systems by randomly inducing failures in a controlled environment. By injecting failures at the physical plan level, we can verify retry logic, fault tolerance, and error propagation across executor nodes without relying on external fault injection tools.

## What changes are included in this PR?

- Adds `ChaosExec` (`ballista/core/src/execution_plans/chaos_exec.rs`), a physical execution plan wrapper node that randomly fails with a configurable probability. It is transparent to the plan: it preserves the schema and output partitioning of its input node.
- Adds `ChaosExecRule` (`ballista/scheduler/src/state/aqe/optimizer_rule/chaos_exec.rs`), an AQE optimizer rule that randomly selects one node in the physical plan and wraps it with `ChaosExec`. Exactly one injection per optimize call.
- Adds protobuf definition and serde support for `ChaosExec` so it can be serialized and sent to executor nodes.
- Adds two new `BallistaConfig` keys:
  - `ballista.testing.chaos_execution.enabled` (bool, default `false`)
  - `ballista.testing.chaos_execution.probability` (f64 in [0.0, 1.0], default 0.25)

## Are there any user-facing changes?
Two new opt-in configuration keys are introduced, both disabled by default:

- `ballista.testing.chaos_execution.enabled`: must be explicitly set to true to activate chaos injection.
- `ballista.testing.chaos_execution.probability`: controls the per-node failure probability when enabled.

These keys are scoped under `ballista.testing.*` to signal they are for testing environments only and should not be enabled in production workloads. No existing behavior is affected when the feature is disabled.

## Context

Depends on #1752, which should be merged first.
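For readers unfamiliar with the approach, the injection mechanism described under "What changes are included in this PR?" can be illustrated with a minimal, standalone sketch. This is not the code in this PR: `Plan`, `XorShift64`, `inject_chaos`, `wrap_nth`, and `should_fail` are hypothetical names, the plan tree is a toy enum rather than DataFusion's `ExecutionPlan`, and a small hand-rolled generator stands in for whatever randomness source the real rule uses. It only shows the two ideas: wrap exactly one randomly chosen node, and let the wrapper fail with a configured probability.

```rust
// Hypothetical, simplified stand-in for a physical plan tree (not Ballista's types).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Leaf(String),
    Node(String, Vec<Plan>),
    // Wrapper that may fail with `probability` before delegating to `input`.
    Chaos { probability: f64, input: Box<Plan> },
}

// Tiny xorshift64 generator so the sketch needs no external crates.
// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    // Uniform value in [0.0, 1.0), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Total number of nodes in the tree, wrappers included.
fn count_nodes(plan: &Plan) -> usize {
    match plan {
        Plan::Leaf(_) => 1,
        Plan::Node(_, children) => 1 + children.iter().map(count_nodes).sum::<usize>(),
        Plan::Chaos { input, .. } => 1 + count_nodes(input),
    }
}

// Number of chaos wrappers in the tree.
fn count_chaos(plan: &Plan) -> usize {
    match plan {
        Plan::Leaf(_) => 0,
        Plan::Node(_, children) => children.iter().map(count_chaos).sum(),
        Plan::Chaos { input, .. } => 1 + count_chaos(input),
    }
}

// Pre-order walk that wraps the node at index `remaining` and then stops injecting.
fn wrap_nth(plan: Plan, remaining: &mut Option<usize>, probability: f64) -> Plan {
    let current = *remaining;
    match current {
        None => return plan, // the single injection already happened
        Some(0) => {
            *remaining = None;
            return Plan::Chaos { probability, input: Box::new(plan) };
        }
        Some(n) => *remaining = Some(n - 1),
    }
    match plan {
        Plan::Node(name, children) => {
            let mut rewritten = Vec::with_capacity(children.len());
            for child in children {
                rewritten.push(wrap_nth(child, remaining, probability));
            }
            Plan::Node(name, rewritten)
        }
        Plan::Chaos { probability: p, input } => Plan::Chaos {
            probability: p,
            input: Box::new(wrap_nth(*input, remaining, probability)),
        },
        leaf => leaf,
    }
}

// Picks one node at random and wraps it: exactly one injection per call.
fn inject_chaos(plan: Plan, probability: f64, rng: &mut XorShift64) -> Plan {
    let total = count_nodes(&plan) as u64;
    let target = (rng.next_u64() % total) as usize;
    wrap_nth(plan, &mut Some(target), probability)
}

// Decision the wrapper would make at execution time.
fn should_fail(probability: f64, rng: &mut XorShift64) -> bool {
    rng.next_f64() < probability
}
```

Calling `inject_chaos` on a three-operator tree with probability `0.25` yields the same tree with one extra `Chaos` node around a randomly chosen operator; `should_fail` then returns `true` for roughly a quarter of the draws. A fixed seed keeps a failing run reproducible in this sketch, though whether the actual rule is seeded is not stated in this description.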
