durgaprasadml commented on issue #38710:
URL: https://github.com/apache/beam/issues/38710#issuecomment-4583590691
### Summary of Investigation and Stabilization Plan for #38710
I investigated the flakiness of the PostCommit Java ValidatesRunner Dataflow
Streaming workflow and identified several structural causes contributing to the
>50% failure rate.
### Root Causes Identified
1. Unthrottled Parallelism / GCP Quota Exhaustion
- The validatesRunner task currently runs with:
groovy maxParallelForks Integer.MAX_VALUE
- Combined with GitHub Actions max-workers: 12, this can launch up to 12
concurrent Dataflow streaming jobs.
- This frequently exhausts GCP quotas/resources (Compute Engine IPs,
CPUs, concurrent Dataflow jobs) and overloads the self-hosted runner.
2. Legacy Streaming Worker Non-Termination
- The workflow currently uses the legacy VM-based streaming path.
- Bounded streaming pipelines do not reliably terminate automatically
under the legacy runner, causing jobs to remain in RUNNING until the 15-minute
timeout expires.
3. Overly Aggressive Failure Detection
- TestDataflowRunner immediately cancels jobs upon encountering any
JOB_MESSAGE_ERROR.
- In distributed streaming environments, transient worker/network errors
are expected and are often automatically recovered by Dataflow retries.
- This behavior causes false negatives and unnecessary test failures.
### Implemented Fixes
#### 1. Throttled Concurrency
Reduced validatesRunner parallelism from unbounded forks to:
groovy maxParallelForks = 4
This significantly reduces:
- quota exhaustion,
- worker startup starvation,
- and CI runner overload.
#### 2. Migration to Dataflow Streaming Engine
Updated the workflow to use:
bash :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine
instead of the legacy streaming path.
Benefits:
- faster worker startup,
- better bounded-source termination behavior,
- lower infrastructure overhead,
- improved stability.
#### 3. Metrics-Driven Early Success/Failure Cancellation
Enhanced TestDataflowRunner.java to poll:
- PAssertSuccess
- PAssertFailure
during streaming execution.
Behavior:
- If assertions succeed → cancel immediately as success.
- If assertions fail → cancel immediately as failure.
This reduces successful test runtime from ~15 minutes to ~2–3 minutes and
lowers flakiness exposure.
#### 4. Delayed Cancellation on Transient Errors
Modified the background cancellation logic to allow a recovery window before
aborting on transient JOB_MESSAGE_ERROR entries.
This allows Dataflow self-healing/retries to stabilize the pipeline before
the test is failed.
#### 5. Gradle Test Retry Support
Added the org.gradle.test-retry plugin for CI integration tests to
automatically retry transient infrastructure failures.
### Validation
Added/updated tests for:
- early success termination,
- early failure termination,
- streaming metrics polling behavior.
Suggested verification command:
bash ./gradlew :runners:google-cloud-dataflow-java:test \ --tests
"org.apache.beam.runners.dataflow.TestDataflowRunnerTest"
### Expected Outcome
These changes should:
- drastically reduce workflow flakiness,
- reduce GCP resource pressure,
- shorten overall workflow runtime,
- improve streaming test reliability,
- and stabilize CI execution behavior.
I’ve prepared the implementation changes and validation updates for review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]