durgaprasadml opened a new pull request, #38753: URL: https://github.com/apache/beam/pull/38753
## Summary This PR stabilizes the PostCommit Java ValidatesRunner Dataflow Streaming workflow, which is currently failing more than 50% of the time due to infrastructure contention, legacy streaming runner limitations, and overly aggressive streaming job cancellation behavior. Fixes #38710 --- ## Root Causes Identified ### 1. Unbounded Parallelism / Resource Exhaustion The validatesRunner tasks were configured with: groovy id="n9b5m2" maxParallelForks Integer.MAX_VALUE Combined with GitHub Actions max-workers: 12, this could launch up to 12 concurrent Dataflow streaming jobs simultaneously. This frequently exhausted: - Compute Engine IP quotas - CPUs - concurrent Dataflow job quotas - self-hosted runner resources leading to worker startup starvation and test timeouts. --- ### 2. Legacy Streaming Worker Non-Termination The workflow previously used the legacy VM-based streaming execution path: bash id="ewl1pu" :runners:google-cloud-dataflow-java:validatesRunnerStreaming Bounded streaming pipelines under the legacy runner often failed to terminate automatically, remaining in RUNNING state until the 15-minute timeout cancelled them. --- ### 3. Aggressive Failure Cancellation TestDataflowRunner immediately cancelled jobs upon encountering any JOB_MESSAGE_ERROR, even for transient worker/network issues that Dataflow could automatically recover from. This caused false-negative failures in CI. --- ## Changes Implemented ### Throttle validatesRunner concurrency Reduced validatesRunner concurrency to: groovy id="pvb0u9" maxParallelForks = 4 with support for overriding via: bash id="6a4s1z" -PmaxParallelForks=<n> This reduces quota pressure and runner overload. --- ### Migrate workflow to Streaming Engine Updated the workflow to run: bash id="0kcg4q" :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine Benefits: - faster startup - improved bounded-source termination - reduced infrastructure overhead - improved stability --- ### Add metrics-driven early termination Enhanced TestDataflowRunner to continuously poll: - PAssertSuccess - PAssertFailure during streaming execution. Behavior: - early cancel on assertion success - early cancel on assertion failure This reduces successful test runtime from ~15 minutes to ~2–3 minutes. --- ### Delay cancellation on transient worker errors Added a recovery window before cancelling jobs due to transient JOB_MESSAGE_ERROR entries, allowing Dataflow retries and self-healing to stabilize the pipeline. --- ### Add Gradle test retry support Integrated the org.gradle.test-retry plugin for CI integration tests to reduce transient infrastructure-related failures. --- ## Validation Added/updated tests covering: - streaming early-success termination - streaming early-failure termination - metrics polling behavior Verification command: bash id="j7bw3w" ./gradlew :runners:google-cloud-dataflow-java:test \ --tests "org.apache.beam.runners.dataflow.TestDataflowRunnerTest" Streaming validation command: bash id="o3llzt" ./gradlew :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine \ -PtestFilter="org.apache.beam.sdk.transforms.GroupByKeyTest" --- ## Expected Impact These changes are expected to: - significantly reduce CI flakiness - reduce GCP quota pressure - improve workflow runtime stability - shorten streaming test execution time - improve overall validatesRunner reliability -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
