durgaprasadml commented on issue #38710:
URL: https://github.com/apache/beam/issues/38710#issuecomment-4583590691

   ### Summary of Investigation and Stabilization Plan for #38710
   
   I investigated the flakiness of the PostCommit Java ValidatesRunner Dataflow 
Streaming workflow and identified several structural causes contributing to the 
>50% failure rate.
   
   ### Root Causes Identified
   
   1. Unthrottled Parallelism / GCP Quota Exhaustion
      - The validatesRunner task currently runs with:
        groovy      maxParallelForks Integer.MAX_VALUE      
      - Combined with GitHub Actions max-workers: 12, this can launch up to 12 
concurrent Dataflow streaming jobs.
      - This frequently exhausts GCP quotas/resources (Compute Engine IPs, 
CPUs, concurrent Dataflow jobs) and overloads the self-hosted runner.
   
   2. Legacy Streaming Worker Non-Termination
      - The workflow currently uses the legacy VM-based streaming path.
      - Bounded streaming pipelines do not reliably terminate automatically 
under the legacy runner, causing jobs to remain in RUNNING until the 15-minute 
timeout expires.
   
   3. Overly Aggressive Failure Detection
      - TestDataflowRunner immediately cancels jobs upon encountering any 
JOB_MESSAGE_ERROR.
      - In distributed streaming environments, transient worker/network errors 
are expected and are often automatically recovered by Dataflow retries.
      - This behavior causes false negatives and unnecessary test failures.
   
   ### Implemented Fixes
   
   #### 1. Throttled Concurrency
   Reduced validatesRunner parallelism from unbounded forks to:
   groovy maxParallelForks = 4 
   
   This significantly reduces:
   - quota exhaustion,
   - worker startup starvation,
   - and CI runner overload.
   
   #### 2. Migration to Dataflow Streaming Engine
   Updated the workflow to use:
   bash :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine 
   
   instead of the legacy streaming path.
   
   Benefits:
   - faster worker startup,
   - better bounded-source termination behavior,
   - lower infrastructure overhead,
   - improved stability.
   
   #### 3. Metrics-Driven Early Success/Failure Cancellation
   Enhanced TestDataflowRunner.java to poll:
   - PAssertSuccess
   - PAssertFailure
   
   during streaming execution.
   
   Behavior:
   - If assertions succeed → cancel immediately as success.
   - If assertions fail → cancel immediately as failure.
   
   This reduces successful test runtime from ~15 minutes to ~2–3 minutes and 
lowers flakiness exposure.
   
   #### 4. Delayed Cancellation on Transient Errors
   Modified the background cancellation logic to allow a recovery window before 
aborting on transient JOB_MESSAGE_ERROR entries.
   
   This allows Dataflow self-healing/retries to stabilize the pipeline before 
the test is failed.
   
   #### 5. Gradle Test Retry Support
   Added the org.gradle.test-retry plugin for CI integration tests to 
automatically retry transient infrastructure failures.
   
   ### Validation
   
   Added/updated tests for:
   - early success termination,
   - early failure termination,
   - streaming metrics polling behavior.
   
   Suggested verification command:
   bash ./gradlew :runners:google-cloud-dataflow-java:test \   --tests 
"org.apache.beam.runners.dataflow.TestDataflowRunnerTest" 
   
   ### Expected Outcome
   
   These changes should:
   - drastically reduce workflow flakiness,
   - reduce GCP resource pressure,
   - shorten overall workflow runtime,
   - improve streaming test reliability,
   - and stabilize CI execution behavior.
   
   I’ve prepared the implementation changes and validation updates for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to