durgaprasadml opened a new pull request, #38752:
URL: https://github.com/apache/beam/pull/38752

   Closes #38693
   
   ### Summary
   
   This PR improves the reliability of the Playground CI Nightly workflow by 
addressing two recurring sources of flakiness observed in scheduled runs.
   
   ### Root Causes
   
   #### 1. GitHub API rate limiting
   
   The workflow retrieves the latest Beam release using an unauthenticated 
GitHub API request:
   
   `https://api.github.com/repos/apache/beam/releases/latest`
   
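   Unauthenticated GitHub REST API requests are limited to 60 per hour per 
source IP, a budget that CI runners can exhaust under heavy activity. A 
rate-limited response is an error body without a `tag_name` field, so the 
workflow ends up with an empty or null release tag and fails later while 
resolving SDK versions and Docker images.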
   
   #### 2. Backend startup race condition
   
   The workflow launches Playground backend containers in detached mode and 
immediately starts the CI runner against port 8080.
   
   Under slower runner conditions or higher system load, the backend service 
may not yet be ready to accept connections, causing intermittent connection 
failures and flaky CI runs.
   
   ---
   
   ## Changes
   
   ### GitHub API hardening
   
   - Authenticate release API requests using `${{ secrets.GITHUB_TOKEN }}`
   - Add fallback handling when API responses are invalid or empty
   - Add git tag fallback resolution if the API request fails
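
The resolution order described above could be sketched roughly as follows. This is an illustration, not the code in this PR: the function names `extract_tag`, `latest_release_tag`, and `resolve_beam_release` are hypothetical, the token is assumed to reach the step as a `GITHUB_TOKEN` environment variable, and the fallback assumes release tags look like `vX.Y.Z` and that GNU `sort -V` is available.

```shell
#!/usr/bin/env bash

# Extract "tag_name" from a release JSON payload on stdin.
# Prints nothing when the field is missing or null (e.g. a rate-limit error body).
extract_tag() {
  sed -n 's/.*"tag_name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Pick the highest final-release tag (vX.Y.Z, no suffix) from tag names on stdin.
# Assumption: GNU sort with version ordering (-V).
latest_release_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n 1
}

# Try the authenticated API first, then fall back to listing git tags.
resolve_beam_release() {
  local tag auth=()
  # Only send the header when a token is present (assumed env var name).
  if [ -n "${GITHUB_TOKEN:-}" ]; then
    auth=(-H "Authorization: Bearer ${GITHUB_TOKEN}")
  fi

  # -f makes curl fail on HTTP errors such as 403, leaving the tag empty.
  tag="$(curl -fsSL --max-time 30 "${auth[@]}" \
           -H "Accept: application/vnd.github+json" \
           https://api.github.com/repos/apache/beam/releases/latest \
         | extract_tag)" || true

  if [ -z "$tag" ]; then
    echo "Release API returned no tag; falling back to git tags" >&2
    tag="$(git ls-remote --tags --refs https://github.com/apache/beam.git \
           | sed 's#.*refs/tags/##' \
           | latest_release_tag)" || true
  fi

  if [ -z "$tag" ]; then
    echo "Could not resolve a Beam release tag" >&2
    return 1
  fi
  printf '%s\n' "$tag"
}
```

A step would then call `BEAM_VERSION="$(resolve_beam_release)"` and stop early if that fails, rather than passing an empty value to the later Docker steps.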
   
   ### Backend readiness validation
   
   - Add explicit backend readiness polling before running CI tests
   - Poll backend port 8080 with bounded retry logic
   - Fail fast with container diagnostics when startup fails
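
A bounded readiness poll of this kind might look like the sketch below. The helper names `port_open` and `wait_until`, the attempt count, and the delay are illustrative assumptions, not values taken from the workflow. The probe relies on bash's `/dev/tcp` redirection and on `timeout` from coreutils.

```shell
#!/usr/bin/env bash

# Succeeds if a TCP connection to HOST PORT can be opened within 2 seconds.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 as soon as COMMAND succeeds, 1 otherwise.
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "Ready after ${i} attempt(s)"
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed" >&2
    # Skip the pause after the last attempt so failure is reported promptly.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `wait_until 30 2 port_open localhost 8080 || exit 1` gives the backend up to roughly a minute to start and returns on the first successful connection.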
   
   ### Observability improvements
   
   - Print container logs on startup timeout
   - Print container inspect information for easier debugging
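
As a sketch of what such a diagnostics step could do on timeout, assuming the container name is known to the step (`playground-backend` in the usage note is a placeholder, and `dump_container_diagnostics` is a hypothetical helper name):

```shell
#!/usr/bin/env bash

# dump_container_diagnostics CONTAINER [TAIL_LINES]
# Prints recent logs and the container state. Each docker call is allowed to
# fail so that a missing container does not hide the original startup error.
dump_container_diagnostics() {
  local container="$1" tail_lines="${2:-200}"

  # ::group:: / ::endgroup:: fold the output in the GitHub Actions log viewer.
  echo "::group::Logs for ${container}"
  docker logs --tail "$tail_lines" "$container" 2>&1 || true
  echo "::endgroup::"

  echo "::group::State of ${container}"
  docker inspect --format '{{json .State}}' "$container" 2>&1 || true
  echo "::endgroup::"
}
```

A failing readiness check could then run `dump_container_diagnostics playground-backend; exit 1`, so the job log shows why the backend never came up.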
   
   ---
   
   ## Validation
   
   Validated by:
   - testing authenticated API resolution,
   - verifying fallback behavior for invalid API responses,
   - running repeated workflow simulations with delayed container startup,
   - confirming readiness polling succeeds before CI execution.
   
   The readiness polling exits immediately once the backend becomes available, 
minimizing runtime overhead on successful runs.
   
   ---
   
   ## Expected Impact
   
   This change reduces intermittent Playground CI Nightly failures caused by:
   - transient GitHub API failures,
   - release tag resolution issues,
   - backend container initialization timing races.
   
   The implementation improves determinism and observability while keeping 
workflow changes minimal and maintainable.
   
   ---
   
   ## Testing
   
   Verified:
   - workflow YAML validation,
   - backend readiness polling behavior,
   - fallback release tag resolution logic,
   - failure diagnostics output.
   
   Only CI workflow files were changed; no production runtime code paths 
were modified.

