durgaprasadml opened a new pull request, #38752: URL: https://github.com/apache/beam/pull/38752
Closes #38693

### Summary

This PR improves the reliability of the Playground CI Nightly workflow by addressing two recurring sources of flakiness observed in scheduled runs.

### Root Causes

#### 1. GitHub API rate limiting

The workflow retrieves the latest Beam release using an unauthenticated GitHub API request to `https://api.github.com/repos/apache/beam/releases/latest`.

Under heavy CI activity, unauthenticated requests may hit GitHub REST API rate limits, causing invalid responses or null release tags. This results in downstream failures while resolving SDK versions and Docker images.

#### 2. Backend startup race condition

The workflow launches Playground backend containers in detached mode and immediately starts the CI runner against port 8080. Under slower runner conditions or higher system load, the backend service may not yet be ready to accept connections, causing intermittent connection failures and flaky CI runs.

---

## Changes

### GitHub API hardening

- Authenticate release API requests using `${{ secrets.GITHUB_TOKEN }}`
- Add fallback handling when API responses are invalid or empty
- Add git tag fallback resolution if the API request fails

### Backend readiness validation

- Add explicit backend readiness polling before running CI tests
- Poll backend port 8080 with bounded retry logic
- Fail fast with container diagnostics when startup fails

### Observability improvements

- Print container logs on startup timeout
- Print container inspect information for easier debugging

---

## Validation

Validated by:

- testing authenticated API resolution,
- verifying fallback behavior for invalid API responses,
- repeated workflow simulations with delayed container startup,
- confirming readiness polling succeeds before CI execution.

The readiness polling exits immediately once the backend becomes available, minimizing runtime overhead on successful runs.
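For reviewers who want a concrete picture of the bounded readiness polling described above, here is a minimal bash sketch. It is not the code from this PR's workflow: the helper names (`wait_until`, `port_open`, `dump_diagnostics`), the attempt count, the delay, and the container name are all illustrative assumptions. The probe relies on bash's `/dev/tcp` redirection, which is assumed to be available on the runner.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are assumptions, not taken from the PR.

# Run a probe command up to max_attempts times, sleeping `delay` seconds
# between attempts. Returns as soon as the probe succeeds.
wait_until() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    # No point sleeping after the final failed attempt.
    if (( attempt < max_attempts )); then
      sleep "$delay"
    fi
  done
  echo "not ready after ${max_attempts} attempt(s)" >&2
  return 1
}

# Probe: succeeds if a TCP connection to host:port can be opened.
# The subshell closes the descriptor again when it exits.
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Print container logs and inspect output to aid debugging on timeout.
dump_diagnostics() {
  docker logs --tail 200 "$1" || true
  docker inspect "$1" || true
}

# Possible use in a workflow step (container name is a placeholder):
#   if ! wait_until 30 2 port_open localhost 8080; then
#     dump_diagnostics playground-backend
#     exit 1
#   fi
```

Keeping the probe separate from the retry loop means the same loop could wrap an HTTP health check instead of a raw port check, should a listening port turn out to be too weak a readiness signal.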
---

## Expected Impact

This change reduces intermittent Playground CI Nightly failures caused by:

- transient GitHub API failures,
- release tag resolution issues,
- backend container initialization timing races.

The implementation improves determinism and observability while keeping workflow changes minimal and maintainable.

---

## Testing

Verified:

- workflow YAML validation,
- backend readiness polling behavior,
- fallback release tag resolution logic,
- failure diagnostics output.

No production runtime code paths were modified outside CI workflows.
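The release tag resolution with a git tag fallback, as listed under GitHub API hardening, could look roughly like the bash sketch below. This is an illustration under stated assumptions, not the PR's actual implementation: the function names are hypothetical, stable release tags are assumed to have the form `vX.Y.Z`, `GITHUB_TOKEN` is assumed to be exported to the step, and `curl`, `jq`, `git`, and GNU `sort -V` are assumed to be present on the runner.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and the vX.Y.Z tag format are assumptions.

# Accept only tags that look like a stable release, e.g. v2.60.0.
# Rejects empty strings, the literal "null", and release candidates.
is_valid_tag() {
  [[ "$1" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]
}

# Read tag names on stdin and print the highest stable version.
latest_stable_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n 1
}

# Try the authenticated releases API first, then fall back to remote git tags.
resolve_release_tag() {
  local tag
  tag="$(curl -fsSL \
      -H "Authorization: Bearer ${GITHUB_TOKEN:-}" \
      -H "Accept: application/vnd.github+json" \
      "https://api.github.com/repos/apache/beam/releases/latest" \
    | jq -r '.tag_name // empty')" || tag=""

  if ! is_valid_tag "$tag"; then
    # Fallback: list remote tags and pick the newest stable one.
    tag="$(git ls-remote --tags --refs https://github.com/apache/beam.git \
      | sed 's#.*refs/tags/##' | latest_stable_tag)"
  fi

  if ! is_valid_tag "$tag"; then
    echo "could not resolve a release tag" >&2
    return 1
  fi
  printf '%s\n' "$tag"
}
```

Validating the tag after each source means an empty or `null` API response is treated the same as a failed request, so later steps never see a malformed version string.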
