aIbrahiim commented on code in PR #37989:
URL: https://github.com/apache/beam/pull/37989#discussion_r3017777876
##########
.github/workflows/beam_PreCommit_Python.yml:
##########
@@ -111,12 +111,14 @@ jobs:
env:
TOX_TESTENV_PASSENV:
"DOCKER_*,TESTCONTAINERS_*,TC_*,BEAM_*,GRPC_*,OMP_*,OPENBLAS_*,PYTHONHASHSEED,PYTEST_*"
# Aggressive retry and timeout settings for flaky CI
- PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5
--reruns-delay=15 --timeout=600 --disable-warnings"
+ PYTEST_ADDOPTS: "-v --tb=short --maxfail=5 --durations=30 --reruns=5
--reruns-delay=15 --timeout=900 --disable-warnings"
# Container stability - much more generous timeouts
TC_TIMEOUT: "300"
TC_MAX_TRIES: "15"
TC_SLEEP_TIME: "5"
# Additional gRPC stability for flaky environment
+ GRPC_ARG_KEEPALIVE_TIME_MS: "60000"
Review Comment:
From my side, SSH into a stuck self-hosted runner is probably only possible
if we are granted runner access, so it’s not the easiest path to start with.
Locally I couldn’t reproduce it yet because of WSL network limits (can’t
reach plugins.gradle.org/pypi), so I don’t have a solid repro rate from local
runs. Given that, the most straightforward option is to add temporary CI
instrumentation (faulthandler + periodic pystack dumps uploaded as artifacts)
so every stuck run automatically gives us stack traces.
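
As a rough illustration of the faulthandler half of that idea, here is a minimal sketch using only the standard library. The helper names (`start_stack_dump_watchdog`, `stop_stack_dump_watchdog`), the log path, and the 300-second interval are assumptions made for the example, not anything in this PR; the pystack dumps and the artifact upload step are not shown.

```python
import faulthandler


def start_stack_dump_watchdog(log_path, interval_s=300.0):
    """Hypothetical helper: periodically dump all thread stacks to log_path.

    The returned file object must stay open for as long as the watchdog is
    active, because faulthandler writes straight to its file descriptor.
    """
    log_file = open(log_path, "w")
    # Also dump tracebacks on fatal signals (SIGSEGV, SIGABRT, ...).
    faulthandler.enable(file=log_file, all_threads=True)
    # Dump every thread's stack each interval_s seconds until cancelled,
    # so a hung run leaves a trail of snapshots behind.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=log_file)
    return log_file


def stop_stack_dump_watchdog(log_file):
    """Hypothetical helper: cancel the periodic dumps and close the log."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
    log_file.close()
```

Something like this could be called once at test-session start (for example from a conftest.py hook), with the log file then uploaded as a workflow artifact. One caveat: pytest's own `faulthandler_timeout` option also relies on `dump_traceback_later`, so the two mechanisms should probably not be enabled together.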