Yeah, it would be great to figure this out. I have also noticed quite a number of those failures, and they are related to our GitHub Runner infrastructure. For some reason our runners are killed and evicted more often than before, so we will likely need to take a closer look. It is not REALLY annoying yet, and analysing it is a bit time-consuming, so usually Ash and I looked at it when we had a bit of spare time. But maybe someone from the committers team would like to take a look at it from the "devopsy" point of view?
I think it would be great if someone looked at it with a fresh eye. Having just me and Ash look at it when we have time to spare is not nearly good enough: there are two of us, but that is still "just two" Points of Failure.

The current solution is kinda complex-ish. It uses a combination of a GitHub Runner modified by Ash, AWS-specific infrastructure, DynamoDB to keep shared authentication information, Auto-Scaling groups, a webhook from GitHub Actions triggering the scaling in/out, and Spot Instances started as needed (which can get evicted at any time but are 8x cheaper to run). So some fine-tuning might be needed, preceded by an analysis of the root causes of the failures. It requires quite an open mind about the tools and technologies used, as well as some cloud management/monitoring/infrastructure devopsing experience.

Eventually we might want to migrate to a K8S-managed infrastructure, as the Apache Beam team together with ASF Infra (with some of our help and guidance) is working on building a solution that is supposed to be more portable and easier. So, similarly to the Python rewrite of Breeze and CI Actions (which we are finishing), one of the goals for the infra should be that we have more people who are involved, know how to fix and run things, and make it more "standard".

Any volunteers to take a look at the current setup are most welcome. I think we need a committer, due to the sensitivity of the infrastructure access. Anyone? Who would like to help here?

J.

On Sat, May 14, 2022 at 2:09 AM Ping Zhang <[email protected]> wrote:

> Hi friends,
>
> Recently, I noticed my PRs got lots of this kind of errors:
>
> Some checks were not successful
> 58 successful, 4 skipped, and 1 cancelled checks
>
> Tests / Helm Chart Executor Upgrade (pull_request) Cancelled after 104m — Helm Chart Executor Upgrade
>
> For example https://github.com/apache/airflow/pull/23655 and
> https://github.com/apache/airflow/pull/23684, and I had to force push
> many times.
>
> I am wondering what causes this and how I can avoid this error.
>
> Thanks,
>
> Ping
