Hi Jarek,

Thanks for the detailed context. Looking forward to the new solution and infra.

Thanks,

Ping

On Sat, May 14, 2022 at 7:50 AM Jarek Potiuk <[email protected]> wrote:

> Yeah. Would be great to figure it out. I also noticed quite a number of
> those and they are related to our GitHub Runner infrastructure. For some
> reason our runners are killed and evicted more often than before, so
> likely we will need to take a closer look at it. Until it becomes REALLY
> annoying, this is a bit time-consuming to analyse and look at - and
> usually Ash and myself looked at it when we had a bit of spare time, but
> maybe someone from the committers team would like to take a look at it from
> the "devopsy" point of view?
>
> I think it would be great if someone looks at it with a fresh eye, as
> having just me and Ash looking at it when we have time to spare is not
> nearly good enough and we are two - but still "just two" - Points of
> Failure. The current solution is kinda complex-ish, using a combination of:
> GitHub Runner modified by Ash, AWS-specific infrastructure, DynamoDB to keep
> shared authentication information, Auto-Scaling groups, a webhook from GitHub
> Actions triggering the scaling in/out, and Spot Instances started as needed
> (which can get evicted any time but are 8x cheaper to run). So some
> fine-tuning might be needed (preceded by an analysis of the root causes
> of the failures). It requires quite an open mind on the
> tools and technologies used as well as some cloud
> management/monitoring/infrastructure devopsing experience.
>
> Eventually we might want to migrate to a K8S-managed infrastructure, as the
> Apache Beam team together with ASF Infra (with some of our help and
> guidance) works on building a solution that is supposed to be more portable
> and easier. So similarly to the Python Breeze and CI Actions rewrite (which
> we are finishing) - one of the goals for the infra should be that we have
> more people who are involved, know how to fix and run things and make it
> more "standard".
>
> Any volunteers to take a look at the current setup are most welcome. I
> think we need a committer, due to the sensitivity of the infrastructure access.
>
> Anyone? Who would like to help here?
>
> J.
>
>
> On Sat, May 14, 2022 at 2:09 AM Ping Zhang <[email protected]> wrote:
>
>> Hi friends,
>>
>> Recently, I noticed my PRs got lots of this kind of error:
>>
>> Some checks were not successful
>> 58 successful, 4 skipped, and 1 cancelled checks
>>
>> Tests / Helm Chart Executor Upgrade (pull_request) Cancelled after 104m
>> - Helm Chart Executor Upgrade
>>
>> For example https://github.com/apache/airflow/pull/23655 and
>> https://github.com/apache/airflow/pull/23684, and I had to force push
>> many times.
>>
>> I am wondering what causes this and how I can avoid this error.
>>
>> Thanks,
>>
>> Ping
