Yeah. Would be great to figure it out. I also noticed quite a number of
those, and they are related to our GitHub Runner infrastructure. For some
reason our runners are killed and evicted more often than before, so we
will likely need to take a closer look at it. Until it becomes REALLY
annoying, it is a bit time-consuming to analyse, and usually Ash and I
looked at it when we had a bit of spare time. But maybe someone from the
committers team would like to take a look at it from the "devopsy" point
of view?

I think it would be great if someone looked at it with a fresh eye. Having
just me and Ash look at it when we have time to spare is not nearly good
enough, and we are two - but still "just two" - Points of Failure. The
current solution is somewhat complex: a combination of the GitHub Runner
modified by Ash, AWS-specific infrastructure, DynamoDB to keep shared
authentication information, Auto Scaling groups, a webhook from GitHub
Actions triggering the scaling in/out, and Spot Instances started as
needed (which can get evicted at any time but are 8x cheaper to run).
Some fine-tuning, preceded by an analysis of the root causes of the
failures, might be needed. So it requires quite an open mind on the tools
and technologies used, as well as some cloud
management/monitoring/infrastructure devops experience.
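To make the moving parts a bit more concrete, here is a minimal,
hypothetical sketch of the kind of scaling decision such a webhook handler
makes. This is NOT our actual code - the function names, payload fields,
and thresholds are all invented for illustration; a real handler would
call the AWS Auto Scaling API (e.g. boto3's set_desired_capacity) instead
of just returning a number:

```python
# Illustrative sketch only: payload fields ("queued_jobs" etc.) and the
# one-runner-per-job policy are assumptions, not the real implementation.

def desired_runner_count(queued_jobs: int, running_jobs: int,
                         min_runners: int = 2, max_runners: int = 20) -> int:
    """Return the Auto Scaling group capacity we would request.

    Assume each runner handles one job at a time, so we want roughly one
    runner per queued-or-running job, clamped to the group's bounds.
    """
    needed = queued_jobs + running_jobs
    return max(min_runners, min(max_runners, needed))


def handle_webhook(event: dict, current_runners: int) -> int:
    """React to a (simplified, made-up) workflow-job webhook payload.

    A real handler would call the AWS Auto Scaling API here to launch or
    retire Spot Instances; here we just return the target capacity.
    """
    action = event.get("action")
    if action in ("queued", "completed"):
        return desired_runner_count(
            event.get("queued_jobs", 0),
            event.get("running_jobs", 0),
        )
    # Ignore events that don't change the queue depth.
    return current_runners
```

For example, a "queued" event reporting 25 pending jobs would be clamped
to the group maximum of 20 runners, while an idle queue scales the group
down to the minimum of 2. The Spot-Instance eviction problem is exactly
why the clamping and re-scaling has to be re-evaluated on every event
rather than set once.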

Eventually we might want to migrate to a K8S-managed infrastructure, as
the Apache Beam team, together with ASF Infra (and with some of our help
and guidance), is working on building a solution that is supposed to be
more portable and easier. Similarly to the Breeze-in-Python and CI Actions
rewrite (which we are finishing), one of the goals for the infra should be
that we have more people who are involved and know how to fix and run
things, and that we make it more "standard".

Any volunteers to take a look at the current setup are most welcome. I
think we need a committer, due to the sensitivity of the infrastructure
access.

Anyone? Who would like to help here?

J.


On Sat, May 14, 2022 at 2:09 AM Ping Zhang <[email protected]> wrote:

> Hi friends,
>
> Recently, I noticed my PRs got lots of this kind of errors:
>
> Some checks were not successful: 58 successful, 4 skipped, and 1
> cancelled checks
>
> Tests / Helm Chart Executor Upgrade (pull_request) Cancelled after 104m —
> Helm Chart Executor Upgrade
>
> For example https://github.com/apache/airflow/pull/23655 and
> https://github.com/apache/airflow/pull/23684, and I had to force push
> many times.
>
> I am wondering what causes this and how I can avoid this error.
>
> Thanks,
>
> Ping
>
