I started looking at Buildkite and it would solve one large problem for the
DataFusion/Ballista project. We really need to be running integration tests
against large data sets (such as TPC-H at SF=100, about 100 GB), and
self-hosted Buildkite makes this simple to accomplish. I even have some
modest hardware that I purchased specifically for this purpose, but I wasn't
confident that I could set this up in a secure way that would protect
against malicious code being submitted. However, if we implement the
necessary GitHub hooks so that these builds only run after a committer adds
an "ok to build" comment, then I think it would be fine. This is the
approach used in Apache Spark.
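
To make the "ok to build" idea concrete, here is a minimal sketch of the
check such a hook could perform on a GitHub `issue_comment` webhook payload.
It is only an illustration under stated assumptions: the committer allowlist
and the function name are hypothetical, and nothing here is taken from the
Spark or Buildkite tooling.

```python
from typing import Any, Collection, Mapping

# Phrase a committer must post to approve a build (from the proposal above).
TRIGGER_PHRASE = "ok to build"

# Hypothetical allowlist of committer logins; a real setup would need to
# load this from somewhere the project trusts.
COMMITTERS = frozenset({"committer-a", "committer-b"})


def should_trigger_build(
    event: Mapping[str, Any],
    committers: Collection[str] = COMMITTERS,
) -> bool:
    """Return True only for a new PR comment from a committer that
    consists of the trigger phrase."""
    # Ignore edited or deleted comments; only react to new ones.
    if event.get("action") != "created":
        return False

    # issue_comment events fire for plain issues too; PRs carry a
    # "pull_request" key inside the "issue" object.
    issue = event.get("issue") or {}
    if "pull_request" not in issue:
        return False

    comment = event.get("comment") or {}
    login = (comment.get("user") or {}).get("login", "")
    body = comment.get("body") or ""

    # Require the whole comment to be the phrase, so that quoting it in a
    # longer discussion does not start a build by accident.
    return login in committers and body.strip().lower() == TRIGGER_PHRASE
```

A hook built this way would probably also want to record which commit was
approved, since a push made after the comment would otherwise run unreviewed
code on the self-hosted hardware.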

On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.
>
> On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
> <szucs.kriszt...@gmail.com> wrote:
> >
> > On Thu, Apr 15, 2021 at 2:13 AM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > It may be worth reaching out to the Airflow project.  Based on
> > > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > > it seems they have been investing time into figuring out how to make
> > > self-hosted runners work (it seems GitHub's patching model makes this
> > > somewhat difficult).
> >
> > We tried to use GitHub Actions self-hosted runners previously. Even
> > though Airflow manages to mitigate the security issues of the
> > self-hosted runners (which actually affect all hosted-agent-based CIs,
> > like Buildkite, as well), registering and managing GitHub agents requires
> > admin privileges on the repository, which we don't have.
> > In order to register a GitHub self-hosted runner we need to exchange
> > registration tokens with the Apache INFRA team for each agent instance.
> > Further issues:
> > - a registration token expires in an hour
> > - troubleshooting the agent<->GitHub communication is not possible
> > without involving additional INFRA roundtrips.
> >
> > >
> > > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > >
> > > > Hi Krisztian,
> > > >
> > > > Thanks for bringing this up.  This is definitely becoming a
> > > > high-priority topic for Arrow development.
> > > >
> > > > I don't believe there is much opportunity for reducing the number of
> > > > builds or their runtime.  We simply have a lot of development going on,
> > > > and the number of different CI jobs we have is simply because we need to
> > > > support many different configurations (and past experience has shown
> > > > that they quickly stop working if we don't monitor them on a regular
> > > > basis).
> > > >
> > > > So I think the only path forward is to build up (== buy, probably) our
> > > > own execution resources for CI.  Whether that entails using Github
> > > > self-hosted runners, Buildkite, or yet another system, I have no idea.
> > > >
> > > > I'll submit two requirements though:
> > > > - the configuration for CI builds must be kept in the Arrow repository
> > > >    (as they are currently in .github, etc.)
> > > > - CI builds must be runnable from PRs
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 15/04/2021 at 00:14, Krisztián Szűcs wrote:
> > > > > Hi,
> > > > >
> > > > > The Apache Github Actions agent pool seems to be oversubscribed as
> > > > > more Apache projects migrate their CI setup to GHA. We experienced
> > > > > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > > > > but now we are roughly 5hrs behind [1].
> > > > >
> > > > > Based on other projects' complaints and discussions [2][3] (I don't
> > > > > have all the links at hand) we can't expect a short term solution from
> > > > > infra. I think we *need* to figure out something on the project level
> > > > > instead to maintain the overall project health and to improve the
> > > > > development velocity.
> > > > >
> > > > > I don't have a concrete proposal at the moment, but we should start to
> > > > > collect the available options. Ideas?
> > > > >
> > > > > Thanks, Krisztian
> > > > >
> > > > > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > > > [2]: https://github.com/apache/pulsar/issues/9154
> > > > > [3]: https://issues.apache.org/jira/browse/SPARK-34053
> > > > >
>
