On Thu, Apr 15, 2021 at 11:53 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> I started looking at Buildkite and it would solve one large problem for the
> DataFusion/Ballista project. We really need to be running integration tests
> against large data sets (such as TPC-H @ SF=100GB) and self-hosted
> Buildkite makes this simple to accomplish. I even have some modest hardware
> that I purchased specifically for this purpose, but I wasn't confident that
> I could set this up in a secure way that would protect against malicious
> code being submitted. However, if we implement the necessary GitHub hooks

We don't need additional hooks for this particular use case; see the
explanation below. INFRA does, however, need to configure hooks for each
repository we want to receive commit events from. For apache/arrow we
have already hooked up a Buildkite instance at [3]; the same should be
done for the new repositories as well.
> so that these builds only run after a committer adds an "ok to build"
> comment then I think it would be fine. This is the approach used in Apache
> Spark.

The build needs to query the pull request data from the GitHub API
(since the event payload is not available by default on Buildkite).
There is a field called author association [2] which contains the
necessary information to decide whether a pull request's author is
trustworthy. We already use the same mechanism [1] to handle the
comment bot (@github-actions) requests. Therefore we don't need to
explicitly mark a PR as "ok to build", which spares a manual step.

[1]: https://github.com/apache/arrow/blob/master/dev/archery/archery/bot.py#L98
[2]: https://docs.github.com/en/graphql/reference/enums#commentauthorassociation
[3]: https://buildkite.com/apache-arrow

>
> On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I think we should take a more serious look at Buildkite for some of our CI.
> >
> > * First of all, it's very easy to connect self-hosted workers and
> > supports ephemeral cloud workers in a way that would be difficult or
> > impossible with GHA. No need to have Infra fiddle with the admin
> > dashboard. So we could spin up extra workers during peak hours, or use
> > autoscaling to respond to demand.
> >
> > * We can set up more complex / dependent job pipelines rather than the
> > current GHA monolithic "long list of independent jobs" setup. For
> > example, we could have a fast gatekeeper job for C++ builds (which
> > lints and makes sure that everything compiles) that must pass before
> > more exhaustive longer-running jobs run.
> >
> > On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
> > <szucs.kriszt...@gmail.com> wrote:
> > >
> > > On Thu, Apr 15, 2021 at 2:13 AM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > It may be worth reaching out to the Airflow project.
> > > > Based on
> > > > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > > > it seems they have been investing time into figuring out how to make
> > > > self-hosted runners work (it seems GitHub's patching model makes this
> > > > somewhat difficult).
> > >
> > > We tried to use GitHub Actions self-hosted runners previously. Even
> > > though Airflow manages to address the security issues of self-hosted
> > > runners (which actually affect all hosted-agent-based CIs like
> > > Buildkite as well), registering and managing GitHub agents requires
> > > admin privileges on the repository, which we don't have.
> > > In order to register a GitHub self-hosted runner we need to exchange
> > > registration tokens with the Apache INFRA team for each agent instance.
> > > Further issues:
> > > - a registration token expires in an hour
> > > - troubleshooting the agent<->GitHub communication is not possible
> > > without additional INFRA round trips.
> > >
> > > >
> > > > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou <anto...@python.org> wrote:
> > > > >
> > > > >
> > > > > Hi Krisztian,
> > > > >
> > > > > Thanks for bringing this up. This is definitely becoming a
> > > > > high-priority topic for Arrow development.
> > > > >
> > > > > I don't believe there is much opportunity for reducing the number of
> > > > > builds or their runtime. We simply have a lot of development going on,
> > > > > and the number of different CI jobs we have is simply because we need to
> > > > > support many different configurations (and past experience has shown
> > > > > that they quickly stop working if we don't monitor them on a regular basis).
> > > > >
> > > > > So I think the only path forward is to build up (== buy, probably) our
> > > > > own execution resources for CI. Whether that entails using GitHub
> > > > > self-hosted runners, Buildkite, or yet another system, I have no idea.
> > > > >
> > > > > I'll submit two requirements though:
> > > > > - the configuration for CI builds must be kept in the Arrow repository
> > > > > (as it currently is in .github, etc.)
> > > > > - CI builds must be runnable from PRs
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On 15/04/2021 at 00:14, Krisztián Szűcs wrote:
> > > > > > Hi,
> > > > > >
> > > > > > The Apache GitHub Actions agent pool seems to be oversubscribed as
> > > > > > more Apache projects migrate their CI setup to GHA. We experienced
> > > > > > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > > > > > but now we are roughly 5hrs behind [1].
> > > > > >
> > > > > > Based on other projects' complaints and discussions [2][3] (I don't
> > > > > > have all the links at hand) we can't expect a short-term solution from
> > > > > > INFRA. I think we *need* to figure out something on the project level
> > > > > > instead to maintain the overall project health and to improve the
> > > > > > development velocity.
> > > > > >
> > > > > > I don't have a concrete proposal at the moment, but we should start to
> > > > > > collect the available options. Ideas?
> > > > > >
> > > > > > Thanks, Krisztian
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > > > > [2]: https://github.com/apache/pulsar/issues/9154
> > > > > > [3]: https://issues.apache.org/jira/browse/SPARK-34053
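
For illustration, the author-association check described in the reply at the top of this message could look roughly like the Python sketch below. It is only a sketch under stated assumptions, not the code Arrow actually runs: the trusted set, the function names, and the GITHUB_TOKEN variable are hypothetical, and which associations to trust is a policy decision for the project. It relies on the GitHub REST pull request object exposing an author_association field and on Buildkite setting BUILDKITE_PULL_REQUEST to the PR number (or "false" for non-PR builds).

```python
import json
import os
import urllib.request
from typing import Optional

# Assumption: an illustrative policy, not necessarily the set Arrow uses.
TRUSTED_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}


def is_trusted(pull_request: dict) -> bool:
    """Return True if the PR author's association is in the trusted set."""
    return pull_request.get("author_association") in TRUSTED_ASSOCIATIONS


def fetch_pull_request(repo: str, number: str, token: Optional[str] = None) -> dict:
    """Fetch pull request metadata from the GitHub REST API."""
    url = f"https://api.github.com/repos/{repo}/pulls/{number}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main() -> int:
    """Return 0 if the build may proceed, 1 if the PR author is not trusted."""
    # Buildkite sets this to the PR number, or "false" for non-PR builds.
    number = os.environ.get("BUILDKITE_PULL_REQUEST", "false")
    if number == "false":
        return 0
    # GITHUB_TOKEN is a hypothetical variable name for an optional API token.
    pr = fetch_pull_request("apache/arrow", number, os.environ.get("GITHUB_TOKEN"))
    return 0 if is_trusted(pr) else 1
```

An early pipeline step could call main() and use its return value as the exit code, so that the heavier jobs on self-hosted agents only start for trusted authors.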