I started looking at Buildkite and it would solve one large problem for the DataFusion/Ballista project. We really need to be running integration tests against large data sets (such as TPC-H at scale factor 100, about 100 GB) and self-hosted Buildkite makes this simple to accomplish.

I even have some modest hardware that I purchased specifically for this purpose, but I wasn't confident that I could set it up securely enough to protect against malicious code being submitted. However, if we implement the necessary GitHub hooks so that these builds only run after a committer adds an "ok to build" comment, then I think it would be fine. This is the approach used in Apache Spark.
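To make the "ok to build" idea concrete, here is a rough sketch of the check a hook could run before starting a build on self-hosted hardware. Everything in it is an assumption for illustration: the `Comment` type, the `is_build_approved` function, the committer allowlist, and the exact trigger phrase are hypothetical names, not part of any existing GitHub, Buildkite, or Spark tooling.

```python
from dataclasses import dataclass

# Hypothetical trigger phrase; the only requirement stated above is that
# a committer adds an "ok to build" comment.
TRIGGER_PHRASE = "ok to build"


@dataclass(frozen=True)
class Comment:
    author: str  # login of the person who wrote the comment
    body: str    # raw comment text


def is_build_approved(comments, committers):
    """Return True if any comment by a known committer contains the trigger phrase."""
    allowed = {name.lower() for name in committers}
    for comment in comments:
        if comment.author.lower() not in allowed:
            continue  # approvals from non-committers are ignored
        if TRIGGER_PHRASE in comment.body.lower():
            return True
    return False
```

A hook would call `is_build_approved(pr_comments, committers)` and only start the build when it returns True. A real implementation would probably also want to tie the approval to a specific commit, so that code pushed after the approval does not run unreviewed.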

On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.
>
> On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs
> <szucs.kriszt...@gmail.com> wrote:
> >
> > On Thu, Apr 15, 2021 at 2:13 AM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > It may be worth reaching out to the Airflow project. Based on
> > > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > > it seems they have been investing time into figuring how to make
> > > self-hosted runners work (it seems Github's patching model makes this
> > > somewhat difficult).
> >
> > We tried to use github actions self hosted runners previously. Even
> > though Airflow manages to harden the security issues of the self
> > hosted runners (which actually affects all hosted agent based CIs like
> > buildkite as well) registering and managing github agents require
> > admin privileges on the repository, which we don't have.
> > In order to register a github self hosted runner we need to exchange
> > registration tokens with the Apache INFRA team per agent instances.
> > Further issues:
> > - a registration token expires in an hour
> > - troubleshooting the agent<->github communication is not possible
> > without involving additional INFRA roundtrips.
> >
> > > On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou <anto...@python.org> wrote:
> > > >
> > > > Hi Krisztian,
> > > >
> > > > Thanks for bringing this up. This is definitely becoming a
> > > > high-priority topic for Arrow development.
> > > >
> > > > I don't believe there is much opportunity for reducing the number of
> > > > builds or their runtime. We simply have a lot of development going on,
> > > > and the number of different CI jobs we have is simply because we need to
> > > > support many different configurations (and past experience has shown
> > > > that they quickly stop working if we don't monitor them on a regular basis).
> > > >
> > > > So I think the only path forward is to build up (== buy, probably) our
> > > > own execution resources for CI. Whether that entails using Github
> > > > self-hosted runners, Buildkite, or yet another system, I have no idea.
> > > >
> > > > I'll submit two requirements though:
> > > > - the configuration for CI builds must be kept in the Arrow repository
> > > > (as they are currently in .github, etc.)
> > > > - CI builds must be runnable from PRs
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 15/04/2021 at 00:14, Krisztián Szűcs wrote:
> > > > > Hi,
> > > > >
> > > > > The Apache Github Actions agent pool seems to be oversubscribed as
> > > > > more Apache projects migrate their CI setup to GHA. We experienced
> > > > > pretty solid feedback times (~20-30m) when we originally moved to GHA
> > > > > but now we are roughly 5hrs behind [1].
> > > > >
> > > > > Based on other projects' complaints and discussions [2][3] (doesn't
> > > > > have all the links at hand) we can't expect a short term solution from
> > > > > infra. I think we *need* to figure out something on the project level
> > > > > instead to maintain the overall project health and to improve the
> > > > > development velocity.
> > > > >
> > > > > I don't have a concrete proposal at the moment, but we should start to
> > > > > collect the available options. Ideas?
> > > > >
> > > > > Thanks, Krisztian
> > > > >
> > > > > [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> > > > > [2]: https://github.com/apache/pulsar/issues/9154
> > > > > [3]: https://issues.apache.org/jira/browse/SPARK-34053