I looked into solutions for running CI/CD workers on GCP a (very) little
bit and just wanted to share some findings.

Appveyor claims it can auto-scale GCE instances [1], but I don't think it
would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
problem?

Buildkite has documentation about running agents on a scalable GKE
cluster [3], but unfortunately offers no way to auto-scale based on the
backlog. We could maybe roll our own/contribute something based on their
AWS scaler [4].

[1] https://www.appveyor.com/docs/byoc/gce/
[2] https://www.appveyor.com/pricing/
[3] https://buildkite.com/docs/agent/v3/gcloud#running-the-agent-on-google-kubernetes-engine
[4] https://github.com/buildkite/buildkite-agent-scaler

On Wed, Mar 11, 2020 at 7:49 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
>
> Google has offered a donation of GCP credits based on some estimates I made
> last year when we were facing Travis CI issues. I'm happy to try to do some
> integration work to help make this happen.
>
> For the other questions, I'm happy to do some research, but also happy if
> someone else would like to take up the work here. I think one blocker in
> the past has been restrictions from Apache Infra, is there any
> documentation on what is and is not supported on that front?
>
> Thanks,
> Micah
> On Wed, Mar 11, 2020 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > There has periodically been a discussion about employing dedicated
> > compute resources to serve our testing needs beyond what can be
> > accomplished in free / public CI services like GitHub Actions,
> > Appveyor, etc.
> > For example:
> >
> > * Workloads requiring a CUDA-capable GPU
> > * Tests requiring a lot of memory
> > * ARM architecture
> >
> > While physical machines can be hooked up to some CI/CD services like
> > Github Actions and Buildkite, I believe we should not be 100%
> > dependent on the availability of such hardware (the recent tornado in
> > Nashville is a good example of what can go wrong).
> >
> > At some point it will make sense to be able to provision cloud hosts
> > (either temporary spot instances or persistent nodes) to meet these
> > needs. This brings up several questions:
> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
> > * What kind of devops tooling would be appropriate to provision and
> > manage the instances, scaling up and down based on need?
> > * What CI/CD platform would be appropriate to dispatch work to the
> > cloud nodes (taking into consideration the high costs of sysadmin, and
> > seeking to minimize nodes sitting unused)?
> >
> > This will probably take time to work out and there is significant
> > engineering involved in achieving any solution, but it would be good
> > to have all the options on the table with a frank analysis of the
> > pros/cons and costs (both in money and volunteer time) involved.
> >
> > Thanks,
> > Wes