potiuk commented on code in PR #59: URL: https://github.com/apache/airflow-ci-infra/pull/59#discussion_r1741489405
##########
helm/values/gha-runner-scale-sets/runners.yaml:
##########
@@ -0,0 +1,49 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+---
+image: ghcr.io/apache/airflow-ci-infra/actions-runner:20240830-rc1
+runnerScaleSets:
+  arc-small-amd:
+    minRunners: 0
+    maxRunners: 30
+    size: small
+    arch: x64
+  arc-medium-amd:
+    minRunners: 0
+    maxRunners: 30
+    size: medium
+    arch: x64
+  arc-large-amd:
+    minRunners: 0
+    maxRunners: 30
+    size: large
+    arch: x64
+  arc-small-arm:
+    minRunners: 0
+    maxRunners: 30
+    size: small
+    arch: arm64
+  arc-medium-arm:
+    minRunners: 0
+    maxRunners: 30
+    size: medium
+    arch: arm64
+  arc-large-arm:
+    minRunners: 0
+    maxRunners: 30
+    size: large
+    arch: arm64

Review Comment:

Interesting. I think eventually we should run tests on both ARM and AMD nodes - this was actually a prerequisite for me to make ARM part of the "officially" supported images. Currently we explain that ARM images should only be used for development, but I always wanted to run the complete set of tests on both AMD and ARM to be able to say "yeah - both platforms are equally well tested".
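Just to illustrate what I mean (purely a sketch, not something in this PR - and I'm assuming the scale set names defined above end up being usable as `runs-on` labels, with the actual test command left as a placeholder):

```yaml
# Hypothetical workflow fragment - assumes the runner scale set names above
# are exposed as runs-on labels by ARC; the test step is only a placeholder.
name: example-arm-amd-tests
on: workflow_dispatch
jobs:
  tests:
    strategy:
      matrix:
        runner: [arc-small-amd, arc-small-arm]
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - name: Run tests on ${{ matrix.runner }}
        run: echo "run the test suite here"
```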
> Running 30 nodes for 3h costs the same as running 90 nodes for 1h. I don't say that we have to scale to 1k, but we shouldn't worry about the max number of nodes/pods, we just need to correctly configure the scaling of ARC and nodes.

If we consider cost alone - yes. But there are other factors in favour of using a smaller number of bigger machines instead of a bigger number of smaller ones. The number of workers - indeed, when we run our own nodes, it does not matter much whether we have 8x 1 CPU or 1x 8 CPU. Cost-wise it does not matter for raw "test execution". And I agree it would be really nice if we could get rid of all the parallelism I introduced to run tests and other things in parallel on a single instance. In that case, for example, the logging output would be much more readable (Breeze has a whole suite of utils that handles multiplexing and "folding" of the output from parallel processes, but due to the size of the output the GitHub interface is sometimes very sluggish and not convenient for looking at the output).

However, there are certain benefits of running "big" instances in our case:

1) For public runners - INFRA sets some [limits](https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1) on how many "jobs" each workflow should run in parallel (because it has a limited number of public runners donated by GitHub that are shared between all projects). That does not apply to our self-hosted runners of course, but in the case where we are using public runners with 2 CPUs, running 2 tests in parallel will speed up the tests without increasing the number of FT (full-time) runners we use. If you look at https://infra-reports.apache.org/#ghactions - Airflow is at 10 FT runners now (which is still well below the limit of 25 FT) - mostly because we have parallelism implemented.

2) Running tests in parallel on big machines can decrease the overall cost of the "cluster" a lot. When running tests and other Docker-bound tasks (such as documentation building), it takes some time to "warm up" the machine - i.e. install Breeze and all necessary dependencies, clean up the machine, and download all the images to be used (the Breeze CI image and the DB image needed during tests - Postgres/MySQL). This takes ~2-3 minutes in total, but more importantly it takes a lot of space. Some of that is parallelised (like downloading an image), so it will run faster on bigger instances, which means that the elapsed time of execution "per test type" will be smaller when we run more parallel tests. But more important is the overhead and space/memory used for the "common" image that is reused between all the parallel instances. Currently the tests are grouped onto an instance per "python version" and "database" - which means that the CI image and "database" image are downloaded once and used by all tests that run in parallel on that big instance. So when each "test type" takes (say) 8 minutes and we run it on a 16 CPU machine, we have to spend the 2-3 minutes of initialization only once - all 16 parallel tests will run on the pre-warmed machine using that single initialized "warmed" Docker instance. This is about 4 GB of disk space and cache in memory per "running instance". If, instead, we run 16x 1 CPU machines, then we will need 4 GB x 16 of disk space and memory to cache it. That means that effectively our cluster will be able to handle much more load, because otherwise each of the many small VMs would have to use that extra disk space/memory. Similarly, they share the same kernel - which uses the same memory - so overall, having a big machine running parallel tests on multiple Docker instances using the same images should be more optimised.

But we can test it once we have ARC in place as well, because there might be other factors (like imperfect parallelism, which has its limits) that might change the calculation - there is https://en.wikipedia.org/wiki/Gustafson%27s_law which explains that parallelism always has a limit, because not everything can be parallelised and some things are always serialised - so there is a theoretical limit to the speed-up (and cost savings) we can get by parallelising more.
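To put some very rough numbers on point 2 (using the estimates above - ~3 minutes of warm-up and ~4 GB of images per Docker host - so treat this as a sketch, not a measurement):

```
1x 16-CPU node:   warm-up paid once (~3 min) and ~4 GB of images pulled/cached once,
                  then 16 test types run in parallel on the pre-warmed machine
16x 1-CPU nodes:  16 x ~3 min = ~48 machine-minutes of warm-up and 16 x ~4 GB = ~64 GB
                  of images pulled and cached for the same set of test types
```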
