potiuk commented on code in PR #59:
URL: https://github.com/apache/airflow-ci-infra/pull/59#discussion_r1741489405


##########
helm/values/gha-runner-scale-sets/runners.yaml:
##########
@@ -0,0 +1,49 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+---
+image: ghcr.io/apache/airflow-ci-infra/actions-runner:20240830-rc1
+runnerScaleSets:
+    arc-small-amd:
+        minRunners: 0
+        maxRunners: 30
+        size: small
+        arch: x64
+    arc-medium-amd:
+        minRunners: 0
+        maxRunners: 30
+        size: medium
+        arch: x64
+    arc-large-amd:
+        minRunners: 0
+        maxRunners: 30
+        size: large
+        arch: x64
+    arc-small-arm:
+        minRunners: 0
+        maxRunners: 30
+        size: small
+        arch: arm64
+    arc-medium-arm:
+        minRunners: 0
+        maxRunners: 30
+        size: medium
+        arch: arm64
+    arc-large-arm:
+        minRunners: 0
+        maxRunners: 30
+        size: large
+        arch: arm64

Review Comment:
   Interesting. I think eventually we should run tests on both ARM and AMD 
nodes - this was actually a pre-requisite for me to make ARM part of the 
images "officially" supported. Currently we explain that ARM images should 
only be used for development, but I always wanted to run the complete set of 
tests on both AMD and ARM to be able to say "yeah - both platforms are 
equally well tested".
   
   > Running 30 nodes for 3h costs the same as running 90 nodes for 1h. I don't 
say that we have to scale to 1k, but we shouldn't worry about the max number of 
nodes/pods, we just need to correctly configure the scaling of ARC and nodes.
   
   If we consider cost alone - yes. But there are other factors in favour of 
using a smaller number of bigger machines instead of a bigger number of 
smaller ones. As for the number of workers - indeed, when we run our own 
nodes, it does not matter much whether we have 8 x 1 CPU or 1 x 8 CPU. 
Cost-wise it does not matter for the raw "test execution".
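   To spell out the node-hours arithmetic (a sketch, assuming roughly linear 
per-vCPU pricing - an assumption about the cloud provider, not something 
measured here):

   ```latex
   % Cost is proportional to node-hours at a fixed per-node hourly price p:
   30 \cdot 3\,\mathrm{h} \cdot p \;=\; 90 \cdot 1\,\mathrm{h} \cdot p
   % Likewise, per-vCPU pricing makes 8 x 1-CPU and 1 x 8-CPU cost the same:
   8 \cdot 1 \;=\; 1 \cdot 8 \ \text{vCPU-hours per hour of testing}
   ```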
   
   And I agree it would be really nice if we could get rid of all the 
parallelism introduced by my scripts that run tests and other things in 
parallel on a single instance. In that case, for example, the logging output 
is much more readable (Breeze has a whole suite of utils that handles 
multiplexing and "folding" the output from parallel processes - but due to 
the size of the output, the GitHub interface is sometimes very sluggish and 
not convenient for looking at the output).
   
   However, there are certain benefits to running "big" instances in our case:
   
   1) For public runners - INFRA sets some 
[limits](https://lists.apache.org/thread/6qw21x44q88rc3mhkn42jgjjw94rsvb1) on 
how many "jobs" each workflow should run in parallel (because it has a 
limited number of public runners donated by GitHub that are shared between 
all projects). This does not apply to our self-hosted runners of course, but 
where we use public runners with 2 CPUs, running 2 tests in parallel will 
speed up the tests without increasing the number of FT (full-time) runners 
we use. If you look at https://infra-reports.apache.org/#ghactions - Airflow 
is at 10 FT runners now (which is still well below the limit of 25 FT) - 
mostly because we have parallelism implemented.
   
   2) Running tests in parallel on big machines can decrease the overall 
cost of the "cluster" a lot. When running tests and other Docker-bound tasks 
(such as documentation building), it takes some time to "warm up" the 
machine - i.e. install Breeze and all the necessary dependencies, clean up 
the machine, and download all the images to be used (the Breeze CI image and 
the DB image needed during tests - Postgres/MySQL). This takes ~2-3 minutes 
in total, but more importantly it takes a lot of space.
   
   Some of that is parallelised (like downloading an image), so it will run 
faster on bigger instances, which means that the elapsed execution time "per 
test type" will be smaller when we run more parallel tests. But more 
important is the overhead and the disk space/memory used for the "common" 
image that is reused between all the parallel instances. Currently the tests 
are grouped per instance by "Python version" and "database" - which means 
that the CI image and the "database" image are downloaded once and used by 
all tests that run in parallel on that big instance.
   
   So when each "test type" takes (say) 8 minutes - if we run it on a 16 CPU 
machine, we have to spend the 2-3 minutes of initializaiton only once - all the 
16 parallel tests will run on the pre-warmed machine using that single 
initialized "warmed" docker instance. This is about 4 GB of disk space and 
cache in memory per "running instance". if - instead - we run 16 x 1 CPU 
machines - then we will need 4GB x 16 disk space and memory to cache it. That 
means that effectively our cluster will be able to handle much more load 
because a lot of the small VMs running will have to use more disk pace/memory.  
Similarly they use the same kernel - which uses the same memory, so overall 
having a big machine running parallel tests on muliple docker instances using 
the same images should be more optimised.
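   A back-of-the-envelope sketch of that trade-off (purely illustrative - 
the constants are the rough figures from above, not measurements):

   ```python
   # Compare "1 big machine" vs "N small machines" for N parallel test types.
   # Illustrative numbers taken from the comment above.

   WARMUP_MIN = 2.5   # install Breeze, pull CI + DB images (~2-3 minutes)
   IMAGE_GB = 4       # disk/memory taken by the shared "warmed" images
   TEST_MIN = 8       # duration of a single "test type"
   N = 16             # number of parallel test types

   # One N-CPU machine: warm up once, store the images once, run N tests
   # in parallel on the pre-warmed machine.
   big_cpu_minutes = N * (WARMUP_MIN + TEST_MIN)
   big_disk_gb = IMAGE_GB

   # N 1-CPU machines: each one warms up and stores its own image copy.
   small_cpu_minutes = N * (WARMUP_MIN + TEST_MIN)
   small_disk_gb = N * IMAGE_GB

   print(f"big:   {big_cpu_minutes} CPU-minutes, {big_disk_gb} GB of images")
   print(f"small: {small_cpu_minutes} CPU-minutes, {small_disk_gb} GB of images")
   # CPU-minutes come out the same under these assumptions, but the
   # small-machine layout needs N times the disk/memory for the same images
   # (64 GB vs 4 GB here) - and in practice the big machine also warms up
   # faster, because image pulls are parallelised.
   ```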
   
   But we can test it once we have ARC in place as well, because there might 
be other factors (like imperfect parallelism, which has its limits) that 
might change the calculation. There is 
https://en.wikipedia.org/wiki/Gustafson%27s_law - which (together with the 
closely related Amdahl's law) explains that parallelism always has a limit, 
because not everything can be parallelised and some things are always 
serialized - so there is a theoretical limit on the speed-up (and cost 
optimisation) we can get by parallelising more.
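   For reference, a sketch of the standard formulas (with s the serial 
fraction of the work and N the number of parallel workers):

   ```latex
   % Amdahl's law: speed-up of a fixed workload is capped at 1/s.
   S_{\mathrm{Amdahl}}(N) = \frac{1}{s + \frac{1-s}{N}}
     \;\xrightarrow{\,N \to \infty\,}\; \frac{1}{s}

   % Gustafson's law: scaled speed-up when the workload grows with N.
   S_{\mathrm{Gustafson}}(N) = N - s\,(N - 1)
   ```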
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
