potiuk commented on code in PR #59: URL: https://github.com/apache/airflow-ci-infra/pull/59#discussion_r1749199527
##########
runner/Dockerfile:
##########
@@ -0,0 +1,35 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+FROM ghcr.io/actions/actions-runner:latest
+
+USER root
+
+RUN apt-get update \
+    && apt-get install -y --no-install-recommends \
+    ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat \

Review Comment:

Short answer: I think emulation as an option should remain.

Long answer: here (and this is already part of the "CI/CD" knowledge transfer :) ) let me explain why I think it should.

I prefer - if possible - to have a manual backup for all the "automated" processes we have in CI. Generally, my "guiding principle" for any kind of CI work like this is to NEVER rely 100% on a particular CI doing the job exclusively. I simply hate being put in a situation where "something else" fails and the only answer is "we have to wait for them to fix it". That is fine for temporary disruptions, but for processes like releasing - which we often do under time pressure, with users expecting the new release to come out quickly - I prefer to rely on third parties as little as possible.

We actually saw this in the last few weeks, when the CI workflow to release RC packages was broken: we were able to release Airflow without having to find a "proper" fix immediately only because we had a manual process, where you could either use hardware (if you happen to have two machines handy) or emulation (which @kaxil used) - even if it means the build takes hours instead of minutes. We had plan B, and plan C - which involved not only someone (like me) who has the right hardware setup, but also someone who has just a local machine and good networking, and can "fire-and-forget" a process that runs for an hour rather than 10 minutes, without any special environment configuration.

BTW, in this case - even though I could have helped with the hardware setup, my AMD Linux workstation at home ACTUALLY BROKE last week and I only got it back on Friday :D... So ALWAYS HAVE PLAN B (and C, and sometimes D)... That allows the CI team to sleep better :)
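For those following along: "emulation" here means building the non-native architecture via QEMU on a single machine. A minimal sketch of that path looks roughly like the commands below - the image tag and builder name are placeholders, not our actual release commands; the maintained procedure is in the MANUALLY_BUILDING_IMAGES.md guide linked further down.

```bash
# Rough sketch of the "emulation" fallback: build arm64 on an x86_64 machine via QEMU.
# Placeholders: the builder name and image tag below are made up for illustration only.

# Register QEMU binfmt handlers so the x86_64 host can execute arm64 binaries
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create a buildx builder capable of multi-platform builds and make it the default
docker buildx create --name multi_arch_builder --use

# Build both platforms on one machine - the arm64 half runs under emulation,
# so expect hours rather than minutes (add --push to publish to a registry)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag example.org/airflow/runner:latest \
  .
```

That is the "hours instead of minutes" trade-off: one machine, no extra hardware, just patience.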
In our case we have manual processes for all the things that CI jobs currently do automatically, and there is not a single part of the process that relies exclusively on GitHub Actions CI doing the job:

* First of all - the commands we run in CI are not "GitHub Actions"-exclusive (except for some environment setup); most of the important actions are `breeze` commands that can be run manually, locally. This is the main reason why we have `breeze` in the first place - and it is nicely captured in this "Architecture Decision Record": https://github.com/apache/airflow/blob/main/dev/breeze/doc/adr/0002-implement-standalone-python-command.md - it basically means that if you look at each step of every CI job and replicate it locally, you should be able to get the same result. So if - for whatever reason - our CI stops working (say the ASF limits our processing time and we have no money for "self-hosted" runners in AWS), we will be able to replicate - slower and more painfully - what now happens in CI, manually.

* Secondly - for processes that are likely to fail for whatever reason, we describe the manual procedure in "step-by-step" guides explaining a) why we might need to do it, b) how to set up the environment, and c) how to run it as a "human". The processes currently described this way are:
  * https://github.com/apache/airflow/blob/main/dev/MANUALLY_BUILDING_IMAGES.md
  * https://github.com/apache/airflow/blob/main/dev/MANUALLY_GENERATING_IMAGE_CACHE_AND_CONSTRAINTS.md

So - whatever we do, we have to keep the "manual path" working as a backup plan. Using hardware is a bit problematic, because you have to have two machines (ARM and AMD) handy and connected - yes, it can be done using the cloud (of course), but ideally the fallback we keep is a local machine of one of the PMC members doing all of the above, so that we do not depend on GH Actions or even an AWS account being available. That's why emulation is going to stay - I think - as a backup plan.
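For completeness, the "two machines (ARM and AMD)" hardware path mentioned above is roughly the following kind of buildx setup - again only an illustrative sketch, with a made-up hostname and builder name; the real, maintained steps are in MANUALLY_BUILDING_IMAGES.md linked above.

```bash
# Rough sketch of the "hardware" path: one buildx builder with two native nodes.
# Placeholders: the SSH endpoint, builder name and image tag are made up for illustration.

# Local x86_64/AMD node
docker buildx create --name hardware_builder

# Append a remote ARM machine reachable over SSH as a second node
docker buildx create --append --name hardware_builder ssh://user@arm-box.example.org

# Each platform builds natively on the matching node - fast, but both machines
# have to be up and connected for the whole build
docker buildx build \
  --builder hardware_builder \
  --platform linux/amd64,linux/arm64 \
  --tag example.org/airflow/runner:latest \
  .
```

Fast when the second box is available - which, as noted above, is exactly the part you cannot always count on.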
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]