potiuk commented on a change in pull request #15187:
URL: https://github.com/apache/airflow/pull/15187#discussion_r607024725



##########
File path: scripts/ci/docker-compose/integration-trino.yml
##########
@@ -17,10 +17,10 @@
 ---
 version: "2.2"
 services:
-  presto:
-    image: apache/airflow:presto-2020.10.08
-    container_name: presto
-    hostname: presto
+  trino:
+    image: apache/airflow:trino-2021.04.04

Review comment:
    Thanks @dungdm93 for the comment. We had similar discussions in the past, and I took the position that we should have copies of all the CI images in the airflow repo. And now it's not only my strong opinion - we have proof that it was a good decision.
   
    First of all - beware, this is a long answer :). That is only because I am trying to explain it and write it down also for myself - I am just about to start a similar project in my daily work, so I am using this opportunity to make my reasoning clear and to have some good arguments written down and my thoughts wrapped around them.
   
    Also @kaxil @ashb @dimberman @turbaszek @mik-laj -> I think it might be good if you read this; it might be an interesting case study for the future of either of the companies we work on together :).
   
    Why do we want to have our own copies of 3rd-party, non-official images?
   
    The presto/trino situation has shown exactly why my decision to copy the images was a good one. One day, out of the blue, the original `prestosql/prestodb` image disappeared. This already happened - we witnessed it.
   
    But for us ... it did not matter. We were completely unaffected by it. Our CI continued to work. No disruption. We even added features to the presto provider without any need to change the CI process at all. This is the effect. And I foresaw it when I migrated all the CI images which were not "official" images to the ones kept in the apache/airflow registry.
   
    Usually the CI system is maintained by a handful of people and used by many, and if it breaks for whatever reason, everyone is affected but only that handful of people can fix it. Everyone wants to use the CI process, but very, very, very few have the time, skills and will to understand how it works. And when it breaks, literally everyone around expects the people who maintain the CI to fix it. No matter what language it is written in, no matter how complex it is - people expect it to work, and when it does not, they will escalate the problem rather than try to understand the root cause and fix it.
   
    Note that this is not a complaint - not at all. This is just a statement of fact. I've seen it in every single place I've been - no matter if it was Jenkins, Travis, CircleCI, GitHub Actions, Bamboo, a custom CI, you name it. No matter if it was a couple of scripts in bash, a python-based build system, or just a bunch of custom "yaml" ci-specific tags put together. It does not matter. The developers expect the CI system to work, and very, very, very rarely will they spend any time fixing problems there - because this is not their area of expertise and they do not interact with it on a "daily" basis. No matter how hard you try - for 20 years I've seen this happening (and I've seen it happening as creator of such systems, as user, as manager of projects using them - you name it). So it's not just `Jarek's complex CI systems`. It's `all CI systems out there`. It's just a fact of life.
   
    In effect, this means that if there is a 3rd party who can break the CI, that handful of people can be put into 'emergency' mode by a 3rd-party decision, backed by all the users of the CI who will 'complain' and expect the fix to happen. Not good. So for me, the stability of the CI and its independence from 3rd parties trumps 'easiness of changes' and sometimes even 'simplicity'. Big time.
   
    So after this long reasoning - in the case of our own copy of the images - imagine what would happen if we did not have our own image. All of a sudden, one day without any warning, all the PRs of all the people contributing to airflow would start to fail the moment the image was removed. And we would have to scramble to find a solution. When this happened, the trino image was likely not yet available as a quick fix, or not mature enough. And we would likely have to reverse engineer the dockerfile commands from the original image, build our own, and deal with potential incompatibilities and missing configuration and entrypoint scripts that we would have to extract from a copy of the image (if we were lucky enough to find an existing image somewhere). Or maybe we would have to dig through the git history of prestosql to find out where the scripts are and how they built the image - hoping that it was not a manual process, that it is well documented, and that it is actually reproducible. And all this in a big hurry, because of a number of people complaining that our integration tests are broken, so likely that would mean some mitigations and updates to our CI process to exclude presto from our CI build. And often those things happen while you are in the middle of an important work project that you have to deliver, and you have no extra time to suddenly fix something that was caused by a 3rd party's decision to pull the image.
   
    I do not want those things to happen. If I can avoid it and keep the CI under my project's full control, I'd do it. I do not want 3rd-party decisions to impact our CI stability. Too many people depend on it, and we already have too many reasons for instability - and we are continuously fighting to get it under control.
   
    Yes - we have to rely on something as a base, and I can rely on, for example, the 'debian:buster-slim' or `python` images. They are both "official" images on DockerHub (https://docs.docker.com/docker-hub/official_images/), they are maintained by the DockerHub people (on top of the original maintainers), and you can be sure they will be there when needed (barring temporary problems like those we experienced over the last 48 hours - but that was a temporary incident). They come with SLAs, and nobody will just 'pull them' as happened with presto.
   
    Last point - when it comes to understanding/building the image - we are using a standard FROM-based build, and I think it's super easy to build the image - simply run `./build_and_push.sh` (it fails on the push if you have no rights to push it). That's it. No more, no less. I used it when I iterated on the memory parameters, and while "slightly" annoying, it is actually a very simple and fast workflow (building and even pushing the image incrementally, when you have just changed the `entrypoint`, takes seconds, as only the last layer is affected). And this image and the entrypoint are not going to change more frequently than every few months (the last change to the presto image building was done on the 8th of Jan). Once we have the memory under control, there is no reason to modify the entrypoint other than regular refactoring/maintenance and the need to upgrade the base image version. So there is no real need to support the 'mounted script' case.
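   To illustrate the FROM-based build and the "only the last layer changes" point above, here is a minimal sketch of what such an image definition could look like. This is an illustration only, not the actual file from the airflow repo - the base image tag, file names and config path here are assumptions:

```Dockerfile
# Illustrative sketch only - assumed names and paths, not the real scripts/ci dockerfile.
# Pin the upstream release explicitly so our copy never changes under us.
FROM trinodb/trino:354

# Configuration tweaks (e.g. memory settings) go into earlier layers,
# so they stay cached when only the entrypoint changes.
COPY config.properties /etc/trino/config.properties

# The entrypoint is copied last: editing it invalidates only this final layer,
# so rebuilding and pushing after an entrypoint change takes seconds.
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```

   For an image laid out like that, a `build_and_push.sh` would essentially boil down to something like `docker build -t apache/airflow:trino-2021.04.04 .` followed by `docker push apache/airflow:trino-2021.04.04`, with only the changed layers being rebuilt and uploaded.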




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

