potiuk commented on a change in pull request #15187:
URL: https://github.com/apache/airflow/pull/15187#discussion_r607024725



##########
File path: scripts/ci/docker-compose/integration-trino.yml
##########
@@ -17,10 +17,10 @@
 ---
 version: "2.2"
 services:
-  presto:
-    image: apache/airflow:presto-2020.10.08
-    container_name: presto
-    hostname: presto
+  trino:
+    image: apache/airflow:trino-2021.04.04

Review comment:
       Thanks @dungdm93 for the comment. We had similar discussions in the past, and I took the position that we should have copies of all the CI images in the airflow repo. And now it's not only my strong opinion - we have proof that it was a good decision.
   
   First of all - beware, this is a long answer :). That is only because I am trying to explain it and write it down for myself as well - I am just about to start a similar project in my daily work, so I am using this opportunity to make my reasoning clear, get some good arguments written down, and wrap my thoughts around it.
   
   Also @kaxil @ashb @dimberman @turbaszek @mik-laj -> I think it might be good if you read this; it might be an interesting study for the future of any of the projects/companies we work on together :).
   
   Why do we want to have our own copies of 3rd-party non-official images?
   
   The Presto/Trino situation has shown exactly why my decision to copy the images was a good one. One day, out of the blue, the original `prestosql/prestodb` image disappeared. This already happened - we witnessed it. We do not know exactly when - it was some time between Jan and April 2021. But it did disappear one day.
   
   But for us ... it did not matter. We were completely unaffected by it. Our CI continued to work. No disruption. We even added features to the Presto provider without any need to change the CI process at all. This is the effect. And I foresaw it when I migrated all the CI images which were not "official" images to the ones in the `apache/airflow` registry.
   
   Usually the CI system is maintained by a handful of people and used by many, and if it breaks for whatever reason, everyone is affected but only that handful of people can fix it. Everyone wants to use the CI process, but very, very, very few have the time, skills and will to understand how it works. And when it breaks - literally everyone around expects the people who maintain the CI to fix it. No matter what language it is, no matter how complex it is - people expect it to work, and when it does not, they would rather escalate the problem than try to understand the root cause and fix it.
   
   Note that this is not a complaint - not at all. This is just a statement of fact. I've seen it in every single place I've been - no matter if it was Jenkins, Travis, CircleCI, GitHub Actions, Bamboo, a custom CI, you name it. No matter if it was a couple of scripts in bash, a Python-based build system, or just a bunch of custom "yaml" CI-specific tags put together. It does not matter. Developers expect the CI system to work, and very, very, very rarely will they spend any time fixing problems there - because this is not their area of expertise and they do not interact with it on a daily basis. No matter how hard you try - for 20 years I've seen this happening (and I've seen it happening as the creator of such systems, as a user, as a manager of projects using them - you name it). So it's not just `Jarek's complex CI systems`. It's `all CI systems out there`. It's just a fact of life.
   
   In effect, this means that if there is a 3rd party that can break the CI, those handful of people can be put into 'emergency' mode by a 3rd-party decision, backed by all the users of the CI who will 'complain' and expect the fix to happen. Not good. So for me, the stability of the CI and its independence from 3rd parties trumps `ease of change`, and often it even trumps `simplicity`. Big time.
   
   So after this long reasoning - in the case of our own copies of the images - imagine what would have happened if we did not have our own image. All of a sudden, one day, without any warning, everyone's PRs to airflow would start to fail the moment the image was removed. And we would have to scramble to find a solution.
   
   When this happened, the Trino image was likely not yet available as a quick fix, or not mature enough. We would likely have had to reverse-engineer the Dockerfile commands from the original image, build our own, and deal with potential incompatibilities and missing configuration and entrypoint scripts that we would have had to extract from a copy of the image (if we were lucky enough to find an existing image somewhere). And it took quite a few hours (and this PR) to switch to the Trino image.
   
   Or maybe we would have had to dig through the git history of prestodb to find out where the scripts are and how they built the image - hoping that it was not a manual process, that it was well documented, and that it was actually reproducible. And all this in a big hurry, because of a number of people complaining that our integration tests were broken. Because of those complaints, that would likely mean some mitigation and updates to our CI process to exclude presto from our CI build, which would mean even more time spent in "emergency" mode by that handful of people. And often these things happen while those people are in the middle of an important work project they have to deliver, or have a difficult (or happy) family situation to handle, and they have no extra time or cycles to suddenly fix something that was caused by a 3rd party's decision to pull the image.
   
   It's the same reason why I usually do not pick up phone calls from unknown numbers. I do not want others to dictate how I spend my time, or other people's problems to become my emergencies. I saw a very nice quote recently: "Your lack of planning is not a reason for my emergency". This is the very thing I hate: when someone's bad decisions or lack of planning and communication cause my emergencies. And I will always fiercely and assertively oppose it, and will do a lot to prevent it.
   
   I do not want those things to happen. If I can avoid it and keep the CI under my project's full control, I will do it. I do not want 3rd-party decisions to impact our CI stability. Too many people depend on it, and we already have too many sources of instability - we are continuously fighting to get it under control.
   
   Yes - we have to rely on something as a base, and I can rely on, for example, `debian:buster-slim` or the `python` images. They are both "official" DockerHub images (https://docs.docker.com/docker-hub/official_images/), they are maintained by DockerHub people (on top of the original maintainers), and you can be sure they will be there when needed (barring temporary problems like those we experienced over the last 48 hours - but that was a temporary incident). They come with SLAs, and nobody will just `pull the plug` as happened with presto.
   
   Last point - when it comes to understanding/building the image - we are using a standard `FROM:` build, and I think it's super easy to build the image - simply run `./build_and_push.sh` (it fails on push if you have no rights to push). That's it. No more, no less. I used it when I iterated on the memory parameters, and while "slightly" annoying, it is actually a very simple and fast workflow (building and even pushing the image incrementally, when you have just changed the `entrypoint`, takes seconds, as only the last layer is affected). And this image and the entrypoint are not going to change more often than every few months (the last change to the presto image build was made on the 8th of Jan). Once we have the memory under control, there is no reason to modify the entrypoint other than regular refactoring/maintenance and the need to upgrade the base image version. So there is no real need to support the 'mounted script' case.
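   
   For illustration, here is a rough sketch of what that workflow boils down to (only a sketch under my assumptions - the actual Dockerfile and `build_and_push.sh` in the airflow repo may differ in details such as the base image tag and extra tagging/login steps):

```bash
#!/usr/bin/env bash
# Hypothetical minimal version of build_and_push.sh - illustrative only.
set -euo pipefail

# The Dockerfile next to the script is assumed to be a plain FROM build, roughly:
#   FROM trinodb/trino                 # upstream image (tag omitted here on purpose)
#   COPY entrypoint.sh /entrypoint.sh
#   ENTRYPOINT ["/entrypoint.sh"]

TAG="apache/airflow:trino-2021.04.04"

# Incremental build: if only entrypoint.sh changed, only the last layer is rebuilt.
docker build --tag "${TAG}" .

# Push to the apache/airflow registry - this is the step that fails
# if you do not have push rights.
docker push "${TAG}"
```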




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

