potiuk commented on a change in pull request #15187:
URL: https://github.com/apache/airflow/pull/15187#discussion_r607024725
##########
File path: scripts/ci/docker-compose/integration-trino.yml
##########
@@ -17,10 +17,10 @@
---
version: "2.2"
services:
- presto:
- image: apache/airflow:presto-2020.10.08
- container_name: presto
- hostname: presto
+ trino:
+ image: apache/airflow:trino-2021.04.04
Review comment:
Thanks @dungdm93 for the comment. We had similar discussions in the
past, and I took the position that we should have copies of all the CI images
in airflow repo. And now it's not only my strong opinion but we have a proof
that it was a good decision.
First of all - Beware, this is a long answer :). This is only because I try
to explain and write it down also for myself - I am just about to start a
similar project in my daily work so I use that opportunity to make my reasoning
clear and have some good arguments written down and thoughts wrapped around it.
Also @kaxil @ashb @dimberman @turbaszek @mik-laj -> I think it might be good
you read that, that might be an interesting study for the future of either of
the projects/companies we work together on :).
Why do we want to have our ow images for 3rd-party non-official images ?
The Presto/Trino situation had shown exactly why my decision to copy the
images was good. One day out-of-the-sudden, the original `prestosql/prestodb`
image disappeared. This already happened - we witnessed it. We do not know when
it happened - it was some time between Jan and April 2021. But it did disappear
one day.
But for us ... it did not matter. We were completely unaffected by it. Our
CI continued to work. No disruption. We even added features to presto provider
without any need to change the CI process at all. This is the effect. And I
have foreseen it when I migrated all the CI images which were not "official"
images to use the ones that are in `apache/airflow` registry.
Usually the CI system is maintained by a handful of people and used by many
and if it breaks for whatever reason, everyone is affected but only a handful
of people can fix it. Everyone wants to use the CI process, but very, very,
very few has the time, skills and will to understand how it works. And when it
breaks - it's like literally everyone around expects those people who maintain
the CI to fix it. No matter what language it is, no matter how complex it is -
people expect it to work and when it does not, they will rather escalate the
problem rather than try to understand the root cause and fix it.
Note, that this is not a complaint - not at all. This is just a statement of
fact. I've seen it in every single place I've been - no matter if it was
Jenkins, Travis, CircleCI, GitHub Actions, Bamboo, custom CI, you name it. No
matter if it was a couple of scripts in bash, python-based build system, just a
bunch of custom "yaml" ci-specific tags put together. It does not matter. The
developers expect the CI system to work and very, very, very rarely they will
spend any time on fixing any problems there - because this is not their area of
expertise and they do not interact with it on a "daily" basis. No matter how
hard you try - for 20 years I've seen this happening (and I've seen it
happening as creator of such systems, user, manager of projects using those -
you name it). So it's not just `Jarek's complex CI systems`. It's `all CI
Systems out there`. I'ts just a fact of life.
In effect, this means that if there is a 3rd-party who can break the CI,
those handful of people can be put in 'emergency' mode by a 3rd-party decision
backed by all the users of CI who will 'complain' and expect the fix to happen.
Not good. So for me, the stability of the CI and it's independence from 3rd
parties trumps `easiness of changes` and often even it trumps `simplicity`.
Big time.
So after this long reasoning - in case of our own copy of images - imagine
what happens if we did not have our own image. Out of sudden, one day without
any warning, all the PRs of all the people to airflow would start to fail the
moment the image was removed. And we would have to scramble to find a
solution.
When this happened, the Trino image was likely not yet available as quick
fix or mature enough. And we would likely have to reverse engineer Dockerfile
commands from the original image, build our own, deal with potential
incompatibilities and missing configuration and entrypoint scripts that we
would have to extract from a copy of the image (if we were lucky enough to find
somewhere an existing image). And it took quite a few hours (and this PR) to
switch to Trino imge.
Or maybe we would have to dig through git history of the prestodb to find
out where are the scripts and how they built the image - hoping that it was not
a manual process, that it is well documented and that it is actually
reproducible. And all this in a big hurry, because of a number of people
complaining that our integration tests are broken. Because of those complaints,
likely that would mean some mitigation and updates to our CI process to exclude
presto from our CI build which will mean that even more time is needed in
"emergency" mode from those handful of people.. And often those things are
happening while those handful of people are in the middle of important work
project that you have to deliver, or they have difficult (or happy) family
situation to handle, and they have no extra time / cycles to suddenly fix
something that was caused by a 3rd-party's decision to pull the image.
It's the same reason why I am usually not picking up phone calls from
unknown numbers. I do not want others to impact how I spend my time, or how
other's problems are causing my emergencies. I saw a very nice quote recently
"Your lack of planning is not a reason for my emergency". This is the very
thing I hate a lot when someone's bad decisions or lack of planning and
communication and causing my emergencies. And I will always fiercely and
assertively oppose it, and will do a lot to prevent it.
I do not want those things to happen. If I can avoid it, and have the CI
under my project's full control, I'd do it. I do not want to depend on
3rd-party decisions to impact our CI stability. Too many people depend on it
and we have already too many reasons for instability - and continuously
fighting to get it under control.
Yes - we have to rely on something as a base and for example I can rely on
'debian:buster-slim', for example or `python` images. They are both "official"
images by DockerHub (https://docs.docker.com/docker-hub/official_images/), they
are maintained by DockerHub people (on top of the original maintainers) and you
are sure there will be there when needed (barring temporary problems like those
we experienced over the last 48 hours - but this was temporary incident). They
come with SLAs and nobody will just `pull the plug` as it happened with presto.
Last [point - when it comes to understanding/building the image - we are
using standard FROM: build and I think It's super easy to build the image -
simply run `./build_and_push.sh` (it fails on push if you have no rights to
push it). That's it. No more no less. I used it when I iterated with memory
parameters and while "slightly" annoying it is actually very simple and fast
workflow (building and even pushing the image incrementally, when you just
changed `entrypoint` takes seconds as only last layer is affected). And this
image and the entrypoint are not going to change more frequently than every few
months (last change to presto image building was done at 8th of Jan). There is
no reason other than regular refactor/maintenance and the need to upgrade the
version of base image to modify the entrypoint once we have the memory under
control. So no real need to support the 'mounted script' case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]