potiuk commented on issue #59625:
URL: https://github.com/apache/airflow/issues/59625#issuecomment-3703788875

   > Hi! I’m interested in working on this. For hardened images, should we aim 
for XYZ base image? Are there specific guidelines we must follow?
   
   Basically the same images we have now (debian:bookworm) - but ideally we
should use "python images" and stop building Python on our own. (We recently
switched away from the community "official" python images in
https://github.com/apache/airflow/pull/53150, so we basically need to revert
that - but using the "hardened" images.)
   
   The images themselves (https://www.docker.com/products/hardened-images/) are
already based on debian - and all python versions seem to be available in
https://hub.docker.com/hardened-images/catalog/dhi/python/images. We want to
use the debian 12 (bookworm) based ones, and we do not want to use the FIPS
version because FIPS variants are not available under the free licence and
require a paid enterprise plan from Docker. However - users who want to rebuild
their images on top of FIPS-compliant base images should be able to do so, so
we should likely test it and describe it as an option in the "Build images"
docs.
   
   @shahar1 mentioned that they require sign-in, and you will need to do
`docker login dhi.io` with your account to work on those, but I think when we
decide to merge them we will have to set up a separate process to mirror the
images to "ASF" managed images, where we will make them publicly available so
that anyone can use them during development without needing to log in to
dhi.io with docker credentials. When you get to the point where you want to
test in our CI, let us know and we will do the mirroring manually, and once it
works we will automate the mirroring.
   
   Few caveats:
   
   * I think we should be able to simplify both Dockerfile.ci and Dockerfile -
because the images come in two variants: `dev` (with `-dev` suffix) and
`runtime` (without suffix) and we can utilise that to simplify our image
building. Currently what we are doing is:
   
   For CI image -> we use a single stage image (except scripts that are coming
from their own stage) -> and the image is based on the "debian-slim" image and
then we install build-essential and other development tools needed. Likely the
-dev image already has pretty much everything we need, so for CI image we
should use the `-dev` image and strip out installing most development
dependencies.
   
   For PROD image, this is a bit more complex (but also a nice optimisation for
us) - because we use multi-stage images. Stage 1 (`build`) installs (and
potentially builds) all Python packages, and then we copy those packages from
the `build` stage to the `main` stage via the `.local` folder - because we
install those packages with `pip install --user`. In the `main` stage we use
the `.local` folder as venv (works nicely) and we use the "airflow" user - we
also make the image openshift-compatible by using group `0` that all users
(including `airflow`) share and by automatically creating a user if a random
uid is used by openshift.
   
   So for PROD image we want to:
   
   * use -dev image for `build` stage - and also remove the unnecessary
installation of build-essential and similar dependencies.
   * use runtime image for `main` stage
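
   The stage-to-variant mapping above can be sketched as a small helper. This
is only an illustration: the registry host `dhi.io`, the `debian12` tag
component and the names `BaseImages` and `select_base_images` are assumptions
made for the example, so check the catalog for the real tag layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseImages:
    """Base images for the two stages of the PROD build."""

    build: str  # `-dev` variant: runs as root, expected to ship build tooling
    main: str   # runtime variant: no suffix, runs as `nonroot`


def select_base_images(
    python_version: str,
    registry: str = "dhi.io",  # assumption: registry host
    distro: str = "debian12",  # assumption: how the distro shows up in the tag
) -> BaseImages:
    """Return the image references for the `build` and `main` stages."""
    runtime = f"{registry}/python:{python_version}-{distro}"
    return BaseImages(build=f"{runtime}-dev", main=runtime)


if __name__ == "__main__":
    images = select_base_images("3.12")
    print("build stage:", images.build)
    print("main stage: ", images.main)
```

   The only point of the helper is that both stages derive from the same
runtime reference, so the `build` and `main` stages cannot drift to different
Python versions.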
   
   One small difficulty here is the question of users: the `-dev` images run as
root, but the runtime ones run as `nonroot`, which means that some user mapping
will have to happen, and (because the Python installation is not relocatable)
some things around user management might need fixing.
   
   Of course entrypoints might need some adjustments (we are using dumb-init) -
also we need to come back to passing "base image" through `build args` as it
was before #59517 - and this should be a way for our users to pass a `fips`
image as build arg if they want to make the image fips-compliant.
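
   To illustrate the build-arg idea, here is a minimal sketch that assembles a
`docker build` call overriding the base image. The build arg name
`PYTHON_BASE_IMAGE`, the image references and the helper name are assumptions
for the example, not the final interface.

```python
import shlex


def docker_build_command(
    base_image: str,
    tag: str,
    build_arg_name: str = "PYTHON_BASE_IMAGE",  # assumption: name of the build arg
    context: str = ".",
) -> list[str]:
    """Assemble a `docker build` invocation that overrides the base image."""
    return [
        "docker", "build",
        "--build-arg", f"{build_arg_name}={base_image}",
        "--tag", tag,
        context,
    ]


if __name__ == "__main__":
    # Hypothetical FIPS image reference; substitute whatever your plan provides.
    cmd = docker_build_command(
        "registry.example.com/python:3.12-fips", "my-airflow:fips"
    )
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(cmd))
```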
   
   The images also have something called `sfw` - secure firewall - which
apparently provides some (proxy based?) security when installing packages. I do
not know the details of it - but it also might have some impact.
   
   There is also an issue we discovered recently - the "official" images had
`.pyc` CPython files removed from the Python installation - which saved very
little image size, but caused an unexpected memory leak -> so we need to check
whether the hardened images have the same problem (and we might even report
that issue to Docker if they do) - the fix for it was
https://github.com/apache/airflow/pull/58944 - for that we might need to
compile all CPython files - both in the `build` and `main` stages - but this
might be somewhat more complicated for the nonroot main image (we would need to
switch to root temporarily).
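
   A quick way to check an image for the missing `.pyc` problem could look like
the sketch below, run with the image's own interpreter. It only uses the
standard library; the function names are made up for the example, and it is not
what the linked PR does.

```python
import compileall
import importlib.util
import sysconfig
from pathlib import Path


def missing_bytecode(root: Path) -> list[Path]:
    """Return .py files under `root` that have no cached .pyc for this interpreter."""
    missing = []
    for source in root.rglob("*.py"):
        cached = Path(importlib.util.cache_from_source(str(source)))
        if not cached.exists():
            missing.append(source)
    return missing


def compile_tree(root: Path) -> bool:
    """Byte-compile every module under `root`; needs write access to it."""
    return bool(compileall.compile_dir(str(root), quiet=1))


if __name__ == "__main__":
    stdlib = Path(sysconfig.get_paths()["stdlib"])
    count = len(missing_bytecode(stdlib))
    print(f"{count} stdlib modules without .pyc in {stdlib}")
```

   `compile_tree` writes into `__pycache__` directories next to the sources,
which is why it would have to run as root (or before dropping privileges) when
the target is the system Python installation.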
   
   Ideally - we should follow exactly what Hardened images do when it comes to
security - signing, sboms etc. but this could be a follow-up - we've never done
that for our images, but it's a great opportunity to just follow what Docker
does with their images.
   
   We should also compare sizes of images before and after the change to avoid
ballooning the size of the image.
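
   The size comparison could be scripted roughly like this. It is a sketch
only: the image tags and the 5% growth budget are placeholders picked for the
example, and it assumes both images are already present in the local Docker
daemon.

```python
import subprocess


def image_size_bytes(image: str) -> int:
    """Ask the local Docker daemon for the size of an image, in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def growth_percent(before: int, after: int) -> float:
    """Relative size change in percent; positive means the image grew."""
    return (after - before) * 100 / before


def within_budget(before: int, after: int, max_growth_percent: float = 5.0) -> bool:
    """True when the new image did not grow beyond the allowed percentage."""
    return growth_percent(before, after) <= max_growth_percent


if __name__ == "__main__":
    # Hypothetical tags for the current and the hardened build.
    old = image_size_bytes("airflow-prod:current")
    new = image_size_bytes("airflow-prod:hardened")
    print(f"{old} -> {new} bytes ({growth_percent(old, new):+.1f}%)")
```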
   
   That's about all the guidelines I can come up with off the top of my head.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
