V0lantis commented on PR #35026:
URL: https://github.com/apache/airflow/pull/35026#issuecomment-1770403457

   > I would love to learn more about your use case and see how and whether it 
can help at all.
   
   The goal here is to explore every possible way to improve the build time of 
Airflow. We build a custom image, and in our CI the build doesn't reuse the 
cached layers 😭 so we have an average build time of 8 minutes. That is not huge, 
but a lot of people work on our repository and we would like to improve their 
developer experience as a whole.
   
   > I am not sure if it changes anything and it adds some assumptions that the 
image will be built several times on the same machine while also for some 
reason it will not use the already cached layers.
   
   Yes, in our CI the layers are not cached, even though we are using the 
buildkit experimental feature:
   
   ```
         - name: Set up Docker Buildx
           uses: docker/setup-buildx-action@v3
           with:
             version: latest
             driver-opts: image=moby/buildkit:latest
             buildkitd-flags: --debug
   ```
   
   and in our `build` command:
   
   ```sh
   docker buildx build .... \
     --cache-to=type=gha,scope=platform-$ARCH,mode=max \
     --cache-from=type=gha,scope=platform-$ARCH
   ```
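
   For comparison, the same cache configuration expressed with 
`docker/build-push-action` would look roughly like this (the action version and 
the `matrix.arch` scope value are illustrative, not our exact workflow):

   ```yaml
   # Hypothetical sketch: same gha cache settings via docker/build-push-action.
   # The scope must match between cache-from and cache-to (and across runs),
   # otherwise the exported cache is never found again.
   - name: Build image
     uses: docker/build-push-action@v5
     with:
       context: .
       cache-from: type=gha,scope=platform-${{ matrix.arch }}
       cache-to: type=gha,scope=platform-${{ matrix.arch }},mode=max
   ```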
   
   > Could you please show some usage where it actually helps and improves 
things? Some benchmarks showing what you've done, with before/after numbers, 
would be good.
   
   OK, it's not a proper benchmark (I just used the shell `time` builtin), but 
it can give you an idea.
   
   ### **Without** mounting cache:
   
   ```bash
   $ time docker build \
     --build-arg PYTHON_BASE_IMAGE=python:3.11-slim-bullseye \
     --build-arg AIRFLOW_VERSION=2.6.3 \
     --build-arg AIRFLOW_HOME=/usr/local/airflow \
     --build-arg AIRFLOW_EXTRAS=async,celery,redis,google_auth,hashicorp,statsd \
     --build-arg 'ADDITIONAL_PYTHON_DEPS=apache-airflow-providers-amazon==8.3.1 apache-airflow-providers-celery==3.2.1 apache-airflow-providers-common-sql==1.6.0 apache-airflow-providers-cncf-kubernetes==7.4.2 apache-airflow-providers-datadog==3.3.1 apache-airflow-providers-docker==3.7.1 apache-airflow-providers-ftp==3.4.2 apache-airflow-providers-github==2.3.1 apache-airflow-providers-google==10.3.0 apache-airflow-providers-http==4.4.2 apache-airflow-providers-imap==3.2.2 apache-airflow-providers-mysql==5.1.1 apache-airflow-providers-postgres==5.5.2 apache-airflow-providers-sftp==4.3.1 apache-airflow-providers-slack==7.3.1 apache-airflow-providers-ssh==3.7.1 apache-airflow-providers-tableau==4.2.1 apache-airflow-providers-zendesk==4.3.1 apache-airflow-providers-salesforce==5.4.1' \
     --build-arg 'ADDITIONAL_RUNTIME_APT_DEPS=groff gettext git' \
     --build-arg AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/constraints-airflow.txt \
     --build-arg DOCKER_CONTEXT_FILES=docker-context-files \
     --build-arg INSTALL_MYSQL_CLIENT=true \
     --build-arg INSTALL_MSSQL_CLIENT=false \
     --build-arg ADDITIONAL_DEV_APT_DEPS=pkgconf \
     --no-cache .

   33.74s user 15.55s system 11% cpu 7:01.16 total
   ```
   
   In the run above, all pip requirements are downloaded before being installed. 
In my case I don't have a good internet connection at the moment, so it is even 
slower than it would be for someone with a fast connection.
   
   ### **WITH** mounting cache:
   
   ```bash
   $ time docker build \
     ... \
     --no-cache .
   
   17.52s user 7.62s system 7% cpu 5:16.20 total
   ```
   
   Note that the `--no-cache` flag above does not affect the mounted pip cache.
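
   For reference, the kind of BuildKit cache mount this PR adds looks roughly 
like this (a minimal sketch, not the exact line from the Dockerfile):

   ```dockerfile
   # The mount persists pip's download/wheel cache on the builder across
   # builds, independently of layer caching, so even a --no-cache build can
   # reuse previously downloaded packages. (Illustrative sketch.)
   RUN --mount=type=cache,target=/root/.cache/pip \
       pip install -r requirements.txt
   ```

   Because the mount lives on the builder, it only helps when successive builds 
run on the same BuildKit instance.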
   
   > I am quite sure that caching when installing `pip` (the first part of your 
change) has no effect.
   
   Sure, that was a mistake on my side; I removed the line in 77c6813
   > This is something that is only done once to upgrade to latest version and 
then (regardless of caching) `pip` does not get re-installed again, because it 
is already in the target version. And (unless you have some other scenario in 
mind) re-using cache will only actually work when you use the same docker 
engine. And in this case - unless you change your docker file parameters, it is 
already handled by docker layer caching. I.e. if you are building your docker 
image twice in the same docker engine with the same parameters, then caching is 
done at the "container layer" level. So when you run your build again, on the 
same docker engine you should not see it trying to install Python again. You 
should instead see that CACHED layer is used. And this is all without having to 
mount cache volume.
   > 
   > So I wonder why you would like to cache pip upgrade with cache.
   > 
   > The second case might be a bit more interesting - but there also you will 
see reinstallation only happening if you change requirements.txt. And yes - in 
this case it will be a bit faster when you just add one requirement, but this 
will also work for the same user when rebuilding the image over and over again 
on the same machine, which is generally not something that you should use 
airflow's dockerfile for (usually) - usually this can be done by extending the 
image (`FROM apache/airflow`) and adding lines in your own Dockerfile (where 
you could use indeed --mount-type cache etc.).
   > 
   > I wonder if you could describe your case in more detail and show some 
benchmark example of how much time is saved for particular scenarios.
   > 
   > I am not totally against it, but I would love to understand your case, 
because a) maybe you do not gain as much as you think or b) maybe you are doing 
something that prevents you from using docker layer caching or c) maybe you are 
using the Dockerfile of Airflow in an unintended way.
   > 
   > Looking forward to a more detailed explanation of your case.
   > 
   > So caching `pip` install makes no sense at all.
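
   For completeness, the image-extension pattern suggested above would look 
roughly like this (the tag and package name are placeholders, and the 
`uid=50000` matches the default `airflow` user in the official image):

   ```dockerfile
   # Extend the official image rather than rebuilding it; the cache mount
   # speeds up repeated local rebuilds of this layer.
   FROM apache/airflow:2.6.3
   RUN --mount=type=cache,target=/home/airflow/.cache/pip,uid=50000 \
       pip install some-extra-provider
   ```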
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
