V0lantis commented on PR #35026:
URL: https://github.com/apache/airflow/pull/35026#issuecomment-1770403457
> I would love to learn more about your use case and see how and whether it can help at all.
The goal here is to try every possible way to improve the build time of
Airflow. We use a custom build, and in our CI the build doesn't use the
cached layers 😠, so we have an average build time of 8 minutes. This is not
much, but a lot of people work on our repository and we would like to improve
their developer experience as a whole.
> I am not sure if it changes anything and it adds some assumptions that the
image will be built several times on the same machine while also for some
reason it will not use the already cached layers.
Yes, in our CI the layers are not cached, even though we are using the
BuildKit experimental feature:
```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3
  with:
    version: latest
    driver-opts: image=moby/buildkit:latest
    buildkitd-flags: --debug
```
and in our `build` command:
```sh
docker buildx build .... \
  --cache-to=type=gha,scope=platform-$ARCH,mode=max \
  --cache-from=type=gha,scope=platform-$ARCH
```
> Could you please show some usage where it actually helps and improves things? Some benchmarks showing what you've done, with before/after numbers, would be good.
Ok, it's not a proper benchmark, I just used the bash `time` builtin, but
it can give you an idea.
### **Without** mounting cache:
```bash
$ time docker build \
  --build-arg PYTHON_BASE_IMAGE=python:3.11-slim-bullseye \
  --build-arg AIRFLOW_VERSION=2.6.3 \
  --build-arg AIRFLOW_HOME=/usr/local/airflow \
  --build-arg AIRFLOW_EXTRAS=async,celery,redis,google_auth,hashicorp,statsd \
  --build-arg 'ADDITIONAL_PYTHON_DEPS=apache-airflow-providers-amazon==8.3.1
    apache-airflow-providers-celery==3.2.1
    apache-airflow-providers-common-sql==1.6.0
    apache-airflow-providers-cncf-kubernetes==7.4.2
    apache-airflow-providers-datadog==3.3.1
    apache-airflow-providers-docker==3.7.1
    apache-airflow-providers-ftp==3.4.2
    apache-airflow-providers-github==2.3.1
    apache-airflow-providers-google==10.3.0
    apache-airflow-providers-http==4.4.2
    apache-airflow-providers-imap==3.2.2
    apache-airflow-providers-mysql==5.1.1
    apache-airflow-providers-postgres==5.5.2
    apache-airflow-providers-sftp==4.3.1
    apache-airflow-providers-slack==7.3.1
    apache-airflow-providers-ssh==3.7.1
    apache-airflow-providers-tableau==4.2.1
    apache-airflow-providers-zendesk==4.3.1
    apache-airflow-providers-salesforce==5.4.1' \
  --build-arg 'ADDITIONAL_RUNTIME_APT_DEPS=groff gettext git' \
  --build-arg AIRFLOW_CONSTRAINTS_LOCATION=/docker-context-files/constraints-airflow.txt \
  --build-arg DOCKER_CONTEXT_FILES=docker-context-files \
  --build-arg INSTALL_MYSQL_CLIENT=true \
  --build-arg INSTALL_MSSQL_CLIENT=false \
  --build-arg ADDITIONAL_DEV_APT_DEPS=pkgconf \
  --no-cache
33.74s user 15.55s system 11% cpu 7:01.16 total
```
In the above, all pip requirements are downloaded before being installed. In
my case I don't have a good internet connection at the moment, so it is even
slower for me than it would be for someone with a good connection.
### **WITH** mounting cache:
```bash
$ time docker build \
  ... \
  --no-cache
17.52s user 7.62s system 7% cpu 5:16.20 total
```
The `--no-cache` argument above does not affect the mounted pip cache.
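For context, the cache mount under discussion is BuildKit's `RUN --mount=type=cache` feature. A minimal sketch of the idea (the base image, cache path, and package are illustrative here, not the actual lines from Airflow's Dockerfile):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim-bullseye

# The pip download cache lives in a BuildKit cache mount on the builder, so
# it survives across builds even when --no-cache invalidates the layer cache.
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install "apache-airflow==2.6.3"
```

On a rebuild with `--no-cache`, the `RUN` step re-executes, but pip finds already-downloaded wheels in the mounted cache instead of fetching them again.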
> Quite for sure caching when installing `pip` (first part of your change)
has no effect.
Sure, that was a mistake on my side; I removed the line in 77c6813.
> This is something that is only done once to upgrade to latest version and
then (regardless of caching) `pip` does not get re-installed again, because it
is already in the target version. And (unless you have some other scenario in
mind) re-using cache will only actually work when you use the same docker
engine. And in this case - unless you change your docker file parameters, it is
already handled by docker layer caching. I.e. if you are building your docker
image twice in the same docker engine with the same parameters, then caching is
done at the "container layer" level. So when you run your build again, on the
same docker engine you should not see it trying to install Python again. You
should instead see that CACHED layer is used. And this is all without having to
mount cache volume.
>
> So I wonder why you would like to cache pip upgrade with cache.
>
> The second case might be a bit more interesting - but there also you will
see reinstallation only happening if you change requirements.txt. And yes - in
this case it will be a bit faster when you just add one requirement, but this
will also work for the same user when rebuilding the image over and over again
on the same machine, which is generally not something that you should use
airflow's dockerfile for (usually) - usually this can be done by extending the
image (`FROM apache/airflow`) and adding lines in your own Dockerfile (where
you could indeed use `--mount=type=cache` etc.).
>
> I wonder if you could describe your case in more detail and show some
benchmark example of how much time is saved for particular scenarios.
>
> I am not totally against it, but I would love to understand your case,
because a) maybe you do not gain as much as you think, or b) maybe you are
doing something that prevents you from using docker layer caching, or c) maybe
you are using the Dockerfile of Airflow in an unintended way.
>
> Looking forward to a more detailed explanation of your case.
>
> So caching `pip` install makes no sense at all.
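For reference, the image-extension approach suggested above could be combined with a cache mount along these lines. This is only a sketch: the tag, the provider package, and the `uid` are assumptions (the official image runs as the `airflow` user, whose default uid is 50000, so the cache mount target is that user's pip cache directory):

```dockerfile
# syntax=docker/dockerfile:1
FROM apache/airflow:2.6.3

# Extra dependencies are layered on top of the official image; the cache
# mount keeps pip's download cache between rebuilds on the same builder.
# uid=50000 makes the mount writable by the image's "airflow" user.
RUN --mount=type=cache,target=/home/airflow/.cache/pip,uid=50000 \
    pip install apache-airflow-providers-slack==7.3.1
```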
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]