potiuk commented on a change in pull request #4938: [AIRFLOW-4117] 
Multi-staging Image - Travis CI tests [Step 3/3]
URL: https://github.com/apache/airflow/pull/4938#discussion_r299441892
 
 

 ##########
 File path: Dockerfile
 ##########
 @@ -278,42 +278,75 @@ RUN echo "Pip version: ${PIP_VERSION}"
 
 RUN pip install --upgrade pip==${PIP_VERSION}
 
-# We are copying everything with airflow:airflow user:group even if we use 
root to run the scripts
+ARG AIRFLOW_REPO=apache/airflow
+ENV AIRFLOW_REPO=${AIRFLOW_REPO}
+
+ARG AIRFLOW_BRANCH=master
+ENV AIRFLOW_BRANCH=${AIRFLOW_BRANCH}
+
+ENV 
AIRFLOW_GITHUB_DOWNLOAD=https://raw.githubusercontent.com/${AIRFLOW_REPO}/${AIRFLOW_BRANCH}
+
+# We perform fresh dependency install at the beginning of each month from the 
scratch
+# This way every month we re-test if fresh installation from the scratch 
actually works
+# As opposed to incremental installations which does not upgrade already 
installed packages unless it
+# is required by setup.py constraints.
+ARG BUILD_MONTH
+
+# We get Airflow dependencies (no Airflow sources) from the master version of 
Airflow in order to avoid full
+# pip install layer cache invalidation when setup.py changes. This can be 
reinstalled from the
+# latest master by increasing PIP_DEPENDENCIES_EPOCH_NUMBER.
+RUN mkdir -pv ${AIRFLOW_SOURCES}/airflow/bin \
+ && curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.py >${AIRFLOW_SOURCES}/setup.py \
+ && curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.cfg >${AIRFLOW_SOURCES}/setup.cfg 
\
 
 Review comment:
   It's not only for local development. It's also for CI builds. Just to 
explain what's going on here:
   
   The only way to move setup.py to docker context locally is to run COPY or 
ADD operation. Whenever setup.py changes and is part of the COPY/ADD then it 
means that Dockerfile gets invalidated from that point on.
   
   The case for CI this pattern solves is when your PR adds a new dependency.
   
   Without it - when you make A PR with dependency added in setup.py - the 
COPY/ADD causes that whole 'pip install' layer gets invalidated and 
re-installed from the scratch. This means that every job (pylint/mypy/tests 
etc.) will have to first perform the whole `pip install` from the scratch 
adding 4-5 minutes to the build time. 
   
   It is a bit strange pattern indeed. Simply because Docker build does not 
support this pattern yet :(. Howewer this change is not for ever and 
Docker/moby guys realise that. There are layer types (including cache 
specifically foreseen for that kind of behaviour) in the upcoming buildkit: 
https://github.com/moby/buildkit/blob/master/frontend/dockerfile/docs/experimental.md#run---mounttypecache
  - but it will likely take months to release stable version with buildkit that 
we can use (but when they will, I am super happy to update the docker image to 
follow it).
   
   Also we cannot do the single line you mentioned. In order to do that we 
would have to add `COPY .` (which we do eventually few lines below). Doing it 
will make it even worse - every single change to airflow sources will 
invalidate everything and will force 'pip install' to run from the scratch 
basically with every commit. That's why I separately COPY setup.py, setup.cfg 
bin/airflow first - because they do not change that often and they are needed 
to run the first pip install.
   
   I see your concern. While I understand it's not perfect solution, I think it 
is really good for CI optimisations until we have buildkit. What I can do 
instead - I can make those steps already controlled with 
"OPTIMISE_FOR_CI_BUILD" argument and skip those steps entirely when it is not 
set to "true" (which  I will do in the CI/breeze/pylint/mypy cases). Then the 
"non-CI-build-optimised" images will run just single "pip install" - the last 
one in the Dockerfile :). Also that will make it obvious that this is an 
optimisation rather than integral part of the "basic" Dockerfile.
   
   Would that be a good solution ? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to