potiuk commented on a change in pull request #4938: [AIRFLOW-4117] Multi-staging Image - Travis CI tests [Step 3/3]
URL: https://github.com/apache/airflow/pull/4938#discussion_r299357268
##########
File path: Dockerfile
##########
@@ -278,42 +278,75 @@ RUN echo "Pip version: ${PIP_VERSION}"
RUN pip install --upgrade pip==${PIP_VERSION}
-# We are copying everything with airflow:airflow user:group even if we use root to run the scripts
+ARG AIRFLOW_REPO=apache/airflow
+ENV AIRFLOW_REPO=${AIRFLOW_REPO}
+
+ARG AIRFLOW_BRANCH=master
+ENV AIRFLOW_BRANCH=${AIRFLOW_BRANCH}
+
+ENV AIRFLOW_GITHUB_DOWNLOAD=https://raw.githubusercontent.com/${AIRFLOW_REPO}/${AIRFLOW_BRANCH}
+
+# We perform fresh dependency install at the beginning of each month from the scratch
+# This way every month we re-test if fresh installation from the scratch actually works
+# As opposed to incremental installations which does not upgrade already installed packages unless it
+# is required by setup.py constraints.
+ARG BUILD_MONTH
+
+# We get Airflow dependencies (no Airflow sources) from the master version of Airflow in order to avoid full
+# pip install layer cache invalidation when setup.py changes. This can be reinstalled from the
+# latest master by increasing PIP_DEPENDENCIES_EPOCH_NUMBER.
+RUN mkdir -pv ${AIRFLOW_SOURCES}/airflow/bin \
+ && curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.py >${AIRFLOW_SOURCES}/setup.py \
+ && curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.cfg >${AIRFLOW_SOURCES}/setup.cfg \
Review comment:
   It's really an optimisation only. If I copy setup.py from the local build context, any change to setup.py invalidates all the subsequent layers, which means that on every setup.py change we reinstall everything from scratch. This is both good and bad: good because the pip install really is done from scratch, bad because it takes a lot of time every time you add or modify even a single dependency. For a production image we should certainly skip this step and always reinstall from scratch. But for the CI image, and for the steps that follow such as Breeze or the pre-commit hooks, I think it makes perfect sense to optimise for build speed.
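   For comparison, this is roughly the conventional layout that causes the invalidation (a simplified sketch, not the actual Dockerfile; the extras name is illustrative):

    # Copy setup.py from the local build context. Any edit to setup.py
    # changes this layer, so the pip install layer below loses its cache
    # and re-runs from scratch.
    COPY setup.py setup.cfg ${AIRFLOW_SOURCES}/
    RUN pip install -e "${AIRFLOW_SOURCES}[devel_ci]"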
   In the pattern I propose, we already have all the "fresh" dependencies installed from master, and only as the next step do we upgrade/downgrade them as needed according to the actual setup.py in the sources. This is much faster than installing from scratch: build time goes down from roughly 4 minutes to under 1. This adds up in CI builds: if you have a setup.py change, then until that change is merged into master (and the DockerHub build completes), the pip install re-runs with every PR rebase.
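   Roughly, the layering I have in mind looks like this (a simplified sketch rather than the literal Dockerfile content; the extras name and paths are illustrative):

    # Step 1: pre-install dependencies from the setup.py fetched from master.
    # This layer stays cached regardless of local setup.py edits.
    RUN curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.py >${AIRFLOW_SOURCES}/setup.py \
        && curl -L ${AIRFLOW_GITHUB_DOWNLOAD}/setup.cfg >${AIRFLOW_SOURCES}/setup.cfg
    RUN pip install -e "${AIRFLOW_SOURCES}[devel_ci]"

    # Step 2: copy the real sources (including the local setup.py) and
    # install again. pip only upgrades/downgrades what actually changed,
    # so this step is fast.
    COPY . ${AIRFLOW_SOURCES}
    RUN pip install -e "${AIRFLOW_SOURCES}[devel_ci]"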
   The once-a-month full reinstall (which is now automated) is also good: it keeps us from accumulating incremental changes.
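   The mechanism is the BUILD_MONTH build arg above: when its value changes, the layers that follow it lose their cache, so the CI scripts only need to pass the current month (the exact invocation and date format here are illustrative):

    # Once a month the value changes, the dependency layers are rebuilt,
    # and the whole dependency set is installed fresh.
    docker build --build-arg BUILD_MONTH="$(date +%Y-%m)" -t airflow-ci .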
   I agree this optimisation is a bit novel/unusual, but so far I have found no problems with it: you need network access for pip to install anything anyway.
   What I can do as well is make failures in this step (and the following pip install) non-terminal. That way, even if GitHub is not accessible for example, the step simply fails silently. This is really an idempotent step, so the next step's pip install will work fine regardless of whether this "optimisation" step succeeds.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services