potiuk commented on issue #4483: [AIRFLOW-3673] Add official dockerfile
URL: https://github.com/apache/airflow/pull/4483#issuecomment-453778598
 
 
  @Fokko @ffinfo  -> I see this is merged now. I will still try to convince 
you anyway :)
   
  I will soon open a PR to the main Apache line with some detailed 
calculations. I will prepare numbers and analysis, including download times 
and a simulation of how users actually pull the images - in order to show 
this rather than simply 'state' it.
   
  But for now, to answer your concerns @Fokko:
   
  1) Docker caching lasts basically forever. A layer won't 'expire' until you 
change a line in the Dockerfile, or a part of the build context that is added 
in the Dockerfile (for example the Airflow sources). What DockerHub does (at 
least this is best practice, and I am sure we can double-check with them) is 
pull the latest image from the same branch and use it as --cache-from, so it 
always reuses the layers that were already published. This is not 
local-machine caching - it uses published image layers as the cache.
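
   As an illustration of that flow (the image name and tag here are hypothetical, and the commands require a running Docker daemon), a CI build reusing published layers could look like:

   ```shell
   # Pull the most recently published image for this branch (if any),
   # so its layers are available locally as a cache source.
   docker pull apache/airflow:master || true

   # Build using the published image's layers as the cache; layers whose
   # Dockerfile lines and inputs are unchanged are reused, not rebuilt.
   docker build \
       --cache-from apache/airflow:master \
       -t apache/airflow:master .

   # Push the result so the next build can use these layers as its cache.
   docker push apache/airflow:master
   ```

   The key point is that --cache-from lets a fresh CI machine reuse layers it never built itself, so caching survives across build workers.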
   
   2) Updates to apt-get packages:
   
  I've added the lines below to my Docker image. Currently they need to be 
triggered manually, but if keeping apt-get packages at their latest versions 
is a concern, this step can be moved. The lines can also be placed after the 
setup.py changes or the sources are added - that way the latest versions will 
be upgraded frequently (or very frequently), and you get the same result as 
if you had installed everything from scratch.
   
   RUN apt-get update \
       && apt-get upgrade -y --no-install-recommends \
       && apt-get clean
   
  This layer will grow over time, of course, but then it's worth rebuilding 
it from scratch periodically (and that's what I also implemented: if you 
increase the FORCE_REINSTALL_APT_GET_DEPENDENCIES env value, the apt-get 
dependencies are reinstalled from scratch). This way the apt-get install 
layer gets rebuilt from scratch with the latest dependencies, say at every 
official release or every quarter.
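
   A minimal sketch of that cache-busting trick (the ARG name follows the env value mentioned above; the base image and the rest of the Dockerfile are hypothetical). Changing the build-arg value invalidates Docker's cache from that line onwards, so the apt-get layer below it is rebuilt from scratch:

   ```dockerfile
   FROM debian:stretch-slim

   # Bumping this value invalidates the build cache from this line on,
   # forcing the apt-get layer below to be rebuilt from scratch.
   ARG FORCE_REINSTALL_APT_GET_DEPENDENCIES=1

   RUN apt-get update \
       && apt-get upgrade -y --no-install-recommends \
       && apt-get clean \
       && rm -rf /var/lib/apt/lists/*
   ```

   To force the rebuild you would pass, for example, `docker build --build-arg FORCE_REINSTALL_APT_GET_DEPENDENCIES=2 .` - any new value works, since only a change in the value matters for cache invalidation.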
   
  3) Other images: PHP/Python/Ruby are actually "baseline" kinds of images. 
They have almost no "external" dependencies beyond the core OS - unlike 
Airflow (which in fact has quite a lot of them).
   
  4) I think you looked at a different layer than the one I was vocal about. 
The 600 MB layer is exactly the one that will NOT be re-downloaded frequently 
in my solution. It contains the installed dependencies of Airflow (without 
the actual Airflow sources!). They don't change with every commit.
   
  With every commit where the sources change, you will only re-download the 
'68bc9494193a' layer from your example (10.4 MB), which is a bit more than 
10% of the full image. This is because the 600 MB layer is rebuilt only when 
setup.py changes - otherwise it's taken from the cache. setup.py changes far 
less frequently than the sources: the sources change with every commit, but 
setup.py only about once every two weeks, and sometimes it even goes three 
weeks unchanged. This means that during those two to three weeks, anyone 
syncing to the latest image saves a significant amount of time and bandwidth.
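
   The layering described above can be sketched like this (the base image and paths are hypothetical, and installing from setup.py alone before the sources are copied is simplified for illustration). Because setup.py is copied before the sources, the expensive pip-install layer is invalidated only when setup.py changes:

   ```dockerfile
   FROM python:3.6-slim

   WORKDIR /opt/airflow

   # Copy only setup.py first: the expensive pip-install layer below stays
   # cached (and is not re-downloaded by users) until this file changes.
   COPY setup.py ./
   RUN pip install --no-cache-dir -e .

   # Copy the full sources last: a source-only commit invalidates just this
   # small top layer, not the ~600 MB dependency layer above it.
   COPY . .
   ```

   This ordering is what makes the 600 MB layer stable across the two to three weeks between setup.py changes.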
   
  Later on we could split it even further, using the already existing 
setup.py split into core (more stable) and non-core (changing more 
frequently) dependencies. That way the big pip-install layer would be split 
further, and the core part would be downloaded even less frequently.
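
   Under that scheme the pip step would be split in two (the extras names used here are hypothetical, not Airflow's actual setup.py extras), so the stable core layer is cached independently of the faster-moving one:

   ```dockerfile
   # Stable core dependencies first: this layer is rebuilt only when the
   # core dependency list in setup.py changes.
   RUN pip install --no-cache-dir -e .[core]

   # Faster-moving, non-core dependencies go in their own layer on top,
   # so a change there does not invalidate the core layer below.
   RUN pip install --no-cache-dir -e .[non-core]
   ```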
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
