potiuk commented on issue #4483: [AIRFLOW-3673] Add official dockerfile URL: https://github.com/apache/airflow/pull/4483#issuecomment-453778598

@Fokko @ffinfo -> I see this is merged now. I will still try to convince you anyway :) I will soon open a PR to the main Apache line with some detailed calculations. I will prepare numbers and analysis, including download times and a simulation of usage by users, in order to show this rather than simply state it. But just for now, to answer your concerns @Fokko:

1) Docker caching is effectively forever. It won't 'expire' until you change a line in the Dockerfile or part of the context that is added in the Dockerfile (for example, the Airflow sources). What DockerHub does (well, at least this is best practice, and I am sure we can double-check with them) is pull the latest image from the same branch and use it as --cache-from; this way it always reuses the layers that were already published. This is not local-machine caching - it uses published image layers as the cache.

2) Updates to apt-get packages: I've added the lines below to my Docker image. Currently they need to be triggered manually, but if updating apt-get-installed packages to their latest versions is a concern, this step can be moved. These lines can always be placed after the setup.py changes or after the sources are added - this way the latest versions will be upgraded frequently (or very frequently), and you get the same result as if you had installed from scratch:

    RUN apt-get update \
        && apt-get upgrade -y --no-install-recommends \
        && apt-get clean

This layer will grow over time, of course, but then it is worth rebuilding it from scratch periodically - and that's what I also implemented: if you increase the FORCE_REINSTALL_APT_GET_DEPENDENCIES env value, the apt-get dependencies will be installed from scratch. This way the apt-get install layer will be rebuilt from scratch with the latest dependencies (say, every official release or every quarter).
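For reference, the --cache-from pattern described in (1) can be sketched as a pair of commands; the image name apache/airflow:master here is illustrative, not the actual repository tag:

```shell
# Pull the most recently published image for the branch
# (|| true: tolerate a missing image on a fresh repository)
docker pull apache/airflow:master || true

# Build using the published image's layers as the cache; any layer
# whose inputs (Dockerfile lines + copied context) are unchanged is reused
docker build --cache-from apache/airflow:master -t apache/airflow:master .
```

Because the cache source is a registry image rather than the local build cache, CI workers and DockerHub builders benefit from it even on clean machines.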
3) Other images: PHP/Python/Ruby are actually "baseline" kinds of images. They have almost no external dependencies beyond the core OS - unlike Airflow, which in fact has quite a lot of them.

4) I think you looked at a different layer than the one I was vocal about. The 600 MB layer is exactly the one that will NOT be re-downloaded frequently in my solution. It contains the installed dependencies for Airflow (without the actual Airflow sources!), and those don't change with every commit. With every commit where the sources are changed, you will only re-download the '68bc9494193a' layer from your example (10.4 MB), which is a bit more than 10% of the full image. This is because the 600 MB layer is only rebuilt when setup.py changes; otherwise it is taken from the cache. setup.py changes far less frequently than the sources - the sources change with every commit, but setup.py changes roughly once every two weeks, and sometimes it stays unchanged for three weeks. This means that during those two to three weeks, anyone syncing the latest image will save a significant amount of time and bandwidth.

Later on, we could split it even further, using the already existing setup.py split into core (more stable) and non-core (changing more frequently) dependencies. This way the big pip-install layer would be split further, and the core part would be downloaded even less frequently.
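The layer-ordering argument in (4) boils down to copying setup.py and installing dependencies before copying the sources, so the expensive pip-install layer is invalidated only when setup.py itself changes. A minimal sketch of this idea - the base image, paths, and install commands are assumptions for illustration, not the actual Airflow Dockerfile:

```Dockerfile
FROM python:3.6-slim

# Bumping this build arg invalidates the cache for every RUN below
# that references it, forcing a clean reinstall of apt dependencies
ARG FORCE_REINSTALL_APT_GET_DEPENDENCIES=1

RUN echo "reinstall marker: ${FORCE_REINSTALL_APT_GET_DEPENDENCIES}" \
    && apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && apt-get clean

WORKDIR /app

# Copy only setup.py first: the big dependency layer below is
# rebuilt only when setup.py changes (roughly every 2-3 weeks)
COPY setup.py /app/setup.py
RUN pip install -e .

# Copying the sources last means a source-only commit invalidates
# only this small layer (the ~10.4 MB layer in the example above)
COPY . /app
```

With this ordering, a source-only commit reuses the cached dependency layer, which is exactly why users pulling the updated image re-download only the small top layer.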
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]

With regards,
Apache Git Services
