Thank you, Sean, Jiaxin, Shane, and Tom, for the feedback.

1. For legal questions, please see the following three Apache-approved approaches. We can follow one of them:
   1. https://hub.docker.com/u/apache
      (93 repositories: Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
   2. https://hub.docker.com/_/solr
      (This is also official. There are more instances like this.)
   3. https://hub.docker.com/u/apachestreampipes
      (Some projects try this form.)

   (For what each scheme would look like from the user side, see the pull-command sketch at the very bottom of this mail, below the quoted thread.)

2. For non-Spark dev-environment images, they will definitely help both our Jenkins and GitHub Action jobs. The Apache Infra team also supports GitHub Actions secrets, like the following (a login sketch is also at the bottom of this mail):

   https://issues.apache.org/jira/browse/INFRA-19565
   (Create a Docker Hub secret for Github Actions)

3. For Spark image content questions, we should not do the following. This is not only because of legal issues, but also because we cannot include or maintain all popular libraries (like the Nvidia libraries or TensorFlow) in our image.

   https://issues.apache.org/jira/browse/SPARK-26398
   (Support building GPU docker images)

4. The way I see this is a minimal legal image containing only our artifacts from the following location. We can check the other Apache projects' best practices.

   https://www.apache.org/dist/spark/

5. For OS/Java/Python/R runtimes and libraries, those (except the OS) can generally be overlaid as additional layers by the users. I don't think we need to provide every combination of (Debian/Ubuntu/CentOS/Alpine) x (JDK/JRE) x (Python2/Python3/PyPy) x (R 3.5/3.6) x (many libraries). Specifically, I don't think we need to install all libraries like `arrow`. (A user-side Dockerfile sketch is at the bottom of this mail.)

6. For the target users, this is a general docker image. We don't need to assume that this is for a K8s-only environment. This can be used in any Docker environment.

7. For the number of images, as suggested in this thread, we may want to follow the approach of our existing K8s integration test suite by splitting the PySpark and R images from the Java one. But I don't have any hard requirement here.

What I want to propose in this thread is that we can start with a minimal viable product and evolve it (if needed) as an open source community.

Bests,
Dongjoon.

PS. BTW, the Apache Spark 2.4.5 artifacts are published to our doc website, our distribution repo, Maven Central, PyPI, CRAN, and Homebrew. I'm preparing the website news and the download page update.

On Thu, Feb 6, 2020 at 11:19 AM Tom Graves <tgraves...@yahoo.com> wrote:

> When discussions of docker have occurred in the past - mostly related to
> k8s - there is a lot of discussion about what is the right image to
> publish, as well as making sure Apache is ok with it. The Apache official
> release is the source code, so we may need to make sure to have a
> disclaimer, and we need to make sure it doesn't contain anything licensed
> that it shouldn't. What happens when one of the docker images we publish
> has a security update? We would need to make sure all the legal bases are
> covered first.
>
> Then the discussion comes to what is in the docker images and how useful
> they are. People run different OSes, different Python versions, etc. And,
> like Sean mentioned, how useful is it really, other than a few examples?
> Some discussions are on https://issues.apache.org/jira/browse/SPARK-24655
>
> Tom
>
>
>
> On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <
> dongjoon.h...@gmail.com> wrote:
>
>
> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
> - Public binary release (no snapshot image)
> - Public non-Spark base image (OS + R + Python)
>   (This can be used in GitHub Action jobs and Jenkins K8s integration
>   tests to speed up jobs and to have more stable environments)
>
> Bests,
> Dongjoon.
>
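PS2. To make point 1 concrete, here is what each naming scheme would look like from the user side. This is only a sketch: the `spark` repository names and the `2.4.5` tag below are hypothetical, since nothing is published yet.

    # Option 1: a repository under the shared `apache` organization
    docker pull apache/spark:2.4.5

    # Option 2: a Docker Official Image (the `solr` style)
    docker pull spark:2.4.5

    # Option 3: a dedicated per-project organization (the `apachestreampipes` style)
    docker pull apachespark/spark:2.4.5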
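PS3. For point 2, once Infra creates a Docker Hub secret (as in INFRA-19565), a CI job could log in and push roughly like the sketch below. The secret names `DOCKERHUB_USER` and `DOCKERHUB_TOKEN` and the image name are hypothetical placeholders, not the names Infra actually provisions.

    # Inside a GitHub Actions (or Jenkins) job step, with the secrets
    # exposed as environment variables:
    echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
    docker push apache/spark-dev-base:latest   # hypothetical dev-environment image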
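PS4. For point 5, here is a minimal sketch of the user-side overlay I mean. It assumes a hypothetical Debian-based minimal base image, `apache/spark:2.4.5`, containing only our artifacts; the image name and the library choices are illustrative only, not a proposal for specific content.

    # User-side Dockerfile: add the runtimes/libraries *you* need on top
    # of a minimal Spark base image, instead of Spark shipping them all.
    FROM apache/spark:2.4.5

    # Example: a team adds Python 3 plus the Arrow/NumPy libraries it wants.
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 python3-pip && \
        rm -rf /var/lib/apt/lists/*
    RUN pip3 install pyarrow numpy

The team then builds and tags the result itself, e.g. `docker build -t my-team/spark-py:2.4.5 .`, and we do not need to publish the whole (OS) x (JDK) x (Python) x (R) matrix.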