Mich Talebzadeh, Lead Solutions Architect/Engineering Lead, Palantir Technologies Limited, London, United Kingdom
view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

Forwarded Conversation
Subject: Time to start publishing Spark Docker Images?
------------------------

From: Holden Karau <hol...@pigscanfly.ca>
Date: Thu, 22 Jul 2021 at 04:13
To: dev <dev@spark.apache.org>

Hi Folks,

Many other distributed computing projects (https://hub.docker.com/r/rayproject/ray, https://hub.docker.com/u/daskdev) and ASF projects (https://hub.docker.com/u/apache) now publish their images to Docker Hub. We've already got the Docker image tooling in place. I think we'd need to ask the ASF to grant the PMC permission to publish containers, and to update the release steps, but I think this could be useful for folks.

Cheers,

Holden

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

----------
From: Kent Yao <yaooq...@gmail.com>
Date: Thu, 22 Jul 2021 at 04:22
To: Holden Karau <hol...@pigscanfly.ca>
Cc: dev <dev@spark.apache.org>

+1

Bests,

Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi <https://github.com/yaooqinn/kyuubi> is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark <http://spark.apache.org/>.
spark-authorizer <https://github.com/yaooqinn/spark-authorizer> A Spark SQL extension which provides SQL Standard Authorization for Apache Spark <http://spark.apache.org/>.
spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
itatchi <https://github.com/yaooqinn/spark-func-extras> A library that brings useful functions from various modern database management systems to Apache Spark <http://spark.apache.org/>.

----------
From: Hyukjin Kwon <gurwls...@gmail.com>
Date: Fri, 13 Aug 2021 at 01:44
To: Kent Yao <yaooq...@gmail.com>, Dongjoon Hyun <dongj...@apache.org>
Cc: Holden Karau <hol...@pigscanfly.ca>, dev <dev@spark.apache.org>

+1, I think we generally agreed upon having it. Thanks Holden for the heads-up and for driving this.

+@Dongjoon Hyun <dongj...@apache.org> FYI

On Thu, 22 Jul 2021 at 12:22, Kent Yao <yaooq...@gmail.com> wrote:

----------
From: John Zhuge <jzh...@apache.org>
Date: Fri, 13 Aug 2021 at 01:48
To: Hyukjin Kwon <gurwls...@gmail.com>
Cc: Dongjoon Hyun <dongj...@apache.org>, Holden Karau <hol...@pigscanfly.ca>, Kent Yao <yaooq...@gmail.com>, dev <dev@spark.apache.org>

+1

--
John Zhuge

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 01:54
To: John Zhuge <jzh...@apache.org>
Cc: Hyukjin Kwon <gurwls...@gmail.com>, Dongjoon Hyun <dongj...@apache.org>, Kent Yao <yaooq...@gmail.com>, dev <dev@spark.apache.org>

Awesome, I've filed an INFRA ticket to get the ball rolling.
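[For context, the image tooling referred to above is the docker-image-tool.sh script that ships with the Spark distribution. A minimal sketch of how it is typically invoked; the repository name and tag below are illustrative, not the eventual official ones:]

    # Build the spark, spark-py and spark-r images from an unpacked Spark distribution
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t 3.1.2 \
      -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
      -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile \
      build

    # Push the same tags to the registry
    ./bin/docker-image-tool.sh -r docker.io/myrepo -t 3.1.2 push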
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Fri, 13 Aug 2021 at 07:45
To:
Cc: dev <dev@spark.apache.org>

I concur this is a good idea and certainly worth exploring. In practice, preparing Docker images as deployables will throw up some challenges, because a Docker image for Spark is not really a singular modular unit in the way that, say, a Docker image for Jenkins is. It involves different versions and different images for Spark and PySpark, and it will most likely end up as part of a Kubernetes deployment. Individuals and organisations will deploy it as the first cut. Great, but I equally feel that good documentation on how to build a consumable, deployable image will be more valuable. From my own experience, the current documentation should be enhanced, for example on how to deploy working directories, add additional Python packages, build with different Java versions (version 8 or version 11), etc.

HTH

----------
From: Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
Date: Fri, 13 Aug 2021 at 08:13
To: dev <dev@spark.apache.org>

Hi all,

I am Meikel Bode and only an interested reader of the dev and user lists. Anyway, I would appreciate having official Docker images available. Maybe one could take inspiration from the Jupyter Docker stacks and provide a hierarchy of different images like this:

https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#image-relationships

Having a core image only supporting Java, an extended one supporting Python and/or R, etc. Looking forward to the discussion.

Best, Meikel

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:51
To: Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
Cc: dev <dev@spark.apache.org>

Agreed. I have already built a few of the latest for Spark and PYSpark on 3.1.1 with Java 8, as I found out Java 11 does not work with the Google BigQuery data warehouse. However, when hacking the Dockerfile one finds things out the hard way, for example how to add additional Python libraries like tensorflow etc. Loading these libraries through Kubernetes is not practical, as unzipping and installing them through --py-files etc. will take considerable time, so they need to be added to the Dockerfile at build time, in the directory for Python under Kubernetes, /opt/spark/kubernetes/dockerfiles/spark/bindings/python:

RUN pip install pyyaml numpy cx_Oracle tensorflow ....

Also you will need curl to test the ports from inside the Docker container:

RUN apt-get update && apt-get install -y curl
RUN ["apt-get","install","-y","vim"]

As I said, I am happy to build these specific Dockerfiles plus the complete documentation for them. I have already built one for Google (GCP). The difference between the Spark and PySpark versions is that in Spark/Scala a fat jar file will contain all that is needed. That is not the case with Python, I am afraid.
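[As an illustration of the customisation described above, a minimal sketch of a Dockerfile layered on top of an already-built spark-py image. The package list is illustrative, taken from the examples in this thread, and the base image argument is an assumption:]

    ARG base_img
    FROM $base_img

    # Switch to root to run installation tasks
    USER 0

    # OS-level tools for debugging from inside the container
    RUN apt-get update && \
        apt-get install -y curl vim && \
        rm -rf /var/lib/apt/lists/*

    # Extra Python packages baked in at build time
    RUN pip install pyyaml numpy cx_Oracle tensorflow

    # Drop back to the regular Spark UID
    ARG spark_uid=185
    USER ${spark_uid}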
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:59
To: Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>
Cc: dev <dev@spark.apache.org>

should read PySpark

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 17:26
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>, dev <dev@spark.apache.org>

So we actually do have a script that does the build already; it's more a matter of publishing the results for easier use. Currently the script produces three images: spark, spark-py, and spark-r. I can certainly see a solid reason to publish with a jdk11 and jdk8 suffix as well, if there is interest in the community.

If we want to have, say, a spark-py-pandas image with everything necessary for the Koalas stuff to work, then I think that could be a great PR for someone to add :)

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Fri, 13 Aug 2021 at 23:43
To: Holden Karau <hol...@pigscanfly.ca>
Cc: Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de>, dev <dev@spark.apache.org>

Hi,

We can cater for multiple types (spark, spark-py and spark-r) and Spark versions (assuming they are downloaded and available). The challenge is that these Docker images, once built, are snapshots. They cannot be amended later, and if you change anything by going inside the container, as soon as you log out whatever you did is reversed. For example, I want to add tensorflow to my Docker image. These are my images:

REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB

Using the image ID, I log in as root to the image:

docker run -u0 -it cfbb0e69f204 bash

root@b542b0f1483d:/opt/spark/work-dir# pip install keras
Collecting keras
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: keras
Successfully installed keras-2.6.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

root@b542b0f1483d:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keras         2.6.0   <--- it is here
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3
root@b542b0f1483d:/opt/spark/work-dir# exit

Now I have exited from the container and log in again:

(pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0 -it cfbb0e69f204 bash
root@5231ee95aa83:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

Hm, that keras is not there. The Docker image cannot be altered after the build, so once the Docker image is created it is just a snapshot. However, it will still have tons of useful stuff for most users/organisations. My suggestion is to create, for a given type (spark, spark-py etc.):

1. One vanilla flavour for everyday use with a few useful packages
2. One for medium use with the most common packages for ETL/ELT work
3. One specialist image for ML etc. with keras, tensorflow and anything else needed

These images should be maintained as we currently maintain Spark releases, with accompanying documentation. Any reason why we cannot maintain them ourselves?
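[As a side note, interactive changes such as the keras install above can be captured into a new image with docker commit, although baking packages into the Dockerfile remains the reproducible route. A hypothetical sketch, reusing the image ID from the listing above; the resulting tag name is illustrative:]

    # Run the install in a named container, then commit that container as a new image
    docker run -u0 --name keras-test cfbb0e69f204 pip install keras
    docker commit keras-test eu.gcr.io/axial-glow-224522/spark-py:java8_3.1.1_keras
    docker rm keras-test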
----------
From: Maciej <mszymkiew...@gmail.com>
Date: Mon, 16 Aug 2021 at 18:46
To: <dev@spark.apache.org>

I have a few concerns regarding the PySpark and SparkR images.

First of all, how do we plan to handle interpreter versions? Ideally, we should provide images for all supported variants, but based on the preceding discussion and the proposed naming convention, I assume that is not going to happen. If that's the case, it would be great if we could fix interpreter versions based on some support criteria (lowest supported, lowest non-deprecated, highest supported at the time of release, etc.). Currently we use the following:

- for R, the buster-cran35 Debian repositories, which install R 3.6 (the provided version already changed in the past and broke the image build ‒ SPARK-28606);
- for Python, the system-provided python3 packages, which currently provide Python 3.7;

which don't guarantee stability over time and might be hard to synchronize with our support matrix.

Secondly, omitting libraries which are required for full functionality and performance, specifically

- NumPy, Pandas and Arrow for PySpark,
- Arrow for SparkR,

is likely to severely limit the usability of the images (out of these, Arrow is probably the hardest to manage, especially when you already depend on system packages to provide the R or Python interpreter).

--
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 03:05
To: Maciej <mszymkiew...@gmail.com>
Cc: <dev@spark.apache.org>

These are some really good points all around. I think, in the interest of simplicity, we'll start with just the 3 current Dockerfiles in the Spark repo, but for the next release (3.3) we should explore adding some more Dockerfiles/build options.

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:27
To:
Cc: Spark dev list <dev@spark.apache.org>

Thanks for the notes all. I think we ought to consider what general Docker usage is. A Docker image is by definition a self-contained, general-purpose entity providing the Spark service at the common denominator. Some Docker images, like the one for Jenkins, are far simpler to build as they have fewer moving parts. With Spark you either deploy your own set-up, get a service provider like Google to provide it as a service (and they do not provide the most recent version; for example, a Dataproc cluster runs on Spark 3.1.1), or use Docker in a Kubernetes cluster such as GKE. They provide the cluster and the cluster management, but you deploy your own Docker image.

I don't know much about spark-R, but I think if we take the latest Spark, spark-py built on 3.7.3, and a spark-py for data science with the most widely used and current packages, we should be OK. This set-up will work as long as the documentation goes into the details of interpreter and package versions and provides a blueprint to build your own custom version of the Docker image with whatever versions you prefer.
As I said, if we look at Flink, they provide images by Scala and Java version, and of course latest:

1.13.2-scala_2.12-java8, 1.13-scala_2.12-java8, scala_2.12-java8, 1.13.2-scala_2.12, 1.13-scala_2.12, scala_2.12, 1.13.2-java8, 1.13-java8, java8, 1.13.2, 1.13, latest

I personally believe that providing the most popular ones serves the purpose for the community, and anything above and beyond has to be tailor-made.

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 07:36
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Spark dev list <dev@spark.apache.org>

I as well think the largest use case for the Docker images would be on Kubernetes. While I have nothing against us adding more varieties, I think it's important for us to get this started with our current containers, so I'll do that, but let's certainly continue exploring improvements after that.

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:40
To: Holden Karau <hol...@pigscanfly.ca>
Cc: Spark dev list <dev@spark.apache.org>

An interesting point. Do we have a repository for the current containers? I am not aware of one.

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:05
To: Maciej <mszymkiew...@gmail.com>
Cc: Spark dev list <dev@spark.apache.org>

Of course with PySpark there is the option of packaging your dependencies in gz format and sending them at spark-submit time:

--conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \

However, in a Kubernetes cluster that file is going to be fairly massive and will take time to unzip and distribute. The interpreter will be whatever comes with the Docker image!

On Mon, 16 Aug 2021 at 18:46, Maciej <mszymkiew...@gmail.com> wrote:

----------
From: Maciej <mszymkiew...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:42
To: <dev@spark.apache.org>

You're right, but with native dependencies (which is the case for the packages I mentioned before) we have to bundle complete environments. It is doable, but if you do that, you're actually better off with a base image. I don't insist it is something we have to address right now, just something to keep in mind.

----------
From: Maciej <mszymkiew...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:51
To: Holden Karau <hol...@pigscanfly.ca>
Cc: <dev@spark.apache.org>

On 8/17/21 4:04 AM, Holden Karau wrote:
> These are some really good points all around. I think, in the interest of simplicity, we'll start with just the 3 current Dockerfiles in the Spark repo but for the next release (3.3) we should explore adding some more Dockerfiles/build options.

Sounds good. However, I'd consider adding the guest language version to the tag names, i.e.

3.1.2_sparkpy_3.7-scala_2.12-java11
3.1.2_sparkR_3.6-scala_2.12-java11

and some basic safeguards in the layers, to make sure that these are really the versions we use.
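[A minimal sketch of the kind of build-time safeguard suggested above, assuming the system interpreter in the spark-py image is meant to be Python 3.7; the expected version is an assumption for illustration:]

    # Fail the image build if the interpreter is not the version the tag advertises
    RUN python3 -c "import sys; assert sys.version_info[:2] == (3, 7), sys.version"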
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 10:31
To: Maciej <mszymkiew...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, Spark dev list <dev@spark.apache.org>

3.1.2_sparkpy_3.7-scala_2.12-java11
3.1.2_sparkR_3.6-scala_2.12-java11

Yes, let us go with that, and remember that we can change the tags at any time. The accompanying release note should detail what is inside the downloaded image.

+1 for me

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 13:24
To:
Cc: Spark dev list <dev@spark.apache.org>

Examples:

docker images

REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago   635MB

----------
From: Maciej <mszymkiew...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:17
To: <dev@spark.apache.org>

Quick question ‒ is this actual output? If so, do we know what accounts for the 1.5GB overhead of the PySpark image? Even without --no-install-recommends this seems like a lot (if I recall correctly it was around 400MB for the existing images).

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:24
To: Maciej <mszymkiew...@gmail.com>
Cc: Spark dev list <dev@spark.apache.org>

Yes, I will double check. It includes Java 8 in addition to the base Java 11. In addition it has these Python packages (added for my own needs for now):

root@ce6773017a14:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.2
pip           21.2.4
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:55
To: Maciej <mszymkiew...@gmail.com>
Cc: Spark dev list <dev@spark.apache.org>

With no additional Python packages etc. we get 1.41GB, compared to 2.19GB before:

REPOSITORY       TAG                                      IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8only   faee4dbb95dd   Less than a second ago   1.41GB
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ba3c17bc9337   4 hours ago              2.19GB

root@233a81199b43:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.2.4
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25

----------
From: Andrew Melo <andrew.m...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:57
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

Silly Q, did you blow away the pip cache before committing the layer? That always trips me up.

Cheers
Andrew

--
It's dark in this basement.
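[One way to see, layer by layer, where the size of such an image goes is docker history; a hypothetical check against the image ID from the listing above:]

    # Show the size contributed by each layer of the 2.19GB spark-py image
    docker history ba3c17bc9337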
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:29
To: Andrew Melo <andrew.m...@gmail.com>
Cc: Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

Hi Andrew,

Can you please elaborate on blowing away the pip cache before committing the layer?

Thanks,

Mich

----------
From: Andrew Melo <andrew.m...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:44
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

Hi Mich,

By default, pip caches downloaded binaries to somewhere like $HOME/.cache/pip. So after doing any "pip install", you'll want to either delete that directory or pass the "--no-cache-dir" option to pip, to prevent the downloaded binaries from being added to the image.

HTH
Andrew

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:05
To: Andrew Melo <andrew.m...@gmail.com>
Cc: Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

Thanks Andrew, that was helpful.

Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir

And the reduction in size is considerable: 1.75GB vs 2.19GB. Note that the original image has now been untagged.

REPOSITORY       TAG                                  IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ecef8bd15731   Less than a second ago   1.75GB
<none>           <none>                               ba3c17bc9337   10 hours ago             2.19GB

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 23:26
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

pip installing pyspark like that probably isn't a great idea, since there isn't a version pinned to it. It's probably better to install from the local files copied in than potentially from PyPI. It might be possible to install in -e mode, where it'll use symlinks to save space; I'm not sure.

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:52
To: Holden Karau <hol...@pigscanfly.ca>
Cc: Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

True. Well, we need to decide what packages need to be installed with spark-py; PySpark is not one of them. The Docker build itself takes care of PySpark by copying it from the $SPARK_HOME directory:

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

Please review the Dockerfile for Python in $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile and make any changes needed:

ARG base_img

FROM $base_img

WORKDIR /

# Reset to root to run installation tasks
USER 0

RUN mkdir ${SPARK_HOME}/python
RUN apt-get update && \
    apt install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}
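[Following on from the points above, a hedged sketch of how an extra-package layer might look with versions pinned and the pip cache disabled. The package list and versions are illustrative, and PySpark itself is deliberately left out since the Dockerfile already copies it in:]

    # Extra packages only: pin versions, skip the pip download cache
    RUN pip install --no-cache-dir pyyaml==5.4.1 numpy==1.21.2 cx_Oracle==8.2.1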
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Wed, 18 Aug 2021 at 10:10
To: Holden Karau <hol...@pigscanfly.ca>
Cc: Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

A related point: the Docker image comes with the following Java:

root@73a798cc3303:/opt/spark/work-dir# java -version
openjdk version "11.0.12" 2021-07-20
OpenJDK Runtime Environment 18.9 (build 11.0.12+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7, mixed mode, sharing)

For Java 8, I believe Debian buster does not provide Java 8, so it will have to be added to the Docker image. Is there any particular Java 8 we should go for? For now I am using jdk1.8.0_201, which is Oracle Java. Current Debian versions built in GCP use openjdk version "1.8.0_292".

Shall we choose and adopt one Java 8 version for the Docker images? This would be in addition to the Java 11 already installed with the base.

----------
From: Holden Karau <hol...@pigscanfly.ca>
Date: Wed, 18 Aug 2021 at 21:27
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

So the default image we use right now for the 3.2 line is 11-jre-slim; in 3.0 we used 8-jre-slim. I think these are OK bases for us to build from, unless someone has a good reason otherwise?

----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Wed, 18 Aug 2021 at 23:42
To: Holden Karau <hol...@pigscanfly.ca>
Cc: Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

We have both base images now:

REPOSITORY   TAG           IMAGE ID       CREATED        SIZE
openjdk      8-jre-slim    0d0a85fdf642   40 hours ago   187MB
openjdk      11-jre-slim   eb77da2ec13c   3 weeks ago    221MB

The only difference is the Java version. For 11-jre-slim we have:

ARG java_image_tag=11-jre-slim
FROM openjdk:${java_image_tag}

And for 8-jre-slim:

ARG java_image_tag=8-jre-slim
FROM openjdk:${java_image_tag}

----------
From: Ankit Gupta <info.ank...@gmail.com>
Date: Sat, 21 Aug 2021 at 15:50
To: Mich Talebzadeh <mich.talebza...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, Andrew Melo <andrew.m...@gmail.com>, Maciej <mszymkiew...@gmail.com>, Spark dev list <dev@spark.apache.org>

Hey All,

Just a suggestion, or maybe a future enhancement: we should also try and use different base OSs like buster, alpine, slim, stretch, etc., and add that to the tag as well. This will help users choose the images according to their requirements.

Thanks and Regards,

Ankit Prakash Gupta
info.ank...@gmail.com
LinkedIn: https://www.linkedin.com/in/infoankitp/
Medium: https://medium.com/@info.ankitp
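[For the Java discussion above, a quick, hypothetical way to confirm which JVM a candidate base image actually carries before building on top of it:]

    # Print the JVM version shipped in each candidate base image
    docker run --rm openjdk:8-jre-slim java -version
    docker run --rm openjdk:11-jre-slim java -version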
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Sat, 25 Dec 2021 at 16:24
To:
Cc: Spark dev list <dev@spark.apache.org>

Season's greetings to all.

A while back we discussed publishing Docker images, mainly for Kubernetes. An increasing number of people are using Spark on Kubernetes. Following our previous discussions, what matters is the tag, which is the detailed identifier of the image used. These images are normally loaded into the container/artefact registries in the cloud. For example, with SPARK_VERSION, SCALA_VERSION, DOCKERIMAGETAG, BASE_OS and the Dockerfile used:

export PROJECT_ID=$(gcloud info --format='value(config.project)')
export GCP_CR=eu.gcr.io/${PROJECT_ID}
BASE_OS="buster"
SPARK_VERSION="3.1.1"
SCALA_VERSION="scala_2.12"
DOCKERFILE="java8PlusPackages"
DOCKERIMAGETAG="8-jre-slim"

# Build the Docker image from the provided Dockerfile
cd $SPARK_HOME
/opt/spark/bin/docker-image-tool.sh \
  -r $GCP_CR \
  -t ${SPARK_VERSION}-${SCALA_VERSION}-${DOCKERIMAGETAG}-${BASE_OS}-${DOCKERFILE} \
  -b java_image_tag=${DOCKERIMAGETAG} \
  -p ./kubernetes/dockerfiles/spark/bindings/python/${DOCKERFILE} \
  build

This results in a Docker image created with the tag

IMAGEDRIVER="eu.gcr.io/<PROJECT_ID>/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-java8PlusPackages"

and

--conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
--conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \

The question is: do we need anything else in the tag itself, or is enough information provided?

Cheers
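[To put the tag to use, a hypothetical spark-submit against Kubernetes with the image built above; the API server address, namespace, service account and application file are illustrative:]

    spark-submit \
      --master k8s://https://<K8S_API_SERVER>:443 \
      --deploy-mode cluster \
      --name spark-pi \
      --conf spark.kubernetes.namespace=spark \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
      --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
      --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
      --conf spark.executor.instances=2 \
      local:///opt/spark/examples/src/main/python/pi.py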
----------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: Mon, 21 Feb 2022 at 22:08
To: Holden Karau <hol...@pigscanfly.ca>
Cc: dev <dev@spark.apache.org>

forwarded

view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh