Pip installing pyspark like that probably isn't a great idea, since there isn't a version pinned to it. It's probably better to install from the local files copied into the image than potentially from PyPI. You might also be able to install in -e mode, where it'll use symlinks to save space, but I'm not sure.
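As a rough sketch (untested; it assumes the distribution's python/ directory, which ships its own setup.py, was already copied to /opt/spark/python in an earlier layer), something like:

RUN pip install --no-cache-dir /opt/spark/python
# or, if editable mode works in this layout, reference the sources in
# place instead of copying them into site-packages:
# RUN pip install --no-cache-dir -e /opt/spark/python

That way the version installed is the one the image was built from, not whatever PyPI happens to serve.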
On Tue, Aug 17, 2021 at 3:12 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks Andrew, that was helpful.

Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir

And the reduction in size is considerable, 1.75GB vs 2.19GB. Note that the original build has now been invalidated (hence the <none> tag):

REPOSITORY       TAG                                  IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ecef8bd15731   Less than a second ago   1.75GB
<none>           <none>                               ba3c17bc9337   10 hours ago             2.19GB

HTH

On Tue, 17 Aug 2021 at 20:44, Andrew Melo <andrew.m...@gmail.com> wrote:

Hi Mich,

By default, pip caches downloaded binaries somewhere like $HOME/.cache/pip. So after doing any "pip install", you'll want to either delete that directory, or pass the "--no-cache-dir" option to pip, to prevent the downloaded binaries from being added to the image.
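In Dockerfile terms, either of these works (a sketch; "numpy" is just a placeholder package):

RUN pip install --no-cache-dir numpy
# or clean up within the same RUN, before the layer is committed:
RUN pip install numpy && rm -rf /root/.cache/pip

The delete has to happen in the same RUN instruction: removing the cache in a later layer does not shrink the earlier one.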
HTH
Andrew

On Tue, Aug 17, 2021 at 2:29 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi Andrew,

Can you please elaborate on blowing away the pip cache before committing the layer?

Thanks,

Mich

On Tue, 17 Aug 2021 at 16:57, Andrew Melo <andrew.m...@gmail.com> wrote:

Silly Q, did you blow away the pip cache before committing the layer? That always trips me up.

Cheers
Andrew

On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

With no additional Python packages etc. we get 1.4GB, compared to 2.19GB before:

REPOSITORY       TAG                                      IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8only   faee4dbb95dd   Less than a second ago   1.41GB
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ba3c17bc9337   4 hours ago              2.19GB

root@233a81199b43:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.2.4
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

HTH

On Tue, 17 Aug 2021 at 16:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Yes, I will double check. It includes Java 8 in addition to the base Java 11.

In addition, it has these Python packages for now (added for my own needs):

root@ce6773017a14:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.2
pip           21.2.4
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

HTH

On Tue, 17 Aug 2021 at 16:17, Maciej <mszymkiew...@gmail.com> wrote:

Quick question: is this actual output? If so, do we know what accounts for the 1.5GB overhead of the PySpark image? Even without --no-install-recommends this seems like a lot (if I recall correctly, it was around 400MB for the existing images).

On 8/17/21 2:24 PM, Mich Talebzadeh wrote:

Examples:

docker images

REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago   635MB

On Tue, 17 Aug 2021 at 10:31, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

3.1.2_sparkpy_3.7-scala_2.12-java11
3.1.2_sparkR_3.6-scala_2.12-java11

Yes, let us go with that, and remember that we can change the tags anytime. The accompanying release note should detail what is inside the downloaded image.

+1 for me

On Tue, 17 Aug 2021 at 09:51, Maciej <mszymkiew...@gmail.com> wrote:

On 8/17/21 4:04 AM, Holden Karau wrote:

These are some really good points all around. I think, in the interest of simplicity, we'll start with just the 3 current Dockerfiles in the Spark repo, but for the next release (3.3) we should explore adding some more Dockerfiles/build options.

Sounds good. However, I'd consider adding the guest language version to the tag names, i.e.

3.1.2_sparkpy_3.7-scala_2.12-java11
3.1.2_sparkR_3.6-scala_2.12-java11

and some basic safeguards in the layers, to make sure that these are really the versions we use.
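For example (just a sketch, with an illustrative version number), a layer like this would fail the build if the system interpreter drifts:

RUN python3 -c 'import sys; assert sys.version_info[:2] == (3, 7), sys.version'

and an equivalent check for R.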
On Mon, Aug 16, 2021 at 10:46 AM Maciej <mszymkiew...@gmail.com> wrote:

I have a few concerns regarding the PySpark and SparkR images.

First of all, how do we plan to handle interpreter versions? Ideally, we should provide images for all supported variants, but based on the preceding discussion and the proposed naming convention, I assume that is not going to happen. If that's the case, it would be great if we could fix interpreter versions based on some support criteria (lowest supported, lowest non-deprecated, highest supported at the time of release, etc.)

Currently, we use the following:

- for R, the buster-cran35 Debian repositories, which install R 3.6 (the provided version already changed in the past and broke the image build, see SPARK-28606);
- for Python, the system-provided python3 packages, which currently provide Python 3.7;

which don't guarantee stability over time and might be hard to synchronize with our support matrix.

Secondly, omitting libraries which are required for full functionality and performance, specifically

- NumPy, Pandas and Arrow for PySpark
- Arrow for SparkR

is likely to severely limit the usability of the images (out of these, Arrow is probably the hardest to manage, especially when you already depend on system packages to provide the R or Python interpreter).
On 8/14/21 12:43 AM, Mich Talebzadeh wrote:

Hi,

We can cater for multiple types (spark, spark-py and spark-r) and Spark versions (assuming they are downloaded and available). The challenge is that these docker images are snapshots once built. They cannot be amended later: if you change anything by going inside a running container, the image itself is untouched, and the next docker run starts from scratch.

For example, I want to add tensorflow to my docker image. These are my images:

REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB

Using the image ID, I log in as root:

docker run -u0 -it cfbb0e69f204 bash

root@b542b0f1483d:/opt/spark/work-dir# pip install keras
Collecting keras
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 1.1 MB/s
Installing collected packages: keras
Successfully installed keras-2.6.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
root@b542b0f1483d:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keras         2.6.0    <--- it is here
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3
root@b542b0f1483d:/opt/spark/work-dir# exit

Now I exit from the container and log in again:

(pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0 -it cfbb0e69f204 bash

root@5231ee95aa83:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

Hmm, keras is not there. The docker image cannot be altered after it is built! Once the docker image is created, it is just a snapshot. However, it will still have tons of useful stuff for most users/organisations. My suggestion is to create, for a given type (spark, spark-py etc.):

1. One vanilla flavour for everyday use with a few useful packages
2. One for medium use with the most common packages for ETL/ELT work
3. One specialist flavour for ML etc. with keras, tensorflow and anything else needed

These images should be maintained as we currently maintain Spark releases, with accompanying documentation. Any reason why we cannot maintain them ourselves?
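(For completeness: to actually keep such a change, one either bakes it in at build time with a RUN pip install line in the Dockerfile and rebuilds, or captures the modified container as a new image, e.g.

docker commit b542b0f1483d spark-py:java8_3.1.1-keras

where the tag is purely illustrative. Rebuilding from the Dockerfile is the cleaner and more reproducible option.)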
HTH

On Fri, 13 Aug 2021 at 17:26, Holden Karau <hol...@pigscanfly.ca> wrote:

So we actually do have a script that does the build already; it's more a matter of publishing the results for easier use. Currently the script produces three images: spark, spark-py, and spark-r. I can certainly see a solid reason to publish with a jdk11 & jdk8 suffix as well, if there is interest in the community. If we want to have, say, a spark-py-pandas Spark container image with everything necessary for the Koalas stuff to work, then I think that could be a great PR from someone to add :)

On Fri, Aug 13, 2021 at 1:00 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

should read PySpark

On Fri, 13 Aug 2021 at 08:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Agreed.

I have already built a few of the latest for Spark and PYSpark on 3.1.1 with Java 8, as I found out Java 11 does not work with the Google BigQuery data warehouse. However, one finds out how to hack the Dockerfile the hard way.

For example, how to add additional Python libraries like tensorflow etc. Loading these libraries through Kubernetes is not practical, as unzipping and installing them through --py-files etc. will take considerable time, so they need to be added to the Dockerfile at build time, in the directory for Python under Kubernetes:

/opt/spark/kubernetes/dockerfiles/spark/bindings/python

RUN pip install pyyaml numpy cx_Oracle tensorflow ....
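An equivalent sketch that keeps the package list out of the Dockerfile itself, assuming a requirements.txt file is kept next to it:

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

This makes it easier to pin package versions per image flavour.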
Also, you will need curl to test the ports from inside the docker container:

RUN apt-get update && apt-get install -y curl vim

As I said, I am happy to build these specific Dockerfiles plus the complete documentation for them. I have already built one for Google (GCP). The difference between the Spark and PySpark versions is that in Spark/Scala a fat jar file will contain all that is needed. That is not the case with Python, I am afraid.

HTH

On Fri, 13 Aug 2021 at 08:13, Bode, Meikel, NMA-CFD <meikel.b...@bertelsmann.de> wrote:

Hi all,

I am Meikel Bode and only an interested reader of the dev and user lists. Anyway, I would appreciate having official docker images available.

Maybe one could take inspiration from the Jupyter docker stacks and provide a hierarchy of different images like this:

https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#image-relationships

Having a core image only supporting Java, an extended one supporting Python and/or R, etc.
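Sketched as a chained build, where every image name and tag is purely illustrative, the Python image would simply build on the core one:

# Dockerfile.python (illustrative)
FROM spark-core:3.1.2-scala_2.12-java11
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

That way, each level of the hierarchy only pays for what it adds.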
Looking forward to the discussion.

Best,
Meikel

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Friday, 13 August 2021 08:45
Cc: dev <dev@spark.apache.org>
Subject: Re: Time to start publishing Spark Docker Images?

I concur this is a good idea and certainly worth exploring.

In practice, preparing docker images as deployables will throw up some challenges, because creating docker for Spark is not really a singular modular unit like, say, creating docker for Jenkins. It involves different versions and different images for Spark and PySpark, and will most likely end up as part of a Kubernetes deployment.

Individuals and organisations will deploy it as the first cut. Great, but I equally feel that good documentation on how to build a consumable, deployable image will be more valuable. From my own experience, the current documentation should be enhanced, for example: how to deploy working directories, additional Python packages, and builds with different Java versions (version 8 or version 11), etc.

HTH

On Fri, 13 Aug 2021 at 01:54, Holden Karau <hol...@pigscanfly.ca> wrote:

Awesome, I've filed an INFRA ticket to get the ball rolling.

On Thu, Aug 12, 2021 at 5:48 PM John Zhuge <jzh...@apache.org> wrote:

+1

On Thu, Aug 12, 2021 at 5:44 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

+1, I think we generally agreed upon having it. Thanks Holden for the heads-up and for driving this.

+@Dongjoon Hyun <dongj...@apache.org> FYI

On Thu, 22 Jul 2021 at 12:22 PM, Kent Yao <yaooq...@gmail.com> wrote:

+1

Bests,

Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.