pip installing pyspark like that probably isn't a great idea, since there
isn't a version pinned to it. It's probably better to install from the local
files copied into the image than to potentially pull a different version from
PyPI. It might also be possible to install in -e (editable) mode, where pip
links back to the source tree instead of copying it, which would save space;
I'm not sure.
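
For example, either of these in the Dockerfile would avoid the problem
(illustrative only; the local-install variant assumes the full python/
directory, including setup.py, was copied into the image under
/opt/spark/python):

# Pin the version so the build is reproducible and matches the Spark in the image:
RUN pip install --no-cache-dir pyyaml numpy cx_Oracle pyspark==3.1.1

# Or install the PySpark that ships with the Spark distribution itself,
# so the Python package always matches the jars under /opt/spark:
RUN pip install --no-cache-dir /opt/spark/python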

On Tue, Aug 17, 2021 at 3:12 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Andrew, that was helpful.
>
> Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir
>
> And the reduction in size is considerable: 1.75GB vs 2.19GB. Note that
> the original image has now been invalidated (it shows up as <none> below):
>
> REPOSITORY       TAG                                      IMAGE ID       CREATED                  SIZE
> spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ecef8bd15731   Less than a second ago   1.75GB
> <none>           <none>                                   ba3c17bc9337   10 hours ago             2.19GB
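>
> Dangling <none> images like that one can be cleaned up afterwards if not
> needed, for example with:
>
> docker image prune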
>
>
> HTH
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 17 Aug 2021 at 20:44, Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> By default, pip caches downloaded packages somewhere like
>> $HOME/.cache/pip. So after doing any "pip install" in a Dockerfile, you'll want
>> to either delete that directory in the same layer, or pass the "--no-cache-dir"
>> option to pip, so the downloaded packages don't end up baked into the image.
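>>
>> For example, either of these (sketch only) keeps the cache out of the image:
>>
>> RUN pip install --no-cache-dir pyyaml numpy cx_Oracle
>>
>> # or purge the cache in the same RUN that populated it:
>> RUN pip install pyyaml numpy cx_Oracle && rm -rf ~/.cache/pip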
>>
>> HTH
>> Andrew
>>
>> On Tue, Aug 17, 2021 at 2:29 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> Can you please elaborate on blowing away the pip cache before committing
>>> the layer?
>>>
>>> Thanks,
>>>
>>> Mich
>>>
>>> On Tue, 17 Aug 2021 at 16:57, Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Silly Q, did you blow away the pip cache before committing the layer?
>>>> That always trips me up.
>>>>
>>>> Cheers
>>>> Andrew
>>>>
>>>> On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> With no additional Python packages etc. we get 1.41GB, compared to 2.19GB
>>>>> before:
>>>>>
>>>>> REPOSITORY       TAG                                       IMAGE ID       CREATED                  SIZE
>>>>> spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8only    faee4dbb95dd   Less than a second ago   1.41GB
>>>>> spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8        ba3c17bc9337   4 hours ago              2.19GB
>>>>>
>>>>> root@233a81199b43:/opt/spark/work-dir# pip list
>>>>> Package       Version
>>>>> ------------- -------
>>>>> asn1crypto    0.24.0
>>>>> cryptography  2.6.1
>>>>> entrypoints   0.3
>>>>> keyring       17.1.1
>>>>> keyrings.alt  3.1.1
>>>>> pip           21.2.4
>>>>> pycrypto      2.6.1
>>>>> PyGObject     3.30.4
>>>>> pyxdg         0.25
>>>>> SecretStorage 2.3.1
>>>>> setuptools    57.4.0
>>>>> six           1.12.0
>>>>> wheel         0.32.3
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 17 Aug 2021 at 16:24, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I will double check. It includes Java 8 in addition to the base
>>>>>> Java 11.
>>>>>>
>>>>>> In addition it has these Python packages (added for my own needs for now):
>>>>>>
>>>>>> root@ce6773017a14:/opt/spark/work-dir# pip list
>>>>>> Package       Version
>>>>>> ------------- -------
>>>>>> asn1crypto    0.24.0
>>>>>> cryptography  2.6.1
>>>>>> cx-Oracle     8.2.1
>>>>>> entrypoints   0.3
>>>>>> keyring       17.1.1
>>>>>> keyrings.alt  3.1.1
>>>>>> numpy         1.21.2
>>>>>> pip           21.2.4
>>>>>> py4j          0.10.9
>>>>>> pycrypto      2.6.1
>>>>>> PyGObject     3.30.4
>>>>>> pyspark       3.1.2
>>>>>> pyxdg         0.25
>>>>>> PyYAML        5.4.1
>>>>>> SecretStorage 2.3.1
>>>>>> setuptools    57.4.0
>>>>>> six           1.12.0
>>>>>> wheel         0.32.3
>>>>>>
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 17 Aug 2021 at 16:17, Maciej <mszymkiew...@gmail.com> wrote:
>>>>>>
>>>>>>> Quick question ‒ is this actual output? If so, do we know what accounts
>>>>>>> for the ~1.5GB overhead of the PySpark image? Even without
>>>>>>> --no-install-recommends this seems like a lot (if I recall correctly it
>>>>>>> was around 400MB for the existing images).
>>>>>>>
>>>>>>>
>>>>>>> On 8/17/21 2:24 PM, Mich Talebzadeh wrote:
>>>>>>>
>>>>>>> Examples:
>>>>>>>
>>>>>>> *docker images*
>>>>>>>
>>>>>>> REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
>>>>>>> spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
>>>>>>> spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago   635MB
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 17 Aug 2021 at 10:31, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> 3.1.2_sparkpy_3.7-scala_2.12-java11
>>>>>>>>
>>>>>>>> 3.1.2_sparkR_3.6-scala_2.12-java11
>>>>>>>> Yes, let us go with that, and remember that we can change the tags
>>>>>>>> anytime. The accompanying release notes should detail what is inside the
>>>>>>>> downloaded image.
>>>>>>>>
>>>>>>>> +1 for me
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 17 Aug 2021 at 09:51, Maciej <mszymkiew...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On 8/17/21 4:04 AM, Holden Karau wrote:
>>>>>>>>>
>>>>>>>>> These are some really good points all around.
>>>>>>>>>
>>>>>>>>> I think, in the interest of simplicity, we'll start with just the 3
>>>>>>>>> current Dockerfiles in the Spark repo, but for the next release (3.3) we
>>>>>>>>> should explore adding some more Dockerfiles/build options.
>>>>>>>>>
>>>>>>>>> Sounds good.
>>>>>>>>>
>>>>>>>>> However, I'd consider adding the guest language version to the tag
>>>>>>>>> names, i.e.
>>>>>>>>>
>>>>>>>>> 3.1.2_sparkpy_3.7-scala_2.12-java11
>>>>>>>>>
>>>>>>>>> 3.1.2_sparkR_3.6-scala_2.12-java11
>>>>>>>>>
>>>>>>>>> and some basic safeguards in the layers, to make sure that these
>>>>>>>>> are really the versions we use.
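>>>>>>>>>
>>>>>>>>> As a sketch of such a safeguard (assuming the Python 3.7 image), a line
>>>>>>>>> like this would fail the build if the interpreter drifts:
>>>>>>>>>
>>>>>>>>> RUN python3 -c "import sys; assert sys.version_info[:2] == (3, 7), sys.version"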
>>>>>>>>>
>>>>>>>>> On Mon, Aug 16, 2021 at 10:46 AM Maciej <mszymkiew...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have a few concerns regarding PySpark and SparkR images.
>>>>>>>>>>
>>>>>>>>>> First of all, how do we plan to handle interpreter versions?
>>>>>>>>>> Ideally, we should provide images for all supported variants, but 
>>>>>>>>>> based on
>>>>>>>>>> the preceding discussion and the proposed naming convention, I 
>>>>>>>>>> assume it is
>>>>>>>>>> not going to happen. If that's the case, it would be great if we 
>>>>>>>>>> could fix
>>>>>>>>>> interpreter versions based on some support criteria (lowest 
>>>>>>>>>> supported,
>>>>>>>>>> lowest non-deprecated, highest supported at the time of release, 
>>>>>>>>>> etc.)
>>>>>>>>>>
>>>>>>>>>> Currently, we use the following:
>>>>>>>>>>
>>>>>>>>>>    - for R we use the buster-cran35 Debian repositories, which install
>>>>>>>>>>    R 3.6 (the provided version already changed in the past and broke the
>>>>>>>>>>    image build ‒ SPARK-28606).
>>>>>>>>>>    - for Python we depend on the system-provided python3 packages,
>>>>>>>>>>    which currently provide Python 3.7.
>>>>>>>>>>
>>>>>>>>>> which don't guarantee stability over time and might be hard to
>>>>>>>>>> synchronize with our support matrix.
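>>>>>>>>>>
>>>>>>>>>> One way to make that explicit (a sketch only, not what the current
>>>>>>>>>> Dockerfiles do; package names assume a Debian buster base) would be to
>>>>>>>>>> turn the interpreter version into a build argument and verify it:
>>>>>>>>>>
>>>>>>>>>> ARG PYTHON_VERSION=3.7
>>>>>>>>>> RUN apt-get update && \
>>>>>>>>>>     apt-get install -y --no-install-recommends python${PYTHON_VERSION} python3-pip && \
>>>>>>>>>>     rm -rf /var/lib/apt/lists/* && \
>>>>>>>>>>     python3 --version | grep -q "Python ${PYTHON_VERSION}"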
>>>>>>>>>>
>>>>>>>>>> Secondly, omitting libraries which are required for the full
>>>>>>>>>> functionality and performance, specifically
>>>>>>>>>>
>>>>>>>>>>    - Numpy, Pandas and Arrow for PySpark
>>>>>>>>>>    - Arrow for SparkR
>>>>>>>>>>
>>>>>>>>>> is likely to severely limit the usability of the images (out of these,
>>>>>>>>>> Arrow is probably the hardest to manage, especially when you already
>>>>>>>>>> depend on system packages to provide the R or Python interpreter).
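>>>>>>>>>>
>>>>>>>>>> For the PySpark image that would boil down to something like this
>>>>>>>>>> (versions here are placeholders, to be pinned per the support matrix):
>>>>>>>>>>
>>>>>>>>>> RUN pip install --no-cache-dir numpy==1.21.2 pandas==1.3.2 pyarrow==4.0.1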
>>>>>>>>>>
>>>>>>>>>> On 8/14/21 12:43 AM, Mich Talebzadeh wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We can cater for multiple types (spark, spark-py and spark-r) and
>>>>>>>>>> spark versions (assuming they are downloaded and available).
>>>>>>>>>> The challenge is that these docker images, once built, are snapshots.
>>>>>>>>>> They cannot be amended later: if you change anything by going inside a
>>>>>>>>>> running container, the change is lost as soon as you start a new
>>>>>>>>>> container from the image.
>>>>>>>>>>
>>>>>>>>>> For example, I want to add tensorflow to my docker image. These
>>>>>>>>>> are my images
>>>>>>>>>>
>>>>>>>>>> REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
>>>>>>>>>> eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
>>>>>>>>>> eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB
>>>>>>>>>>
>>>>>>>>>> Using the image ID, I log in as root to a container created from the image:
>>>>>>>>>>
>>>>>>>>>> *docker run -u0 -it cfbb0e69f204 bash*
>>>>>>>>>>
>>>>>>>>>> root@b542b0f1483d:/opt/spark/work-dir# pip install keras
>>>>>>>>>> Collecting keras
>>>>>>>>>>   Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
>>>>>>>>>>      |████████████████████████████████| 1.3 MB 1.1 MB/s
>>>>>>>>>> Installing collected packages: keras
>>>>>>>>>> Successfully installed keras-2.6.0
>>>>>>>>>> WARNING: Running pip as the 'root' user can result in broken
>>>>>>>>>> permissions and conflicting behaviour with the system package 
>>>>>>>>>> manager. It
>>>>>>>>>> is recommended to use a virtual environment instead:
>>>>>>>>>> https://pip.pypa.io/warnings/venv
>>>>>>>>>> root@b542b0f1483d:/opt/spark/work-dir# pip list
>>>>>>>>>> Package       Version
>>>>>>>>>> ------------- -------
>>>>>>>>>> asn1crypto    0.24.0
>>>>>>>>>> cryptography  2.6.1
>>>>>>>>>> cx-Oracle     8.2.1
>>>>>>>>>> entrypoints   0.3
>>>>>>>>>> *keras         2.6.0      <--- it is here*
>>>>>>>>>> keyring       17.1.1
>>>>>>>>>> keyrings.alt  3.1.1
>>>>>>>>>> numpy         1.21.1
>>>>>>>>>> pip           21.2.3
>>>>>>>>>> py4j          0.10.9
>>>>>>>>>> pycrypto      2.6.1
>>>>>>>>>> PyGObject     3.30.4
>>>>>>>>>> pyspark       3.1.2
>>>>>>>>>> pyxdg         0.25
>>>>>>>>>> PyYAML        5.4.1
>>>>>>>>>> SecretStorage 2.3.1
>>>>>>>>>> setuptools    57.4.0
>>>>>>>>>> six           1.12.0
>>>>>>>>>> wheel         0.32.3
>>>>>>>>>> root@b542b0f1483d:/opt/spark/work-dir# exit
>>>>>>>>>>
>>>>>>>>>> Now I have exited, and I start a new container from the same image:
>>>>>>>>>>
>>>>>>>>>> (pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0 -it cfbb0e69f204 bash
>>>>>>>>>>
>>>>>>>>>> root@5231ee95aa83:/opt/spark/work-dir# pip list
>>>>>>>>>> Package       Version
>>>>>>>>>> ------------- -------
>>>>>>>>>> asn1crypto    0.24.0
>>>>>>>>>> cryptography  2.6.1
>>>>>>>>>> cx-Oracle     8.2.1
>>>>>>>>>> entrypoints   0.3
>>>>>>>>>> keyring       17.1.1
>>>>>>>>>> keyrings.alt  3.1.1
>>>>>>>>>> numpy         1.21.1
>>>>>>>>>> pip           21.2.3
>>>>>>>>>> py4j          0.10.9
>>>>>>>>>> pycrypto      2.6.1
>>>>>>>>>> PyGObject     3.30.4
>>>>>>>>>> pyspark       3.1.2
>>>>>>>>>> pyxdg         0.25
>>>>>>>>>> PyYAML        5.4.1
>>>>>>>>>> SecretStorage 2.3.1
>>>>>>>>>> setuptools    57.4.0
>>>>>>>>>> six           1.12.0
>>>>>>>>>> wheel         0.32.3
>>>>>>>>>>
>>>>>>>>>> *Hm, keras is not there*. The docker image cannot be altered after the
>>>>>>>>>> build; changes made inside a running container are not written back to
>>>>>>>>>> the image, so once the docker image is created it is just a snapshot.
>>>>>>>>>> However, it will still have tons of useful stuff for most
>>>>>>>>>> users/organisations. My suggestion is to create, for a given type
>>>>>>>>>> (spark, spark-py etc), the following flavours (a sketch of the ML one
>>>>>>>>>> follows the list):
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. One vanilla flavour for everyday use with few useful
>>>>>>>>>>    packages
>>>>>>>>>>    2. One for medium use with most common packages for ETL/ELT
>>>>>>>>>>    stuff
>>>>>>>>>>    3. One specialist for ML etc with keras, tensorflow and
>>>>>>>>>>    anything else needed
>>>>>>>>>>
>>>>>>>>>>
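>>>>>>>>>> As an illustration only (the exact package list and versions would need
>>>>>>>>>> to be agreed), the specialist ML flavour could extend the pip install
>>>>>>>>>> step in the Python bindings Dockerfile along these lines:
>>>>>>>>>>
>>>>>>>>>> RUN pip install --no-cache-dir pyyaml numpy pandas pyarrow tensorflow keras
>>>>>>>>>>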
>>>>>>>>>> These images should be maintained as we currently maintain Spark
>>>>>>>>>> releases, with accompanying documentation. Any reason why we cannot
>>>>>>>>>> maintain them ourselves?
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, 13 Aug 2021 at 17:26, Holden Karau <hol...@pigscanfly.ca>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> So we actually do have a script that does the build already; it's
>>>>>>>>>>> more a matter of publishing the results for easier use. Currently the
>>>>>>>>>>> script produces three images: spark, spark-py, and spark-r. I can
>>>>>>>>>>> certainly see a solid reason to publish with a jdk11 & jdk8 suffix as
>>>>>>>>>>> well if there is interest in the community. If we want to have, say, a
>>>>>>>>>>> spark-py-pandas image with everything necessary for the Koalas stuff
>>>>>>>>>>> to work, then I think that could be a great PR for someone to add :)
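>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch (the base tag here is hypothetical, since nothing is
>>>>>>>>>>> published yet), such an image could just layer the extra packages on
>>>>>>>>>>> top of spark-py:
>>>>>>>>>>>
>>>>>>>>>>> # hypothetical published tag
>>>>>>>>>>> FROM spark-py:3.3.0
>>>>>>>>>>> RUN pip install --no-cache-dir numpy pandas pyarrow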
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 13, 2021 at 1:00 AM Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> should read PySpark
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 13 Aug 2021 at 08:51, Mich Talebzadeh <
>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Agreed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have already built a few of the latest for Spark and PYSpark on
>>>>>>>>>>>>> 3.1.1 with Java 8, as I found out Java 11 does not work with the
>>>>>>>>>>>>> Google BigQuery data warehouse. However, how to hack the Dockerfile
>>>>>>>>>>>>> is something one finds out the hard way.
>>>>>>>>>>>>> data warehouse. However, to hack the Dockerfile one finds out the 
>>>>>>>>>>>>> hard way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For example, how to add additional Python libraries like tensorflow,
>>>>>>>>>>>>> etc. Shipping these libraries through Kubernetes is not practical, as
>>>>>>>>>>>>> unzipping and installing them through --py-files etc. will take
>>>>>>>>>>>>> considerable time, so they need to be added to the Dockerfile at
>>>>>>>>>>>>> build time, in the Python bindings directory under Kubernetes:
>>>>>>>>>>>>>
>>>>>>>>>>>>> /opt/spark/kubernetes/dockerfiles/spark/bindings/python
>>>>>>>>>>>>>
>>>>>>>>>>>>> RUN pip install pyyaml numpy cx_Oracle tensorflow ....
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, you will need curl to test the ports from inside the
>>>>>>>>>>>>> container:
>>>>>>>>>>>>>
>>>>>>>>>>>>> RUN apt-get update && apt-get install -y curl
>>>>>>>>>>>>> RUN ["apt-get","install","-y","vim"]
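>>>>>>>>>>>>>
>>>>>>>>>>>>> Combining those into a single layer and cleaning the apt lists
>>>>>>>>>>>>> afterwards keeps the image a little smaller, for example:
>>>>>>>>>>>>>
>>>>>>>>>>>>> RUN apt-get update && apt-get install -y --no-install-recommends curl vim && \
>>>>>>>>>>>>>     rm -rf /var/lib/apt/lists/*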
>>>>>>>>>>>>>
>>>>>>>>>>>>> As I said, I am happy to build these specific Dockerfiles plus the
>>>>>>>>>>>>> complete documentation for them. I have already built one for Google
>>>>>>>>>>>>> Cloud (GCP). The difference between the Spark and PySpark versions is
>>>>>>>>>>>>> that in Spark/Scala a fat jar file will contain everything needed.
>>>>>>>>>>>>> That is not the case with Python, I am afraid.
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 13 Aug 2021 at 08:13, Bode, Meikel, NMA-CFD <
>>>>>>>>>>>>> meikel.b...@bertelsmann.de> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am Meikel Bode, and only an interested reader of the dev and user
>>>>>>>>>>>>>> lists. Anyway, I would appreciate having official docker images
>>>>>>>>>>>>>> available.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe one could get inspiration from the Jupyter docker stacks and
>>>>>>>>>>>>>> provide a hierarchy of different images like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#image-relationships
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Having a core image supporting only Java, and extended ones
>>>>>>>>>>>>>> supporting Python and/or R, etc.
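>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One possible shape (names purely illustrative), analogous to the
>>>>>>>>>>>>>> Jupyter stacks:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> spark (JVM only)
>>>>>>>>>>>>>>  +- spark-py (adds a Python interpreter and PySpark deps)
>>>>>>>>>>>>>>  |   +- spark-py-ml (adds numpy/pandas/pyarrow, tensorflow, etc.)
>>>>>>>>>>>>>>  +- spark-r (adds R and Arrow)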
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Looking forward to the discussion.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Meikel
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>>>>>>>> *Sent:* Freitag, 13. August 2021 08:45
>>>>>>>>>>>>>> *Cc:* dev <dev@spark.apache.org>
>>>>>>>>>>>>>> *Subject:* Re: Time to start publishing Spark Docker Images?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I concur this is a good idea and certainly worth exploring.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In practice, preparing deployable docker images will throw up some
>>>>>>>>>>>>>> challenges, because a docker image for Spark is not really a
>>>>>>>>>>>>>> singular modular unit in the way that, say, a docker image for
>>>>>>>>>>>>>> Jenkins is. It involves different versions and different images for
>>>>>>>>>>>>>> Spark and PySpark, and will most likely end up as part of a
>>>>>>>>>>>>>> Kubernetes deployment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Individuals and organisations will deploy it as a first cut. Great,
>>>>>>>>>>>>>> but I equally feel that good documentation on how to build a
>>>>>>>>>>>>>> consumable, deployable image will be more valuable. From my own
>>>>>>>>>>>>>> experience, the current documentation should be enhanced, for
>>>>>>>>>>>>>> example on how to deploy working directories, add additional Python
>>>>>>>>>>>>>> packages, and build with different Java versions (version 8 or
>>>>>>>>>>>>>> version 11), etc.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 13 Aug 2021 at 01:54, Holden Karau <
>>>>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Awesome, I've filed an INFRA ticket to get the ball rolling.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 12, 2021 at 5:48 PM John Zhuge <jzh...@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Aug 12, 2021 at 5:44 PM Hyukjin Kwon <
>>>>>>>>>>>>>> gurwls...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1, I think we generally agreed upon having it. Thanks Holden for
>>>>>>>>>>>>>> the heads-up and for driving this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +@Dongjoon Hyun <dongj...@apache.org> FYI
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 22 Jul 2021 at 12:22, Kent Yao <yaooq...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Kent Yao*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> @ Data Science Center, Hangzhou Research Institute, NetEase
>>>>>>>>>>>>>> Corp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *a spark* *enthusiast*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>> It's dark in this basement.
>>>>
>>>
>>>
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
