Some people asked me whether it was possible to create a Docker image (Spark 3.1.3) with Python packages geared towards data science, with the following packages pre-built:
PyYAML, TensorFlow, Theano, Pandas, Keras, NumPy, SciPy, Scrapy, scikit-learn, XGBoost, Matplotlib, Seaborn, Bokeh, Plotly, pydot, Statsmodels

OK, I built the image and pushed it to the Docker repository. It is called spark-py-pthonpackages-3.1.3-scala_2.12-11-jre-slim-buster
<https://hub.docker.com/layers/michtalebzadeh/spark_dockerfiles/spark-py-pthonpackages-3.1.3-scala_2.12-11-jre-slim-buster/images/sha256-1a76a9279e9dbaeb9c554fba601b85ecd76cbf3956c81b94eb4552c2d1435366?context=repo>

It is 1.3 GB, compared to 432.79 MB for the normal spark-py image, and you can download it from

https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated

These are the packages installed inside this Docker image:

docker run -u 0 -it 7621929f9c97 bash
root@bb71cb7a89de:/opt/spark/work-dir# pip list
Package                      Version
---------------------------- -------------------
absl-py                      1.0.0
astunparse                   1.6.3
attrs                        21.4.0
Automat                      20.2.0
bokeh                        2.4.2
cachetools                   5.0.0
certifi                      2021.10.8
cffi                         1.15.0
charset-normalizer           2.0.12
constantly                   15.1.0
cryptography                 36.0.1
cssselect                    1.1.0
cycler                       0.11.0
flatbuffers                  2.0
fonttools                    4.29.1
gast                         0.5.3
google-auth                  2.6.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.44.0
h2                           3.2.0
h5py                         3.6.0
hpack                        3.0.0
hyperframe                   5.2.0
hyperlink                    21.0.0
idna                         3.3
importlib-metadata           4.11.1
incremental                  21.3.0
itemadapter                  0.4.0
itemloaders                  1.0.4
Jinja2                       3.0.3
jmespath                     0.10.0
joblib                       1.1.0
keras                        2.8.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.3.2
libclang                     13.0.0
lxml                         4.8.0
Markdown                     3.3.6
MarkupSafe                   2.1.0
matplotlib                   3.5.1
numpy                        1.22.2
oauthlib                     3.2.0
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.4.1
parsel                       1.6.0
patsy                        0.5.2
Pillow                       9.0.1
pip                          22.0.3
plotly                       5.6.0
priority                     1.3.0
Protego                      0.2.1
protobuf                     3.19.4
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pycparser                    2.21
PyDispatcher                 2.0.5
pydot                        1.4.2
pyOpenSSL                    22.0.0
pyparsing                    3.0.7
python-dateutil              2.8.2
pytz                         2021.3
PyYAML                       6.0
queuelib                     1.6.2
requests                     2.27.1
requests-oauthlib            1.3.1
rsa                          4.8
scikit-learn                 1.0.2
scipy                        1.8.0
Scrapy                       2.5.1
seaborn                      0.11.2
service-identity             21.1.0
setuptools                   60.9.3
six                          1.16.0
statsmodels                  0.13.2
tenacity                     8.0.1
tensorboard                  2.8.0
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.8.0
tensorflow-io-gcs-filesystem 0.24.0
termcolor                    1.1.0
tf-estimator-nightly         2.8.0.dev2021122109
Theano                       1.0.5
threadpoolctl                3.1.0
tornado                      6.1
Twisted                      22.1.0
typing_extensions            4.1.1
urllib3                      1.26.8
w3lib                        1.22.0
Werkzeug                     2.0.3
wheel                        0.34.2
wrapt                        1.13.3
xgboost                      1.5.2
zipp                         3.7.0
zope.interface               5.4.0

Let me know how it works for you.

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
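
P.S. If you want a quicker check than reading through the full pip list, here is a minimal sketch that reports only the packages that were asked for. It is an illustration, not something shipped in the image: the REQUESTED list and the installed_versions helper are names made up for this example, and it assumes the image's Python is 3.8 or later so that importlib.metadata is available.

```python
from importlib import metadata

# Distribution names of the packages that were requested for the image
# (spelled as they appear in the pip list output).
REQUESTED = [
    "PyYAML", "tensorflow", "Theano", "pandas", "keras", "numpy",
    "scipy", "Scrapy", "scikit-learn", "xgboost", "matplotlib",
    "seaborn", "bokeh", "plotly", "pydot", "statsmodels",
]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per requested package, flagging any that are not installed.
    for name, version in installed_versions(REQUESTED).items():
        print(f"{name:<15} {version or 'MISSING'}")
```

One way to use it would be to save it under a name of your choice (say check_packages.py, purely as an example) and run it with python3 from the shell inside the container. Any package reported as MISSING is not installed in that image.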