Hi,
I am aware that some fellow members of this dev group were involved in creating scripts for running Spark on Kubernetes:
# To build additional PySpark docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
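For completeness, I believe the matching push step looks like this (same <repo> and tag placeholders; just a sketch from memory, only needed if the image goes to a registry rather than straight into minikube):

# Hedged sketch: push the images built above to the registry
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag push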
The problem, as I have explained, is being able to unpack and use packages like yaml and pandas inside k8s.
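For reference, such a venv archive is typically produced along these lines (a hedged sketch; venv-pack and the package list are my assumptions, not necessarily how pyspark_venv.tar.gz was actually built):

# Hedged sketch of the usual venv-pack workflow
$ python3 -m venv pyspark_venv
$ source pyspark_venv/bin/activate
$ pip install pyyaml pandas venv-pack
$ venv-pack -o pyspark_venv.tar.gz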
I am using
spark-submit --verbose \
  --master k8s://$K8S_SERVER \
  --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz \
  --deploy-mode cluster \
  --name pytest \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=500m \
  --conf spark.kubernetes.container.image=${IMAGE} \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
  --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
  hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}
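From the Spark docs my understanding is that, on top of the command above, the archive normally gets a #alias and the venv's interpreter has to be selected explicitly; something along these lines (a hedged sketch, assuming the archive was built with venv-pack and therefore contains bin/python):

# Hedged sketch: give the archive an alias, then point PySpark at the interpreter
# inside the unpacked venv (both the alias and the path are my assumptions)
  --archives hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz#pyspark_venv \
  --conf spark.pyspark.python=./pyspark_venv/bin/python \

Otherwise, as far as I can tell, the driver and executors keep using the image's system Python, which (per the pip3 list output further down) only has the packages baked into the image.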
The directory containing the code is zipped as DSBQ.zip and Spark reads it OK. However, in verbose mode it says:
2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz
In this case the application tries to use pandas. The module ${APPLICATION} has this code:
import sys
import os
import pkgutil
import pkg_resources

def main():
    print("\n printing sys.path")
    for p in sys.path:
        print(p)
    user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
    print("\n Printing user_paths")
    for p in user_paths:
        print(p)
    v = sys.version
    print("\n python version")
    print(v)
    print("\nlooping over pkg_resources.working_set")
    for r in pkg_resources.working_set:
        print(r)
    import pandas

if __name__ == "__main__":
    main()
The output is shown below
Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv.tar.gz
printing sys.path
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages
Printing user_paths
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
python version
3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0]
looping over pkg_resources.working_set
setuptools 57.2.0
pip 21.1.3
wheel 0.32.3
six 1.12.0
SecretStorage 2.3.1
pyxdg 0.25
PyGObject 3.30.4
pycrypto 2.6.1
keyrings.alt 3.1.1
keyring 17.1.1
entrypoints 0.3
cryptography 2.6.1
asn1crypto 0.24.0
Traceback (most recent call last):
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 24, in <module>
    main()
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py", line 21, in main
    import pandas
ModuleNotFoundError: No module named 'pandas'
Adding that, if I go inside the docker container and run
185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
Package Version
------------- -------
asn1crypto 0.24.0
cryptography 2.6.1
entrypoints 0.3
keyring 17.1.1
keyrings.alt 3.1.1
pip 21.1.3
pycrypto 2.6.1
PyGObject 3.30.4
pyxdg 0.25
SecretStorage 2.3.1
setuptools 57.2.0
six 1.12.0
wheel 0.32.3
I don't get any external packages!
I opened a SO thread for this as well:
https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes
Do I need to hack the Dockerfile to install requirements.txt etc.?
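If that turns out to be the way to go, my rough idea (a sketch, untested; the file name, tag and package list are my own choices) is to derive an image from the one built above and bake the packages into it:

# Hedged sketch (untested): derive a new image with the Python packages installed
$ cat > Dockerfile.pytest <<'EOF'
FROM pytest-repo/spark-py:3.1.1
USER root
RUN pip3 install pyyaml pandas
USER 185
EOF
$ docker build -t pytest-repo/spark-py:3.1.1-pkgs -f Dockerfile.pytest .

and then point spark.kubernetes.container.image at the new tag.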
Thanks
---------- Forwarded message ---------
From: Mich Talebzadeh <[email protected]>
Date: Tue, 20 Jul 2021 at 22:51
Subject: Unpacking and using external modules with PySpark inside k8s
To: user @spark <[email protected]>
I have been struggling with this.
Kubernetes (not that it matters, it is minikube) is working fine. In one of the modules, called configure.py, I am importing the yaml module:
import yaml
This is throwing errors
import yaml
ModuleNotFoundError: No module named 'yaml'
I have been through a number of loops.
First I created a virtual environment, pyspark_venv.tar.gz, that includes the yaml module, and passed it to spark-submit as follows:
+ spark-submit --verbose --master k8s://192.168.49.2:8443
  '--archives=hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv'
  --deploy-mode cluster --name pytest --conf 'spark.kubernetes.namespace=spark'
  --conf 'spark.executor.instances=1' --conf 'spark.kubernetes.driver.limit.cores=1'
  --conf 'spark.executor.cores=1' --conf 'spark.executor.memory=500m'
  --conf 'spark.kubernetes.container.image=pytest-repo/spark-py:3.1.1'
  --conf 'spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount'
  --py-files hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
  hdfs://50.140.197.220:9000/minikube/codes/testyml.py
Parsed arguments:
master k8s://192.168.49.2:8443
deployMode cluster
executorMemory 500m
executorCores 1
totalExecutorCores null
propertiesFile /opt/spark/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath $SPARK_HOME/jars/*.jar
driverExtraLibraryPath null
driverExtraJavaOptions null
supervise false
queue null
numExecutors 1
files null
pyFiles hdfs://50.140.197.220:9000/minikube/codes/DSBQ.zip
archives hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv
mainClass null
primaryResource hdfs://50.140.197.220:9000/minikube/codes/testyml.py
name pytest
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true
Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz#pyspark_venv from /tmp/spark-d339a76e-090c-4670-89aa-da723d6e9fbc/pyspark_venv.tar.gz to /opt/spark/work-dir/./pyspark_venv
printing sys.path
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc
/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages
Printing user_paths
['/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/DSBQ.zip',
'/opt/spark/python/lib/pyspark.zip',
'/opt/spark/python/lib/py4j-0.10.9-src.zip',
'/opt/spark/jars/spark-core_2.12-3.1.1.jar']
checking yaml
Traceback (most recent call last):
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 18, in <module>
    main()
  File "/tmp/spark-20050bca-7eb2-4b06-9bc3-42dce97118fc/testyml.py", line 15, in main
    import yaml
ModuleNotFoundError: No module named 'yaml'
Well, it does not matter whether it is yaml or numpy; it simply cannot find the modules. How can I find out whether the gz file was unpacked OK?
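One way I can think of to check (a hedged sketch; the namespace and pod name pattern are assumptions based on the submit above):

# Hedged sketch: inspect the archive itself and the driver's working directory
$ tar tzf pyspark_venv.tar.gz | head            # does it contain bin/python, lib/python3.7/... ?
$ kubectl get pods -n spark                     # find the driver pod (e.g. pytest-...-driver)
$ kubectl exec -n spark <driver-pod> -- ls -l /opt/spark/work-dir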
Thanks