HyukjinKwon opened a new pull request #30735:
URL: https://github.com/apache/spark/pull/30735
### What changes were proposed in this pull request?
This PR proposes:
- Remove `spark.kubernetes.pyspark.pythonVersion` and use `PYSPARK_PYTHON`
with `PYSPARK_DRIVER_PYTHON` just like other cluster types in Spark.
Kubernetes support is still experimental and has not reached GA yet, so
removing this configuration should be fine. Currently,
`spark.kubernetes.pyspark.pythonVersion` cannot take any value other than
`3` anyway.
- In order for `PYSPARK_PYTHON` to be consistently used, fix
`spark.archives` option to unpack into the current working directory in cluster
mode's driver. This behaviour is identical to YARN cluster mode. By doing
this, users can leverage Conda or virtualenv in cluster mode as below:
```bash
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
- Remove unused or dead code such as `extractS3Key` and
`renameResourcesToLocalFS`.
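To illustrate the `--archives` behaviour the second bullet relies on, here is a minimal sketch (not Spark's actual implementation) of the `#alias` fragment semantics: `pyspark_conda_env.tar.gz#environment` is unpacked into a directory named `environment` under the current working directory, which is why `PYSPARK_PYTHON=./environment/bin/python` resolves inside the pod.

```python
# Sketch of the "archive#alias" unpacking convention used by --archives /
# spark.archives. The function name and signature are illustrative only.
import os
import tarfile


def unpack_archive(spec: str, dest_dir: str) -> str:
    """Unpack 'path.tar.gz#alias' into dest_dir/alias.

    When no '#alias' fragment is given, the archive's file name is used
    as the target directory name.
    """
    path, _, alias = spec.partition("#")
    target = os.path.join(dest_dir, alias or os.path.basename(path))
    with tarfile.open(path) as tf:
        tf.extractall(target)
    return target
```

With this convention, the driver's working directory contains `./environment/bin/python` after unpacking, matching the `spark-submit` example above.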
### Why are the changes needed?
- To provide consistent support of PySpark by using `PYSPARK_PYTHON` and
`PYSPARK_DRIVER_PYTHON`.
- To provide Conda and virtualenv support via the `spark.archives` option.
### Does this PR introduce _any_ user-facing change?
Yes:
- `spark.kubernetes.pyspark.pythonVersion` is removed.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are used instead.
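For clarity, the resolution order of these two environment variables can be sketched as follows (an illustrative helper, not Spark's actual code): `PYSPARK_DRIVER_PYTHON` overrides `PYSPARK_PYTHON` on the driver, executors use `PYSPARK_PYTHON`, and both fall back to `python3`.

```python
# Illustrative sketch of the documented precedence between
# PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON; the function name is hypothetical.
def resolve_python(env: dict, is_driver: bool) -> str:
    # Driver-specific override takes precedence on the driver side.
    if is_driver and env.get("PYSPARK_DRIVER_PYTHON"):
        return env["PYSPARK_DRIVER_PYTHON"]
    # Executors (and the driver, absent an override) use PYSPARK_PYTHON,
    # defaulting to "python3".
    return env.get("PYSPARK_PYTHON", "python3")
```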
### How was this patch tested?
Manually tested via:
```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz --service-account spark --namespace spark-integration-test
```