HyukjinKwon opened a new pull request #30735:
URL: https://github.com/apache/spark/pull/30735
### What changes were proposed in this pull request?
This PR proposes:
- Remove `spark.kubernetes.pyspark.pythonVersion` and use `PYSPARK_PYTHON`
with `PYSPARK_DRIVER_PYTHON` just like other cluster types in Spark.
Kubernetes support is still experimental and has not reached GA yet, so
removing this configuration should be fine. Currently,
`spark.kubernetes.pyspark.pythonVersion` cannot take any value other than
`3` anyway.
- In order for `PYSPARK_PYTHON` to be consistently used, fix
`spark.archives` option to unpack into the current working directory in cluster
mode's driver. This behaviour is identical to YARN cluster mode. By doing
this, users can leverage Conda or virtualenv in cluster mode as below:
```bash
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
- Remove unused or dead code such as `extractS3Key` and
`renameResourcesToLocalFS`.
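To illustrate the `--archives` behaviour the second bullet relies on, here is a minimal sketch (not Spark's actual implementation) of the `#alias` fragment semantics: `pyspark_conda_env.tar.gz#environment` is unpacked into a directory named `environment` under the current working directory, which is why `PYSPARK_PYTHON=./environment/bin/python` resolves inside the pod.

```python
# Sketch of the "archive#alias" unpacking convention used by --archives /
# spark.archives. The function name and signature are illustrative only.
import os
import tarfile


def unpack_archive(spec: str, dest_dir: str) -> str:
    """Unpack 'path.tar.gz#alias' into dest_dir/alias.

    When no '#alias' fragment is given, the archive's file name is used
    as the target directory name.
    """
    path, _, alias = spec.partition("#")
    target = os.path.join(dest_dir, alias or os.path.basename(path))
    with tarfile.open(path) as tf:
        tf.extractall(target)
    return target
```

With this convention, the driver's working directory contains `./environment/bin/python` after unpacking, matching the `spark-submit` example above.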
### Why are the changes needed?
- To provide consistent support of PySpark by using `PYSPARK_PYTHON` and
`PYSPARK_DRIVER_PYTHON`.
- To provide Conda and virtualenv support via the `spark.archives` option.
### Does this PR introduce _any_ user-facing change?
Yes:
- `spark.kubernetes.pyspark.pythonVersion` is removed.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` are used instead.
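For clarity, the resolution order of these two environment variables can be sketched as follows (an illustrative helper, not Spark's actual code): `PYSPARK_DRIVER_PYTHON` overrides `PYSPARK_PYTHON` on the driver, executors use `PYSPARK_PYTHON`, and both fall back to `python3`.

```python
# Illustrative sketch of the documented precedence between
# PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON; the function name is hypothetical.
def resolve_python(env: dict, is_driver: bool) -> str:
    # Driver-specific override takes precedence on the driver side.
    if is_driver and env.get("PYSPARK_DRIVER_PYTHON"):
        return env["PYSPARK_DRIVER_PYTHON"]
    # Executors (and the driver, absent an override) use PYSPARK_PYTHON,
    # defaulting to "python3".
    return env.get("PYSPARK_PYTHON", "python3")
```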
### How was this patch tested?
Manually tested via:
```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz --service-account spark --namespace spark-integration-test
```