Willi Raschkowski created SPARK-44767:
-----------------------------------------
Summary: Plugin API for PySpark and SparkR subprocesses
Key: SPARK-44767
URL: https://issues.apache.org/jira/browse/SPARK-44767
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Willi Raschkowski
An API to customize Python and R worker subprocesses would allow extensibility
beyond what can be expressed via static configs and environment variables such
as {{spark.pyspark.python}}.
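To make the idea concrete, here is a minimal sketch of what such a plugin hook
could look like. The trait name, method signature, and types are illustrative
assumptions, not an existing Spark API:

{code:scala}
// Hypothetical sketch only: neither this trait nor any registration
// mechanism for it exists in Spark today; names are illustrative.
trait WorkerProcessPlugin {
  // Called just before a Python or R worker subprocess is launched.
  // Implementations may rewrite the command line and the environment.
  def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String])
}
{code}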
A use case we had for this is overriding {{PATH}} when using {{spark.archives}}
with, say, conda-pack (as documented
[here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
Some packages rely on bundled binaries, and to use those packages in Spark,
their binaries need to be on the {{PATH}}.
But we can't set the {{PATH}} via a static config because 1) the environment
with its binaries may be at a dynamic location (archives are unpacked on the
driver [into a directory with a random
name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
and 2) we may not want to override the {{PATH}} that's pre-configured on the
hosts.
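As a sketch of how a plugin could address both points, using the hypothetical
trait above: the plugin resolves the archive's unpack location at runtime via
{{SparkFiles.getRootDirectory()}} (an existing Spark API) and prepends the
environment's {{bin}} directory rather than replacing the host's {{PATH}}. The
{{environment}} directory name is an assumption matching the conda-pack example
in the docs:

{code:scala}
import org.apache.spark.SparkFiles

// Sketch only; WorkerProcessPlugin is the hypothetical trait above.
class CondaPathPlugin extends WorkerProcessPlugin {
  override def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String]) = {
    // Resolve the randomly-named unpack directory at runtime.
    val condaBin = SparkFiles.getRootDirectory() + "/environment/bin"
    val hostPath = env.getOrElse("PATH", sys.env.getOrElse("PATH", ""))
    // Prepend instead of replacing, so the PATH pre-configured on the
    // hosts stays intact.
    val path = if (hostPath.isEmpty) condaBin else condaBin + ":" + hostPath
    (command, env + ("PATH" -> path))
  }
}
{code}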
Other use cases this unlocks include dynamically overriding the worker
executable (e.g., to select a Python version, sketched below) and forking or
redirecting the worker's output stream.
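For instance, with the same hypothetical hook, selecting the executable could
be as simple as rewriting the first element of the worker command (assuming, as
sketched, that it holds the executable):

{code:scala}
// Sketch: pin the worker to a specific Python version. Assumes the
// first element of the command is the executable to launch.
class PinnedPythonPlugin extends WorkerProcessPlugin {
  override def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String]) =
    ("python3.10" +: command.tail, env)
}
{code}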