Willi Raschkowski created SPARK-44767:
-----------------------------------------
Summary: Plugin API for PySpark and SparkR subprocesses
Key: SPARK-44767
URL: https://issues.apache.org/jira/browse/SPARK-44767
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.4.1
Reporter: Willi Raschkowski
An API to customize Python and R worker subprocesses would allow extensibility
beyond what can be expressed via static configs and environment variables such
as {{spark.pyspark.python}}.
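To make the idea concrete, here is a minimal sketch of what such a plugin hook
could look like. The trait name, method signature, and types are illustrative
assumptions, not an existing Spark API:

{code:scala}
// Hypothetical sketch only: neither this trait nor any registration
// mechanism for it exists in Spark today; names are illustrative.
trait WorkerProcessPlugin {
  // Called just before a Python or R worker subprocess is launched.
  // Implementations may rewrite the command line and the environment.
  def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String])
}
{code}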
A use case we had for this is overriding {{PATH}} when using {{spark.archives}}
with, say, conda-pack (as documented
[here|https://spark.apache.org/docs/3.1.1/api/python/user_guide/python_packaging.html#using-conda]).
Some packages rely on bundled binaries, and to use those packages in Spark,
their binaries need to be on the {{PATH}}.
But we can't set the {{PATH}} via a static config because 1) the environment
with its binaries may be at a dynamic location (archives are unpacked on the
driver [into a directory with a random
name|https://github.com/apache/spark/blob/5db87787d5cc1cefb51ec77e49bac7afaa46d300/core/src/main/scala/org/apache/spark/SparkFiles.scala#L33-L37]),
and 2) we may not want to override the {{PATH}} that's pre-configured on the
hosts.
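As a sketch of how a plugin could address both points, using the hypothetical
trait above: the plugin resolves the archive's unpack location at runtime via
{{SparkFiles.getRootDirectory()}} (an existing Spark API) and prepends the
environment's {{bin}} directory rather than replacing the host's {{PATH}}. The
{{environment}} directory name is an assumption matching the conda-pack example
in the docs:

{code:scala}
import org.apache.spark.SparkFiles

// Sketch only; WorkerProcessPlugin is the hypothetical trait above.
class CondaPathPlugin extends WorkerProcessPlugin {
  override def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String]) = {
    // Resolve the randomly-named unpack directory at runtime.
    val condaBin = SparkFiles.getRootDirectory() + "/environment/bin"
    val hostPath = env.getOrElse("PATH", sys.env.getOrElse("PATH", ""))
    // Prepend instead of replacing, so the PATH pre-configured on the
    // hosts stays intact.
    val path = if (hostPath.isEmpty) condaBin else condaBin + ":" + hostPath
    (command, env + ("PATH" -> path))
  }
}
{code}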
Other use cases this unlocks include dynamically overriding the worker
executable (e.g., to select a Python version, sketched below) and forking or
redirecting the worker's output stream.
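For instance, with the same hypothetical hook, selecting the executable could
be as simple as rewriting the first element of the worker command (assuming, as
sketched, that it holds the executable):

{code:scala}
// Sketch: pin the worker to a specific Python version. Assumes the
// first element of the command is the executable to launch.
class PinnedPythonPlugin extends WorkerProcessPlugin {
  override def customize(
      command: Seq[String],
      env: Map[String, String]): (Seq[String], Map[String, String]) =
    ("python3.10" +: command.tail, env)
}
{code}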