[jira] [Updated] (SPARK-47540) SPIP: Pure Python Package (Spark Connect)

Hyukjin Kwon (Jira) Sun, 31 Mar 2024 23:43:42 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-47540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-47540:
---------------------------------
    Description: 
*Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.*

As part of Spark Connect development, we have introduced Scala and Python 
clients. The Scala client is already provided as a separate library and is 
available in Maven, but the Python client is not. This proposal aims to let 
end users install a pure Python package for Spark Connect by running pip 
install pyspark-connect.

The pure Python package contains only Python source code without jars, which 
reduces the size of the package significantly and widens the use cases of 
PySpark. See also 'Introducing Spark Connect - The Power of Apache Spark, 
Everywhere'.

*Q2. What problem is this proposal NOT designed to solve?*

This proposal does not aim to:
* Change the existing PySpark package; pip install pyspark is not affected.
* Implement full compatibility with classic PySpark, e.g., the RDD API.
* Address how to launch the Spark Connect server; users launch it themselves.
* Support local mode; without a running Spark Connect server, users cannot 
use this package.
* Change the official release channel; only PyPI is affected.

*Q3. How is it done today, and what are the limits of current practice?*

Currently, we run pip install pyspark, and the package is over 300MB because 
of its dependent jars. In addition, PySpark requires additional environment 
setup, such as a JDK installation.
This is not suitable when the running environment and its resources are 
limited, for example on edge devices such as smart home devices.
Requiring a non-Python environment is not Python friendly.

*Q4. What is new in your approach and why do you think it will be successful?*

It provides a pure Python library, which eliminates other environment 
requirements such as a JDK, reduces resource usage by decoupling the client 
from the Spark Driver, and reduces the package size.
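
As a rough illustration of the client side under this approach, the sketch 
below assumes a Spark Connect server is already running and reachable (this 
proposal does not cover launching it). The helper names connect_url and 
create_session, the host, and the port 15002 are assumptions made for 
illustration only; SparkSession.builder.remote is the existing Spark Connect 
entry point in PySpark.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port."""
    return f"sc://{host}:{port}"


def create_session(url: str):
    """Open a Spark Connect session against an already running server."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    # With a remote URL, the Spark Driver lives on the server side,
    # so the client process does not need a local JDK.
    return SparkSession.builder.remote(url).getOrCreate()


# Usage, assuming a server is listening on localhost:15002:
#   spark = create_session(connect_url("localhost"))
#   spark.range(3).show()
```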

*Q5. Who cares? If you are successful, what difference will it make?*

This matters to users who want to leverage Spark in a limited environment, 
and to those who want to decouple the JVM that runs the Spark Driver from 
their application so that Spark runs as a service. They can simply run pip 
install pyspark-connect, which requires no dependencies other than Python 
ones, just like other Python libraries.

*Q6. What are the risks?*

Because we do not change the existing PySpark package, I do not see any major 
risk to classic PySpark itself. We will reuse the same Python source, so we 
must make sure that no Py4J is used and no JVM access is made in the pure 
Python package. This requirement might confuse developers. At the very least, 
we should add a dedicated CI job to make sure the pure Python package works.
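
One possible way to enforce the "no Py4J" requirement in CI is a static scan 
of the Python source. The sketch below is only an assumption about how such a 
check could look, not the actual CI job; the function name find_py4j_imports 
is hypothetical. It flags every import of py4j, including ones nested inside 
functions, so a real check would likely need an allowlist for modules that 
belong only to classic PySpark.

```python
import ast
from typing import List


def find_py4j_imports(source: str) -> List[int]:
    """Return the line numbers of statements that import py4j."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x" style imports.
            names = [node.module or ""]
        else:
            continue
        if any(n == "py4j" or n.startswith("py4j.") for n in names):
            hits.append(node.lineno)
    return sorted(hits)
```

A CI step could run this over each file shipped in the pure Python package 
and fail when the returned list is non-empty.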

*Q7. How long will it take?*

I expect around one month, including CI setup. In fact, the prototype is 
ready, so I expect this to be done sooner.

*Q8. What are the mid-term and final “exams” to check for success?*

The mid-term goal is to set up a scheduled CI job that builds the pure Python 
library and runs all the tests against it.
The final goal is to properly test the end-to-end use case, starting from pip 
installation.
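
The end-to-end check from pip installation could be sketched roughly as 
follows. Everything here is an assumption for illustration: the helper names, 
the use of a fresh virtual environment, and the smoke script itself. It also 
assumes that pyspark-connect is installable from the configured package index 
and that a Spark Connect server is reachable at the given URL.

```python
import os
import subprocess
import sys
from typing import List

# One-line script executed inside the fresh environment (hypothetical check).
SMOKE_SCRIPT = (
    "from pyspark.sql import SparkSession; "
    "spark = SparkSession.builder.remote('{url}').getOrCreate(); "
    "assert spark.range(3).count() == 3"
)


def venv_python(venv_dir: str) -> str:
    """Return the path of the interpreter inside a virtual environment."""
    sub = "Scripts" if os.name == "nt" else "bin"
    exe = "python.exe" if os.name == "nt" else "python"
    return os.path.join(venv_dir, sub, exe)


def smoke_test_commands(
    venv_dir: str, url: str, package: str = "pyspark-connect"
) -> List[List[str]]:
    """Build the commands: create the venv, pip install, run the smoke script."""
    py = venv_python(venv_dir)
    return [
        [sys.executable, "-m", "venv", venv_dir],
        [py, "-m", "pip", "install", package],
        [py, "-c", SMOKE_SCRIPT.format(url=url)],
    ]


def run_smoke_test(venv_dir: str, url: str) -> None:
    """Run each step in order; raises CalledProcessError on the first failure."""
    for cmd in smoke_test_commands(venv_dir, url):
        subprocess.run(cmd, check=True)
```

Using a fresh virtual environment keeps an existing pyspark installation or 
JDK on the CI machine from masking a missing dependency.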


> SPIP: Pure Python Package (Spark Connect)
> -----------------------------------------
>
>                 Key: SPARK-47540
>                 URL: https://issues.apache.org/jira/browse/SPARK-47540
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect, PySpark
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Critical



