[
https://issues.apache.org/jira/browse/SPARK-47540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-47540.
----------------------------------
Fix Version/s: 4.0.0
Assignee: Hyukjin Kwon
Resolution: Done
> SPIP: Pure Python Package (Spark Connect)
> -----------------------------------------
>
> Key: SPARK-47540
> URL: https://issues.apache.org/jira/browse/SPARK-47540
> Project: Spark
> Issue Type: Umbrella
> Components: Connect, PySpark
> Affects Versions: 4.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Critical
> Fix For: 4.0.0
>
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> As part of the [Spark
> Connect|https://spark.apache.org/docs/latest/spark-connect-overview.html]
> development, we have introduced Scala and Python clients. While the Scala
> client is already provided as a separate library and is available in Maven,
> the Python client is not. This proposal aims for end users to install the
> pure Python package for Spark Connect by using pip install pyspark-connect.
> The pure Python package contains only Python source code without jars, which
> reduces the size of the package significantly and widens the use cases of
> PySpark. See also [Introducing Spark Connect - The Power of Apache Spark,
> Everywhere|https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html].
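> The workflow above can be sketched as follows. This is a hypothetical sketch: it assumes the proposed pyspark-connect package name from this SPIP, and a Spark Connect server already running on the default port 15002.
>
> ```shell
> # Install the proposed pure Python package (no bundled jars, no JDK needed).
> pip install pyspark-connect
>
> # Connect to an already-running Spark Connect server via a remote URL.
> python -c "
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.remote('sc://localhost:15002').getOrCreate()
> spark.range(5).show()
> "
> ```
>
> The client talks to the server over gRPC, so no JVM runs on the client side.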
> *Q2. What problem is this proposal NOT designed to solve?*
> This proposal does not aim to:
> - Change the existing PySpark package, e.g., pip install pyspark is not
> affected.
> - Implement full compatibility with classic PySpark, e.g., implementing the
> RDD API.
> - Address how to launch the Spark Connect server; users launch the server
> themselves.
> - Provide a local mode. Without launching a Spark Connect server, users
> cannot use this package.
> - Change the [official release channel|https://spark.apache.org/downloads.html];
> only PyPI is affected.
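> Since launching the server is out of scope, for reference a server can be started from a full Spark distribution with the bundled start script; the Spark/Scala versions shown are assumptions and should match the installed distribution:
>
> ```shell
> # From a full Spark distribution (not the pure Python package proposed here),
> # start a Spark Connect server listening on the default port 15002.
> ./sbin/start-connect-server.sh \
>   --packages org.apache.spark:spark-connect_2.12:3.5.1
> ```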
> *Q3. How is it done today, and what are the limits of current practice?*
> Currently, we run pip install pyspark, and the package is over 300MB because
> of the bundled jars. In addition, PySpark requires you to set up a non-Python
> environment, such as a JDK installation.
> This is not suitable when the running environment and resources are limited,
> for example on edge devices such as smart home devices.
> Requiring a non-Python environment is also not Python-friendly.
> *Q4. What is new in your approach and why do you think it will be successful?*
> It provides a pure Python library, which eliminates environment requirements
> such as the JDK, reduces resource usage by decoupling the Spark Driver, and
> shrinks the package size.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Users who want to leverage Spark in a limited environment, or who want to
> decouple the JVM-based Spark Driver from their application to run Spark as a
> service. They can simply pip install pyspark-connect, which requires no
> dependencies other than Python ones, just like other Python libraries.
> *Q6. What are the risks?*
> Because we do not change the existing PySpark package, I do not see any major
> risk to classic PySpark itself. We will reuse the same Python source, and
> therefore we must make sure no Py4J is used and no JVM access is made.
> This requirement might confuse developers. At the very least, we should
> add a dedicated CI job to make sure the pure Python package works.
> *Q7. How long will it take?*
> I expect it to take around one month, including CI setup. In fact, the
> prototype is already ready, so I expect this to be done sooner.
> *Q8. What are the mid-term and final “exams” to check for success?*
> The mid-term goal is to set up a scheduled CI job that builds the pure Python
> library and runs all the tests against it.
> The final goal is to properly test the end-to-end use case, starting from pip
> installation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]