[
https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309492#comment-17309492
]
Haejoon Lee edited comment on SPARK-34849 at 3/26/21, 3:23 PM:
---------------------------------------------------------------
The official SPIP voting started at
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html
was (Author: itholic):
The SPIP voting started at
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html
> SPIP: Support pandas API layer on PySpark
> -----------------------------------------
>
> Key: SPARK-34849
> URL: https://issues.apache.org/jira/browse/SPARK-34849
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Haejoon Lee
> Priority: Blocker
> Labels: SPIP
>
> This is a SPIP for porting the [Koalas
> project|https://github.com/databricks/koalas] to PySpark. It was
> previously discussed on the dev mailing list under the same title: [[DISCUSS] Support
> pandas API layer on
> PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> Port Koalas into PySpark to provide a pandas API layer on PySpark, so that:
> - Users can easily leverage their existing Spark cluster to scale their
> pandas workloads.
> - Users can plot and draw charts in PySpark.
> - Users can easily switch between pandas APIs and PySpark APIs.
> *Q2. What problem is this proposal NOT designed to solve?*
> Some pandas APIs are explicitly unsupported. For example, {{memory_usage}}
> will not be supported because, unlike in pandas, DataFrames in Spark are
> not materialized in memory.
> This proposal does not replace the existing PySpark APIs. The PySpark API has
> many users and existing code in many projects, and there are still many PySpark
> users who prefer Spark’s immutable DataFrame API to the pandas API.
> *Q3. How is it done today, and what are the limits of current practice?*
> Current practice has two limitations:
> # Many features that are very commonly used in data science are missing
> from Apache Spark. In particular, plotting and charting, among the most
> important features that almost every data scientist uses in their daily
> work, are missing.
> # Data scientists tend to prefer pandas APIs, but it is very hard to convert
> pandas code to PySpark APIs when they need to scale their workloads. This is
> because PySpark APIs are harder to learn than pandas' and many features are
> missing in PySpark.
> *Q4. What is new in your approach and why do you think it will be successful?*
> I believe this offers a new way for both PySpark and pandas users to easily
> scale their workloads. I think it can succeed because more and more
> people use Python and pandas. In fact, there are already similar
> efforts, such as Dask and Modin, which are both growing fast and succeeding.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Anyone who wants to scale their pandas workloads on their Spark cluster. It
> will also significantly improve the usability of PySpark.
> *Q6. What are the risks?*
> Technically, I don't see many risks yet, given that:
> - Koalas has been developed separately for more than two years and has
> greatly improved in maturity and stability.
> - Koalas will be ported into PySpark as a separate package.
> The risk is more about putting documentation and test cases in place properly
> and handling dependencies properly. For example, Koalas currently uses pytest
> with various dependencies, whereas PySpark uses plain unittest with fewer
> dependencies.
> In addition, Koalas' default index system may not be well received because
> it can potentially cause overhead, so applying it properly to PySpark might
> be a challenge.
> *Q7. How long will it take?*
> Before the Spark 3.2 release.
> *Q8. What are the mid-term and final “exams” to check for success?*
> The first check for success is to make sure that all the existing
> Koalas APIs and tests work as they are, without affecting existing
> Koalas workloads on PySpark.
> The final check is to confirm, through user feedback and PySpark usage
> statistics, whether the usability and convenience that we aim for have
> actually improved.
> *Also refer to:*
> - [Koalas internals
> documentation|https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit]
> - [[VOTE] SPIP: Support pandas API layer on
> PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html]