[
https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-37649:
---------------------------------
Priority: Critical (was: Major)
> Switch default index to distributed-sequence by default in pandas API on Spark
> ------------------------------------------------------------------------------
>
> Key: SPARK-37649
> URL: https://issues.apache.org/jira/browse/SPARK-37649
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Critical
> Labels: release-notes
> Fix For: 3.3.0
>
>
> pandas API on Spark currently sets {{compute.default_index_type}} to
> {{sequence}}, which relies on sending all data to one executor and can
> easily cause OOM.
> We should switch to the {{distributed-sequence}} type, which truly
> distributes the data.
> With this change, we can now leverage
> https://issues.apache.org/jira/browse/SPARK-36559 and
> https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users
> will benefit from a significant performance improvement.
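
To illustrate why a distributed sequence avoids the single-executor bottleneck described above, here is a minimal pure-Python sketch. It is not Spark's implementation: partitions are modeled as plain lists, and the function name `distributed_sequence_index` is hypothetical. The assumed idea is that only per-partition row counts are collected centrally, after which each partition can number its own rows locally.

```python
from itertools import accumulate
from typing import Any, List, Sequence, Tuple


def distributed_sequence_index(
    partitions: Sequence[Sequence[Any]],
) -> List[List[Tuple[int, Any]]]:
    """Attach a global 0..n-1 index to rows without merging partitions.

    Hypothetical helper for illustration only; partitions are plain lists.
    """
    # Step 1: each partition reports only its row count (no row data moves).
    counts = [len(part) for part in partitions]
    # Step 2: an exclusive prefix sum gives each partition its starting offset.
    offsets = [0] + list(accumulate(counts))[:-1]
    # Step 3: each partition numbers its own rows locally from its offset.
    return [
        [(offset + i, row) for i, row in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


if __name__ == "__main__":
    for part in distributed_sequence_index([["a", "b"], [], ["c"], ["d", "e"]]):
        print(part)
```

In pandas API on Spark itself, the index type discussed in this issue is selected with the existing option call `ps.set_option("compute.default_index_type", "distributed-sequence")`, where `ps` is `pyspark.pandas`; setting it to `"sequence"` selects the previous default.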