Andrew Or created SPARK-7718:
--------------------------------
Summary: Speed up data source partitioning by avoiding cleaning
closures
Key: SPARK-7718
URL: https://issues.apache.org/jira/browse/SPARK-7718
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical
The new partitioning support strategy creates many RDDs (one per partition,
potentially several thousand), then calls `mapPartitions` on every single one
of these RDDs. This causes the same closure to be cleaned many times. Since
Spark provides the closure itself, we know for sure it is serializable, so we
can bypass the cleaning for performance.
According to [~yhuai], cleaning 5000 closures takes up to 6-7 seconds of a
12-second job that involves data source partitioning.
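
A minimal sketch of the problem, assuming hypothetical stand-ins for the data
source code path (`partitionRDDs`, `readPartition` are illustrative names, not
the actual Spark SQL internals):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ClosureCleaningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[*]"))

    // Hypothetical stand-in for the partitioning strategy: one RDD per
    // partition, potentially thousands of them.
    val partitionRDDs: Seq[RDD[Int]] = (1 to 5000).map(i => sc.parallelize(Seq(i)))

    // The closure applied to every partition. Spark constructs it, so its
    // serializability is already known.
    val readPartition: Iterator[Int] => Iterator[Int] = iter => iter.map(_ * 2)

    // Current behavior: every mapPartitions call runs the ClosureCleaner on
    // `readPartition`, so the same closure is cleaned 5000 times.
    val mapped = partitionRDDs.map(_.mapPartitions(readPartition))

    // Proposed direction (sketch only): route these internal calls through a
    // code path that skips the ClosureCleaner for closures Spark already
    // knows are serializable, rather than anything exposed to users.

    sc.stop()
  }
}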