Andrew Or created SPARK-7718:
--------------------------------

Summary: Speed up data source partitioning by avoiding cleaning closures
Key: SPARK-7718
URL: https://issues.apache.org/jira/browse/SPARK-7718
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

The new partitioning support strategy creates many RDDs (one per partition, potentially several thousand), then calls `mapPartitions` on every one of them. As a result, we clean the same closure many times. Since the closure is provided by Spark itself, we know for sure it is serializable, so we can bypass the cleaning for performance. According to [~yhuai], cleaning 5000 closures takes up to 6-7 seconds in a 12-second job that involves data source partitioning.
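
To make the cost pattern concrete, here is a toy Python model of the situation described above. It is not Spark code: `ToyRDD`, `map_partitions_no_clean`, `add_partition_value`, and `run` are hypothetical names invented for illustration, and the "cleaning" step is only a counter standing in for the real, expensive pass. It only shows that one shared closure applied to N per-partition RDDs gets cleaned N times unless the trusted call path skips cleaning.

```python
# Toy model only; every name below is hypothetical and none of it is Spark API.

class ToyRDD:
    """Stand-in for one per-partition RDD."""

    clean_calls = 0  # how many times the (simulated) cleaner has run

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def _clean(cls, func):
        # Placeholder for an expensive closure-cleaning pass.
        cls.clean_calls += 1
        return func

    def map_partitions(self, func):
        # Public-style entry point: always cleans the closure first.
        cleaned = self._clean(func)
        return ToyRDD(list(cleaned(iter(self.rows))))

    def map_partitions_no_clean(self, func):
        # Internal-style entry point: trusts the caller and skips cleaning.
        return ToyRDD(list(func(iter(self.rows))))


def add_partition_value(rows):
    # The single framework-provided closure reused for every partition.
    return (row + ("p",) for row in rows)


def run(num_partitions, skip_cleaning):
    """Apply the same closure to one ToyRDD per partition; return clean count."""
    ToyRDD.clean_calls = 0
    rdds = [ToyRDD([(i,)]) for i in range(num_partitions)]
    for rdd in rdds:
        if skip_cleaning:
            rdd.map_partitions_no_clean(add_partition_value)
        else:
            rdd.map_partitions(add_partition_value)
    return ToyRDD.clean_calls
```

In this model, `run(5000, skip_cleaning=False)` performs one cleaning pass per partition, while `run(5000, skip_cleaning=True)` performs none and produces the same rows. Skipping is only safe under the assumption stated in the issue, namely that the closure comes from the framework and is known to be serializable.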