Andrew Or created SPARK-7718:
--------------------------------

             Summary: Speed up data source partitioning by avoiding cleaning 
closures
                 Key: SPARK-7718
                 URL: https://issues.apache.org/jira/browse/SPARK-7718
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.4.0
            Reporter: Andrew Or
            Assignee: Andrew Or
            Priority: Critical


The new partitioning support strategy creates many RDDs (1 per partition, 
potentially several thousand), then calls `mapPartitions` on each of these 
RDDs. This causes us to clean the same closure many times. Since we provide 
the closure in Spark we know for sure it is serializable, so we can bypass 
the cleaning for performance.

According to [~yhuai], cleaning 5000 closures takes 6-7 seconds of a 
12-second job that involves data source partitioning.
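A minimal sketch of the idea, in plain Scala with a simplified stand-in for Spark's ClosureCleaner (the `clean` helper and all names below are hypothetical, not Spark's actual internals): instead of re-cleaning the identical closure once per per-partition RDD, clean it once (or skip cleaning entirely when the closure is defined inside Spark and known to be serializable) and reuse it.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCleaningSketch {
  // Stand-in for ClosureCleaner.clean: verifies the closure is
  // serializable by actually serializing it. This per-call reflection
  // and serialization work is what adds up across thousands of RDDs.
  def clean[T](closure: T): T = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(closure)
    closure
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 5000
    val f: Int => Int = x => x + 1

    // Naive: clean the same closure once per partition RDD.
    // (1 to numPartitions).foreach(_ => clean(f))

    // Proposed: the closure comes from Spark itself and is known to be
    // serializable, so clean it once and reuse it across partitions.
    val cleaned = clean(f)
    val sum = (1 to numPartitions).map(_ => cleaned(1)).sum
    println(sum) // 2 * numPartitions = 10000
  }
}
```

In Spark itself the equivalent change would bypass the `sc.clean(...)` call on the per-partition map path rather than literally caching a cleaned closure, but the cost model is the same: cleaning is O(number of RDDs) when it only needs to happen once.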



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
