Nitin Goyal created SPARK-7970:
----------------------------------

             Summary: Optimize code for SQL queries fired on Union of RDDs 
(closure cleaner)
                 Key: SPARK-7970
                 URL: https://issues.apache.org/jira/browse/SPARK-7970
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 1.3.0, 1.2.0
            Reporter: Nitin Goyal


The closure cleaner slows down the execution of Spark SQL queries fired on a 
union of RDDs. The driver-side time increases linearly with the number of RDDs 
unioned. Refer to the following thread for more context:

http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html

As can be seen in the attached JProfiler screenshots, a lot of time is spent 
in the "getClassReader" method of ClosureCleaner and the rest in 
"ensureSerializable" (at least in my case).

This can be fixed in two ways (as per my current understanding):

1. Fix at Spark SQL level - As pointed out by yhuai, we can create 
MapPartitionsRDD directly instead of calling rdd.mapPartitions, which calls 
ClosureCleaner's clean method.

2. Fix at Spark core level -
  (i) Make "checkSerializable" property-driven in SparkContext's clean method
  (ii) Cache the class reader for the last 'n' classes in some way
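
To make the two ideas in option 2 concrete, here is a minimal sketch in Python. It is only an illustration and not Spark code: ClosureCleaner itself is written in Scala, and the names LruCache, load_class_reader, clean and check_serializable below are hypothetical stand-ins. It assumes that looking up a class reader is the expensive step, that the same closure class is cleaned once per unioned RDD, and it uses pickle as a stand-in for the serializability check.

```python
import pickle
from collections import OrderedDict


class LruCache:
    """Keeps the `capacity` most recently used entries (hypothetical helper)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # expensive lookup to be avoided
        self.entries = OrderedDict()
        self.misses = 0                 # how often the loader had to run

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return value


def load_class_reader(class_name):
    # Stand-in for the expensive bytecode read; not the real getClassReader.
    return "reader:" + class_name


def clean(class_name, closure_state, reader_cache, check_serializable=True):
    """Toy clean step: cached reader lookup plus an optional serialization check."""
    reader = reader_cache.get(class_name)
    if check_serializable:
        # Raises if closure_state cannot be serialized; skipped when the
        # (hypothetical) property turns the check off.
        pickle.dumps(closure_state)
    return reader


# Cleaning the same closure class once per unioned RDD hits the loader once.
cache = LruCache(capacity=2, loader=load_class_reader)
for _ in range(100):
    clean("QueryClosure", {"partition": 0}, cache, check_serializable=False)
print(cache.misses)  # 1
```

With the cache in place, the per-RDD cost of the reader lookup drops to a dictionary hit, and the flag lets a caller that already trusts its closures skip the serialization pass. Whether either change is safe inside Spark itself is a separate question that this sketch does not answer.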



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

