[
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nitin Goyal updated SPARK-7970:
-------------------------------
Attachment: Screen Shot 2015-05-27 at 11.01.03 pm.png
> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> ----------------------------------------------------------------------
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 1.2.0, 1.3.0
> Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down Spark SQL queries run on a union of RDDs.
> The time spent on the driver side increases linearly with the number of
> RDDs unioned. See the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As the attached JProfiler screenshots show, much of the time is spent in
> the "getClassReader" method of ClosureCleaner and the rest in
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which
> calls the ClosureCleaner clean method.
> 2. Fix at the Spark core level -
> (i) Make "checkSerializable" property-driven in SparkContext's clean method
> (ii) Somehow cache the class reader for the last 'n' classes
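Option 2(ii) can be illustrated with a minimal, language-neutral sketch in Python. This is not Spark code: `LruCache` and `load_class_bytes` are hypothetical names, and the loader merely stands in for whatever expensive per-class work "getClassReader" does. The sketch assumes that the unioned RDDs reuse the same closure classes, so a small cache of the last 'n' classes would absorb the repeated lookups.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class LruCache(Generic[K, V]):
    """Keeps only the `capacity` most recently used entries.

    Hypothetical helper for illustration; not part of Spark.
    """

    def __init__(self, capacity: int, loader: Callable[[K], V]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._loader = loader
        self._entries: "OrderedDict[K, V]" = OrderedDict()
        self.loads = 0  # number of times the expensive loader actually ran

    def get(self, key: K) -> V:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: do the expensive work once and remember the result.
        value = self._loader(key)
        self.loads += 1
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value


def load_class_bytes(class_name: str) -> bytes:
    # Stand-in for an expensive class-file read (assumption, not Spark's logic).
    return class_name.encode("utf-8")


if __name__ == "__main__":
    cache = LruCache(capacity=2, loader=load_class_bytes)
    # Simulate cleaning the same closure class once per unioned RDD.
    for _ in range(100):
        cache.get("com.example.MyClosure$1")
    print(cache.loads)  # 1
```

With the cache in place, the loader runs once per distinct class rather than once per unioned RDD. Whether this helps in practice depends on how many distinct closure classes a query touches relative to 'n'.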