[ https://issues.apache.org/jira/browse/SPARK-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-7058.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.4.0

> Task deserialization time metric does not include time to deserialize broadcasted RDDs
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-7058
>                 URL: https://issues.apache.org/jira/browse/SPARK-7058
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.2, 1.2.3, 1.3.1, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>             Fix For: 1.4.0
>
>
> The web UI's "task deserialization time" metric is slightly misleading 
> because it does not capture the time taken to deserialize the broadcasted 
> RDD.  Currently, this statistic is measured in {{Executor.TaskRunner.run}} by 
> timing the deserialization of the {{Task}} instance: 
> https://github.com/apache/spark/blob/bdc5c16e76c5d0bc147408353b2ba4faa8e914fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193
> As of Spark 1.1.0, we transfer RDDs using broadcast variables rather than 
> sending them directly as part of the {{Task}} object (see SPARK-2521 for more 
> details).  As a result, the deserialization of the RDD is performed outside 
> of this timed block and is not accounted for in the statistic, so the 
> reported task deserialization time may be a severe underestimate of the 
> actual time.
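> For context, here is a condensed paraphrase of the timed region in 
> {{Executor.TaskRunner.run}} (names and signatures are approximate; see the 
> linked source for the real code):
> {code}
> // Condensed paraphrase, not the exact source: only the deserialization
> // of the Task object itself falls inside the timed window.
> val deserializeStartTime = System.currentTimeMillis()
> task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
> taskStart = System.currentTimeMillis()
> // task.run() is what deserializes the broadcasted RDD (inside runTask),
> // so that work happens after the timed window has already closed.
> val value = task.run(taskId)
> for (m <- task.metrics) {
>   m.executorDeserializeTime = taskStart - deserializeStartTime
> }
> {code}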
> To measure the actual RDD deserialization time, I hacked the following 
> change into ShuffleMapTask:
> {code}
> diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> index 6c7d000..adab574 100644
> --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> @@ -57,8 +57,10 @@ private[spark] class ShuffleMapTask(
>    override def runTask(context: TaskContext): MapStatus = {
>      // Deserialize the RDD using the broadcast variable.
>      val ser = SparkEnv.get.closureSerializer.newInstance()
> +    val deserializeStartTime = System.currentTimeMillis()
>      val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
>        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
> +    println(s"Deserialized a shuffle map task in ${System.currentTimeMillis() - deserializeStartTime} ms")
> {code}
> For one of my benchmark jobs (a SQL aggregation query that used code 
> generation), the actual deserialization time was ~150ms per task even though 
> the UI only reported 1ms.
> I think this should be pretty easy to fix by adding calls in ShuffleMapTask 
> and ResultTask to increment the deserialization time metric, as sketched 
> below.
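> A minimal sketch of one way to do that (the {{_executorDeserializeTime}} 
> field and the exact plumbing are hypothetical names for illustration, not 
> the final patch):
> {code}
> // Hypothetical sketch: each Task subclass records how long it spent
> // deserializing its RDD, and Executor folds that into the existing metric.
>
> // On Task:
> @volatile @transient protected var _executorDeserializeTime: Long = 0
> def executorDeserializeTime: Long = _executorDeserializeTime
>
> // In ShuffleMapTask.runTask (and analogously in ResultTask.runTask):
> val deserializeStartTime = System.currentTimeMillis()
> val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
>   ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
> _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
>
> // In Executor.TaskRunner.run, when setting the metric, add the Task's share:
> m.executorDeserializeTime = (taskStart - deserializeStartTime) + task.executorDeserializeTime
> {code}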



