[
https://issues.apache.org/jira/browse/SPARK-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kay Ousterhout resolved SPARK-7058.
-----------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
> Task deserialization time metric does not include time to deserialize
> broadcasted RDDs
> --------------------------------------------------------------------------------------
>
> Key: SPARK-7058
> URL: https://issues.apache.org/jira/browse/SPARK-7058
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.2, 1.2.3, 1.3.1, 1.4.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Critical
> Fix For: 1.4.0
>
>
> The web UI's "task deserialization time" metric is slightly misleading
> because it does not capture the time taken to deserialize the broadcasted
> RDD. Currently, this statistic is measured in {{Executor.run}} by measuring
> the time to deserialize the {{Task}} instance:
> https://github.com/apache/spark/blob/bdc5c16e76c5d0bc147408353b2ba4faa8e914fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193
> As of Spark 1.1.0, we transfer RDDs using broadcast variables rather than
> sending them directly as part of the {{Task}} object (see SPARK-2521 for more
> details). As a result, the deserialization of the RDD is performed outside
> of this block and is not accounted for in this statistic. As a result, the
> reported task deserialization time may be a severe underestimate of the
> actual time.
> To measure actual RDD deserialization time, I hacked the following change
> into ShuffleMapTask:
> {code}
> diff --git
> a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> index 6c7d000..adab574 100644
> --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> @@ -57,8 +57,10 @@ private[spark] class ShuffleMapTask(
> override def runTask(context: TaskContext): MapStatus = {
> // Deserialize the RDD using the broadcast variable.
> val ser = SparkEnv.get.closureSerializer.newInstance()
> + val deserializeStartTime = System.currentTimeMillis()
> val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
> ByteBuffer.wrap(taskBinary.value),
> Thread.currentThread.getContextClassLoader)
> + println(s"Deserialized a shuffle map task in
> ${System.currentTimeMillis() - deserializeStartTime} ms")
> {code}
> For one of my benchmark jobs (a SQL aggregation query that used code
> generation), the actual deserialization time was ~150ms per task even though
> the UI only reported 1ms.
> I think that this should be pretty easy to fix by simply adding additional
> calls in ShuffleMapTask and ResultTask to increment the deserialization time
> metric.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]