Hi,

I have been investigating scheduling delays in Spark and found some
unexplained anomalies.  In my use case, I have two stages after
collapsing the transformations: the first is a mapPartitions() and the
second is a sortByKey().  I found that task serialization for the
first stage takes much longer than for the second.

1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363
ms). Each task serializes to 1220 bytes.
2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms). Each
task serializes to 1139 bytes.

Note that the serialized sizes of the tasks are similar, but the avg.
scheduling times are very different.  I also instrumented the code to
print out the serialization time, and it does appear to be the
serialization itself that takes much longer.  This seemed weird to me
because the biggest part of the Task, the taskBinary, is actually
copied directly from a byte array.
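
For what it's worth, with plain JDK serialization the cost is driven
mostly by object-graph traversal and reflection rather than by the
number of output bytes, so two payloads of similar serialized size can
take very different times.  A standalone sketch of that effect (plain
java.io serialization, nothing Spark-specific; the classes here are
made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerTimingDemo {
    // Flat payload: one byte array; ObjectOutputStream writes it as a block.
    static class Flat implements Serializable {
        final byte[] data = new byte[1024];
    }

    // Nested payload of comparable serialized size: the serializer must
    // walk every node reflectively and track back-references.
    static class Node implements Serializable {
        final int value;
        final Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    static long timeNanos(Object obj) throws IOException {
        long start = System.nanoTime();
        serialize(obj);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Node chain = null;
        for (int i = 0; i < 60; i++) chain = new Node(i, chain);
        Flat flat = new Flat();

        // Warm-up so JIT and class-metadata costs don't dominate the timing.
        for (int i = 0; i < 1000; i++) { serialize(flat); serialize(chain); }

        System.out.printf("flat:  %d bytes, %d ns%n",
                serialize(flat).length, timeNanos(flat));
        System.out.printf("chain: %d bytes, %d ns%n",
                serialize(chain).length, timeNanos(chain));
    }
}
```

I am not claiming this is what happens inside Task serialization --
just that byte count alone does not predict serialization time on the
JVM.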

Any explanation of why this happens?

Thanks,
Akshat
