Hi, I have been investigating scheduling delays in Spark and found an anomaly I cannot explain. In my use case, I have two stages after the transformations are collapsed: the first is a mapPartitions() and the second is a sortByKey(). Task serialization for the first stage takes much longer than for the second.
1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363 ms per task). Each task serializes to 1220 bytes.
2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms per task). Each task serializes to 1139 bytes.

Note that the serialized task sizes are similar, but the average scheduling time per task is very different. I also instrumented the code to print out the serialization time, and it is indeed the serialization step that takes much longer. This seems odd to me because the biggest part of the Task, the taskBinary, is copied directly from a byte array.

Is there any explanation for why this happens?

Thanks,
Akshat
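To make the size of the gap concrete, here is a small sketch that recomputes the per-task cost from the totals quoted above. The dictionary layout and function names are only illustrative and are not part of Spark. Because the quoted totals are rounded to whole milliseconds, the recomputed averages may differ slightly from the ones I listed.

```python
# Figures copied from the two stages described above.
# The structure and names here are illustrative, not anything from Spark.
stages = {
    "mapPartitions": {"tasks": 256, "launch_ms": 603, "task_bytes": 1220},
    "sortByKey": {"tasks": 64, "launch_ms": 12, "task_bytes": 1139},
}


def per_task_ms(stage):
    """Average launch time per task, in milliseconds."""
    return stage["launch_ms"] / stage["tasks"]


def us_per_byte(stage):
    """Average launch time per serialized byte, in microseconds."""
    return per_task_ms(stage) * 1000.0 / stage["task_bytes"]


for name, stage in stages.items():
    print(f"{name}: {per_task_ms(stage):.3f} ms/task, "
          f"{us_per_byte(stage):.3f} us/byte")

# How much slower is the first stage per task, and how much bigger is its task?
slowdown = per_task_ms(stages["mapPartitions"]) / per_task_ms(stages["sortByKey"])
size_ratio = stages["mapPartitions"]["task_bytes"] / stages["sortByKey"]["task_bytes"]
print(f"per-task slowdown: {slowdown:.1f}x, serialized size ratio: {size_ratio:.2f}x")
```

The per-task cost differs by roughly an order of magnitude while the serialized size differs by only a few percent, which is why the byte count alone does not seem to account for the difference.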