Hi,
Any input on this?  I'm willing to instrument further and experiment
if there are any ideas.

On Mon, May 4, 2015 at 11:27 AM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> I have been investigating scheduling delays in Spark and I found some
> unexplained anomalies.  In my use case, I have two stages after
> collapsing the transformations: the first is a mapPartitions() and the
> second is a sortByKey().  I found that task serialization for the
> first stage takes much longer than for the second.
>
> 1. mapPartitions() - this launches 256 tasks in 603 ms (avg. 2.363
> ms). Each task serializes to 1220 bytes.
> 2. sortByKey() - this launches 64 tasks in 12 ms (avg. 0.187 ms). Each
> task serializes to 1139 bytes.
>
> Note that the serialized size of the task is similar, but the avg.
> scheduling time is very different.  I also instrumented the code to
> print out the serialization time, and it seems like it is indeed the
> serialization that takes much longer.  This seemed weird to me because
> the biggest part of the Task, the taskBinary, is actually copied
> directly from a byte array.
>
> Any explanation of why this happens?
>
> Thanks,
> Akshat
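
The per-stage measurement described above (total launch time, average time per task, serialized size per task) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Spark's scheduler code: Python's pickle stands in for the task serializer, and make_stage, time_serialization, the dict layout, and the 1024-byte payload are all hypothetical names and values chosen for the example.

```python
import pickle
import time


def make_stage(stage_id, num_tasks, payload_size):
    """Build hypothetical task records; the payload mimics a pre-serialized taskBinary."""
    payload = bytes(payload_size)  # shared byte array, copied as-is on serialization
    return [
        {"stage_id": stage_id, "partition_id": p, "task_binary": payload}
        for p in range(num_tasks)
    ]


def time_serialization(tasks):
    """Serialize every task and return (total_ms, avg_ms, avg_bytes)."""
    if not tasks:
        raise ValueError("need at least one task to compute an average")
    sizes = []
    start = time.perf_counter()
    for task in tasks:
        # pickle is an assumption here, standing in for the real task serializer
        sizes.append(len(pickle.dumps(task)))
    total_ms = (time.perf_counter() - start) * 1000.0
    n = len(tasks)
    return total_ms, total_ms / n, sum(sizes) / n


if __name__ == "__main__":
    # Task counts mirror the two stages in the report; payload size is made up.
    stages = [("stage 0", make_stage(0, 256, 1024)),
              ("stage 1", make_stage(1, 64, 1024))]
    for name, tasks in stages:
        total_ms, avg_ms, avg_bytes = time_serialization(tasks)
        print(f"{name}: {len(tasks)} tasks in {total_ms:.3f} ms "
              f"(avg. {avg_ms:.4f} ms), avg. {avg_bytes:.0f} bytes")
```

Timing each stage separately, and repeating the whole run a few times, makes it possible to see whether a gap between stages persists across repetitions or only shows up in the first batch that is serialized.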
