Looks like there's slowness in serving shuffle files; maybe one executor is
getting overwhelmed by all the other executors trying to pull data from it?
Try raising `spark.network.timeout` further; we ourselves had to push it to
600s from the default of 120s.
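For reference, a minimal sketch of how that override might be passed at submit time (the 600s value is the one mentioned above; the class and jar names are hypothetical placeholders):

```shell
# Raise the network timeout for shuffle fetches; 600s worked for us,
# but tune it for your cluster. Class and jar names below are examples only.
spark-submit \
  --conf spark.network.timeout=600s \
  --class com.example.MyJob \
  my-job.jar
```

The same setting can also be put in `spark-defaults.conf` or set on the `SparkConf` before the context is created.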
On Thu, Sep 28, 2017 at 10:19 AM, Ilya Karpov wrote:
Hi,
I see strange behaviour in my job and can't understand what is wrong:
the stage that takes shuffle data as its input fails a number of times
because of org.apache.spark.shuffle.FetchFailedException, seen in the Spark UI:
FetchFailed(BlockManagerId(8, hostname, 11431, None), shuffleId=1,