Re: Massive fetch fails, io errors in TransportRequestHandler

2017-09-28 Thread Vadim Semenov
Looks like there's slowness in sending shuffle files. Maybe one executor gets overwhelmed by all the other executors trying to pull data from it? Try raising `spark.network.timeout` further; we ourselves had to push it to 600s from the default 120s.

On Thu, Sep 28, 2017 at 10:19 AM, Ilya Karpov
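
For reference, a minimal sketch of how the timeout suggested above could be set when building the job's configuration. The application name is a hypothetical placeholder, and 600s is simply the value that worked for the job mentioned in this reply, not a general recommendation.

```scala
import org.apache.spark.SparkConf

// Sketch only: raise the network timeout from the 120s default to 600s.
// "shuffle-heavy-job" is a made-up application name for illustration.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.network.timeout", "600s")
```

The same setting can also be passed at submit time with `--conf spark.network.timeout=600s`.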

Massive fetch fails, io errors in TransportRequestHandler

2017-09-28 Thread Ilya Karpov
Hi, I see strange behaviour in my job and can’t understand what is wrong: the stage that uses shuffle data as input fails a number of times because of org.apache.spark.shuffle.FetchFailedException, as seen in the Spark UI: FetchFailed(BlockManagerId(8, hostname, 11431, None), shuffleId=1,