Hi,

I opened https://issues.apache.org/jira/browse/SPARK-22339 a few days ago,
and I would like to get some feedback on it. The idea is to push epoch
updates to the executors after a fetch failure by piggybacking on the
executor heartbeat response, in order to fail faster when an executor and
its shuffle blocks are lost, instead of having to wait for all fetch
retries to fail and a new task to be started on the reader executors. This
can speed up job execution, particularly when executors are lost at the end
of a stage in a Spark application that runs a single action at a time. There
are more details and a draft patch for this in the JIRA.
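
To make the mechanism concrete, here is a toy model of the idea in plain
Python. It is only a sketch and not the draft patch: the Driver and Executor
classes, their methods, and the dictionary used as the heartbeat response are
all hypothetical names made up for illustration, not Spark APIs.

```python
# Toy model only: every class and method name here is hypothetical, not a Spark API.
from dataclasses import dataclass, field


@dataclass
class Driver:
    epoch: int = 0

    def on_fetch_failure(self) -> None:
        # A reported fetch failure means some map outputs are gone: bump the epoch.
        self.epoch += 1

    def heartbeat_response(self) -> dict:
        # Piggyback the current epoch on the reply to every executor heartbeat.
        return {"epoch": self.epoch}


@dataclass
class Executor:
    epoch: int = 0
    cached_locations: dict = field(default_factory=dict)

    def on_heartbeat_response(self, response: dict) -> bool:
        # A newer epoch means cached shuffle locations may be stale: drop them
        # so readers can fail fast instead of exhausting their fetch retries.
        if response["epoch"] > self.epoch:
            self.epoch = response["epoch"]
            self.cached_locations.clear()
            return True
        return False


driver = Driver()
reader = Executor(cached_locations={"shuffle_0": "executor-1"})

# No failure yet: the heartbeat carries the same epoch, nothing is invalidated.
unchanged = reader.on_heartbeat_response(driver.heartbeat_response())

# After a fetch failure, the next heartbeat response invalidates the cache.
driver.on_fetch_failure()
invalidated = reader.on_heartbeat_response(driver.heartbeat_response())
```

In this model the reader learns about the lost shuffle blocks on its next
heartbeat rather than only after its own fetch retries have all failed.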

Looking forward to your feedback on this.

Greetings,

Juan
