Re: [DESIGN] Barrier Execution Mode

2018-07-08 Thread Reynold Xin
Xingbo,

Please reference the SPIP and JIRA ticket next time: [SPARK-24374] SPIP:
Support Barrier Scheduling in Apache Spark

On Sun, Jul 8, 2018 at 9:45 AM Xingbo Jiang wrote:

> Hi All,
>
> I would like to invite you to review the design document for Barrier
> Execution Mode:
>
> https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#
>
> TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit. A
> major part of the project involves significant changes to the execution
> mode of Spark. This design doc proposes new APIs as well as a new
> execution mode (known as barrier execution mode) to provide
> high-performance support for DL workloads.
>
> Major changes include:
>
>    - Add RDDBarrier to support gang scheduling.
>    - Add BarrierTaskContext to support global sync of all tasks in a
>    stage.
>    - Better fault tolerance for barrier stages: if some tasks fail
>    midway, retry all tasks in the same stage.
>    - Integrate barrier execution mode with the Standalone cluster manager.
>
> Please feel free to review and discuss the design proposal.
>
> Thanks,
> Xingbo
>
>


[DESIGN] Barrier Execution Mode

2018-07-08 Thread Xingbo Jiang
Hi All,

I would like to invite you to review the design document for Barrier
Execution Mode:
https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#

TL;DR: We announced Project Hydrogen at the recent Spark+AI Summit. A major
part of the project involves significant changes to the execution mode of
Spark. This design doc proposes new APIs as well as a new execution mode
(known as barrier execution mode) to provide high-performance support for
DL workloads.

Major changes include:

   - Add RDDBarrier to support gang scheduling.
   - Add BarrierTaskContext to support global sync of all tasks in a stage.
   - Better fault tolerance for barrier stages: if some tasks fail
   midway, retry all tasks in the same stage.
   - Integrate barrier execution mode with the Standalone cluster manager.
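
To illustrate the semantics listed above, here is a minimal Python sketch.
It is not Spark code and does not use the proposed RDDBarrier or
BarrierTaskContext APIs; it only mimics the intended behavior with plain
threads. The names run_barrier_stage, train_step, and max_attempts are
hypothetical and chosen for illustration. The sketch assumes that all tasks
of a stage are launched together (gang scheduling), that tasks can wait on a
shared barrier (global sync), and that a failure in any task causes the
whole stage to be retried.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Run num_tasks tasks together; if any task fails, retry all of them."""
    for attempt in range(1, max_attempts + 1):
        # One fresh barrier per attempt, shared by every task in the stage.
        barrier = threading.Barrier(num_tasks)

        def run_task(task_id, attempt=attempt, barrier=barrier):
            try:
                return task_fn(task_id, attempt, barrier)
            except Exception:
                # Break the barrier so peers blocked in wait() fail fast
                # instead of hanging until their timeout.
                barrier.abort()
                raise

        # max_workers == num_tasks stands in for gang scheduling: every
        # task gets a slot at the same time, so the barrier can be reached.
        with ThreadPoolExecutor(max_workers=num_tasks) as pool:
            futures = [pool.submit(run_task, i) for i in range(num_tasks)]
        # Leaving the "with" block waits for all tasks to finish.

        if all(f.exception() is None for f in futures):
            return attempt, [f.result() for f in futures]
        # Otherwise discard partial results and retry the whole stage.

    raise RuntimeError(f"barrier stage failed after {max_attempts} attempts")


def train_step(task_id, attempt, barrier):
    """Hypothetical task: task 2 fails once, then all tasks sync and finish."""
    if attempt == 1 and task_id == 2:
        raise RuntimeError("simulated task failure")
    barrier.wait(timeout=5)  # global sync point across all tasks
    return f"task {task_id} ok on attempt {attempt}"


if __name__ == "__main__":
    attempt, results = run_barrier_stage(train_step, num_tasks=4)
    print(attempt, results)
```

In this sketch the first attempt fails because one task raises, which also
aborts the barrier for its peers; the second attempt reruns all four tasks
rather than only the failed one.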

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo