Hi All,

This is follow-up work to [SPARK-24374 <https://issues.apache.org/jira/browse/SPARK-24374>] SPIP: Support Barrier Execution Mode in Apache Spark.

Design doc: https://docs.google.com/document/d/1r07-vU5JTH6s1jJ6azkmK0K5it6jwpfO6b_K3mJmxR4/edit?usp=sharing

We need to provide a communication barrier function to help coordinate tasks within a barrier stage, which is frequently required by ML/DL workloads. Similar to the MPI_Barrier function in MPI, a call to barrier() blocks until all tasks in the same stage have reached this routine.

The design doc proposes implementing barrier() on top of the netty-based RPC framework in Spark. It introduces a new driver-side BarrierCoordinator and a new BarrierCoordinatorMessage, as well as a new config to handle timeouts.

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo
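
For readers less familiar with MPI-style barriers, here is a minimal sketch of the blocking semantics described above. It is an illustration only: it uses Python threads in a single process instead of Spark's netty-based RPC, and the names ToyBarrierCoordinator, run_stage, and timeout_s are hypothetical and do not come from the design doc.

```python
import threading


class ToyBarrierCoordinator:
    """Hypothetical stand-in for a driver-side coordinator (not Spark code)."""

    def __init__(self, num_tasks, timeout_s):
        # One shared barrier sized to the number of tasks in the stage.
        self._barrier = threading.Barrier(num_tasks)
        self._timeout_s = timeout_s

    def barrier(self):
        """Block until every task has called barrier(), or fail on timeout."""
        try:
            self._barrier.wait(timeout=self._timeout_s)
        except threading.BrokenBarrierError:
            raise TimeoutError("not all tasks reached the barrier in time") from None


def run_stage(num_tasks, timeout_s=5.0):
    """Run num_tasks toy tasks and return the ordered list of logged events."""
    coordinator = ToyBarrierCoordinator(num_tasks, timeout_s)
    events = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("before", task_id))
        coordinator.barrier()  # no task proceeds until all have arrived
        with lock:
            events.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for kind, task_id in run_stage(4):
        print(kind, task_id)
```

Every "before" event is logged ahead of any "after" event, because no task returns from barrier() until all tasks have entered it. If some task never arrives, the waiting tasks fail after timeout_s instead of hanging forever, which is the kind of situation the proposed timeout config is meant to handle.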