[
https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577893#comment-16577893
]
zhengruifeng commented on SPARK-24581:
--------------------------------------
It may be worthwhile to support a resettable iterator in BarrierTaskContext
when the RDD is cached.
With BarrierTaskContext, other distributed systems such as MPI may be used, and
it is common to iterate over a partition many times. Currently, mapPartitions
after barrier does not support repeated iteration, so it is up to users to
cache the partition themselves.
One example is XGBoost on Spark:
https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L124
XGBoost has to create a temporary file for external-memory storage, even if
the entire dataset is already cached.
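
To illustrate the workaround users currently need, here is a minimal sketch in plain Python (not Spark code). The function name two_pass_stats and the statistics it computes are hypothetical, chosen only to show an algorithm that needs more than one pass. Because the partition arrives as a one-shot iterator, the function has to copy it into a list before it can iterate twice; a resettable iterator over a cached RDD would make that copy unnecessary.

```python
def two_pass_stats(rows):
    """Yield (mean, variance) for one partition using two passes.

    `rows` is a one-shot iterator, as a mapPartitions-style function
    would receive it, so it must be materialized before a second pass.
    """
    data = list(rows)  # the user-side "cache": an extra copy of the partition
    if not data:
        return  # empty partition: yield nothing

    # Pass 1: mean.
    mean = sum(data) / len(data)

    # Pass 2: population variance, which needs the mean from pass 1.
    variance = sum((x - mean) ** 2 for x in data) / len(data)

    yield (mean, variance)


# A plain iterator is exhausted after one pass, which is the limitation
# described above.
it = iter([1.0, 2.0, 3.0, 4.0])
first_pass = list(it)
second_pass = list(it)  # empty: nothing left to read

result = list(two_pass_stats(iter([1.0, 2.0, 3.0, 4.0])))
```

In PySpark a function of this shape could presumably be passed to rdd.barrier().mapPartitions(...); that wiring is an assumption here and is not exercised by the sketch.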
> Design: BarrierTaskContext.barrier()
> ------------------------------------
>
> Key: SPARK-24581
> URL: https://issues.apache.org/jira/browse/SPARK-24581
> Project: Spark
> Issue Type: Story
> Components: ML, Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Jiang Xingbo
> Priority: Major
>
> We need to provide a communication barrier function to help users
> coordinate tasks within a barrier stage. This is very similar to the
> MPI_Barrier function in MPI. This story is for its design.
>
> Requirements:
> * Low latency. The tasks should be unblocked soon after all tasks have
> reached this barrier. Latency is more important than CPU cycles here.
> * Support unlimited timeout with proper logging. DL tasks might take a
> very long time to converge, so we should support an unlimited timeout,
> with proper logging so that users know why a task is waiting.
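
The two requirements above can be sketched with a single-process stand-in that uses threads in place of Spark tasks. This is not the Spark implementation or its API; the class LoggingBarrier, its parameters, and run_demo are hypothetical names used only for illustration. Waiters are released as soon as the last party arrives (low latency), and there is no overall deadline: a waiter only wakes up periodically to record why it is still blocked.

```python
import threading
import time


class LoggingBarrier:
    """Hypothetical barrier for `parties` threads with periodic wait logging."""

    def __init__(self, parties, log_interval=1.0):
        self.parties = parties
        self.log_interval = log_interval  # seconds between "still waiting" logs
        self.cond = threading.Condition()
        self.arrived = 0
        self.generation = 0  # incremented each time the barrier opens
        self.log = []

    def wait(self, task_id):
        with self.cond:
            gen = self.generation
            self.arrived += 1
            if self.arrived == self.parties:
                # Last arrival: reset and release all waiters immediately.
                self.arrived = 0
                self.generation += 1
                self.cond.notify_all()
                return
            start = time.monotonic()
            # No overall timeout: loop until the barrier opens.
            while gen == self.generation:
                notified = self.cond.wait(timeout=self.log_interval)
                if not notified and gen == self.generation:
                    elapsed = time.monotonic() - start
                    self.log.append(
                        f"task {task_id} still waiting after {elapsed:.1f}s "
                        f"({self.arrived}/{self.parties} arrived)"
                    )


def run_demo(parties=3, slow_delay=0.3, log_interval=0.1):
    barrier = LoggingBarrier(parties, log_interval)
    passed = []

    def task(task_id):
        if task_id == parties - 1:
            time.sleep(slow_delay)  # one straggler keeps the others waiting
        barrier.wait(task_id)
        passed.append(task_id)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(passed), barrier.log


passed, log = run_demo()
```

Python's built-in threading.Barrier was not used here because a timed-out wait() on it breaks the barrier, which does not fit the "unlimited timeout with logging" requirement. A real design would also have to coordinate across processes and hosts, which this sketch does not attempt.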