[
https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548784#comment-16548784
]
Jiang Xingbo commented on SPARK-24375:
--------------------------------------
{quote}Is the 'barrier' logic pluggable, instead of only being a global sync
point?
{quote}
The barrier() function is much like the
[MPI_Barrier|https://www.mpich.org/static/docs/v3.2.1/www/www3/MPI_Barrier.html]
function in MPI; its main purpose is to provide a global sync point between
barrier tasks. I'm not sure we have any plan to support pluggable logic for
now. Do you have a use case at hand that requires a pluggable barrier()?
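To illustrate the global-sync semantics outside of Spark, here is a minimal
sketch that uses Python's standard threading.Barrier as a stand-in for barrier
tasks. This is only an analogy: run_barrier_tasks and the thread-based tasks
are hypothetical names for illustration, not the proposed Spark API.

```python
import threading


def run_barrier_tasks(num_tasks):
    """Run num_tasks threads that each record an event, wait at a shared
    barrier, then record a second event. Returns the ordered event list."""
    barrier = threading.Barrier(num_tasks)  # analogous to a global sync point
    events = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("before", task_id))
        # Blocks until all num_tasks threads have reached this call.
        barrier.wait()
        with lock:
            events.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

No task records an "after" event until every task has recorded its "before"
event, which is the guarantee a global sync point provides.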
{quote}Dynamic resource allocation (DRA) triggers allocation of additional
resources based on pending tasks. Hence the comment "We may add a check of
total available slots before scheduling tasks from a barrier stage taskset"
does not necessarily work in that context.
{quote}
Supporting barrier stages with dynamic resource allocation is a non-goal
here. However, we can improve the behavior to integrate better with DRA in
Spark 3.0.
{quote}Currently DRA in Spark allocates resources uniformly. Are we
envisioning changes as part of this effort to allocate heterogeneous executor
resources based on pending tasks (at least initially for barrier support for
GPUs)?
{quote}
There is another ongoing SPIP, SPARK-24615, to add accelerator-aware task
scheduling to Spark. I think we should deal with the above issue within that
topic.
{quote}In the face of exceptions, some tasks will wait on barrier 2 and others
on barrier 1, causing issues.{quote}
Silently catching an exception thrown by TaskContext.barrier() is not desired
behavior. However, if this does happen, we can detect it because we keep an
`epoch` on both the driver side and the executor side. More details will go
into the design doc for BarrierTaskContext.barrier(), SPARK-24581.
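As a rough illustration of how an epoch comparison could surface such a
mismatch, here is a minimal sketch. The BarrierCoordinator class, its
request_sync method, and EpochMismatchError are hypothetical names and are not
taken from the Spark design; the actual mechanism is left to SPARK-24581.

```python
class EpochMismatchError(Exception):
    """Raised when a task's barrier epoch differs from the coordinator's."""


class BarrierCoordinator:
    """Hypothetical driver-side bookkeeping for one barrier stage."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.epoch = 0        # number of barriers completed so far
        self.arrived = set()  # task ids waiting in the current epoch

    def request_sync(self, task_id, task_epoch):
        """Register that task_id has reached its barrier call number task_epoch.

        Returns True if this arrival completes the current epoch, else False.
        """
        if task_epoch != self.epoch:
            # The task skipped or repeated a barrier() call, e.g. after
            # silently catching an exception from an earlier call.
            raise EpochMismatchError(
                f"task {task_id} is at epoch {task_epoch}, "
                f"coordinator is at epoch {self.epoch}"
            )
        self.arrived.add(task_id)
        if len(self.arrived) == self.num_tasks:
            # Everyone arrived: release the tasks and move to the next epoch.
            self.arrived.clear()
            self.epoch += 1
            return True
        return False
```

In this sketch, a task that arrives with an epoch different from the
coordinator's fails fast instead of leaving the other tasks waiting forever.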
{quote}Can you elaborate more on leveraging TaskContext.localProperties? Is
it expected to be synced after 'barrier' returns? What guarantees are we
expecting to provide?{quote}
We update the localProperties on the driver, and on the executors you should
be able to fetch the updated values through TaskContext. This should not be
coupled with the `barrier()` function.
> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Jiang Xingbo
> Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.