[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516411#comment-16516411 ]
Mridul Muralidharan edited comment on SPARK-24375 at 6/18/18 10:17 PM:
-----------------------------------------------------------------------

[~jiangxb1987] A couple of comments based on the document and your elaboration above:

* Is the 'barrier' logic pluggable, instead of only being a global sync point?
* Dynamic resource allocation (DRA) triggers allocation of additional resources based on pending tasks - hence the comment _We may add a check of total available slots before scheduling tasks from a barrier stage taskset._ does not necessarily work in that context.
* Currently DRA in Spark allocates resources uniformly - are we envisioning changes as part of this effort to allocate heterogeneous executor resources based on pending tasks (at least initially for barrier support for GPUs)?
* How is fault tolerance handled w.r.t. waiting on incorrect barriers? Is there any way to identify the barrier? Example:
{code}
try {
  ... snippet A ...
  // Barrier 1
  context.barrier()
  ... snippet B ...
} catch {
  ...
}
... snippet C ...
// Barrier 2
context.barrier()
{code}
** In the face of exceptions, some tasks will wait on barrier 2 and others on barrier 1, causing issues.
* Can you elaborate more on leveraging TaskContext.localProperties? Is it expected to be synced after 'barrier' returns? What guarantees are we expecting to provide?
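As an aside on the mismatch scenario above: one way to make the failure detectable is to tag each barrier call with an identifier, so a coordinator can fail fast when tasks arrive at different barriers rather than hanging. The sketch below is a hypothetical toy (the `IdentifiedBarrier` class and its `barrier_id` parameter are invented for illustration and are not Spark's proposed API); it is in Python only because the SPIP also covers PySpark.

```python
import threading

class BarrierMismatchError(Exception):
    pass

class IdentifiedBarrier:
    """Toy coordinator: every task in a round must arrive at the SAME
    barrier id. If one task arrives at a different id (e.g. it skipped a
    barrier after catching an exception), the round fails fast for all
    waiters instead of deadlocking."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.cond = threading.Condition()
        self.arrived = {}   # task_id -> barrier_id it is waiting on
        self.failed = False

    def barrier(self, task_id, barrier_id):
        with self.cond:
            self.arrived[task_id] = barrier_id
            if len(set(self.arrived.values())) > 1:
                # Tasks are waiting on different barriers: abort the round.
                self.failed = True
                self.cond.notify_all()
            elif len(self.arrived) == self.num_tasks:
                # Everyone reached the same barrier: release the round.
                self.arrived.clear()
                self.cond.notify_all()
                return
            else:
                while task_id in self.arrived and not self.failed:
                    self.cond.wait()
            if self.failed:
                raise BarrierMismatchError(
                    f"task {task_id} hit barrier {barrier_id}, others did not")

results = []
b = IdentifiedBarrier(num_tasks=2)

def good_task(tid):
    try:
        b.barrier(tid, barrier_id=1)   # reaches Barrier 1 as intended
        results.append((tid, "passed"))
    except BarrierMismatchError:
        results.append((tid, "mismatch"))

def faulty_task(tid):
    try:
        # Skipped Barrier 1 (exception caught), went straight to Barrier 2.
        b.barrier(tid, barrier_id=2)
        results.append((tid, "passed"))
    except BarrierMismatchError:
        results.append((tid, "mismatch"))

t1 = threading.Thread(target=good_task, args=(0,))
t2 = threading.Thread(target=faulty_task, args=(1,))
t1.start(); t2.start(); t1.join(); t2.join()
# Both tasks observe the mismatch instead of hanging forever.
```

Without the identifier, the same scenario simply deadlocks: each side waits for a quorum that can never form.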
> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
>                 Key: SPARK-24375
>                 URL: https://issues.apache.org/jira/browse/SPARK-24375
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)