Hi Tali,

Yes, I think the foreach API is currently experimental, and multi-device
support is future work. The existing implementation uses the main thread
to wait for the execution result and does not handle the case of
data-parallel training on multiple GPUs. However, if you use Gluon, you
can probably work around this problem with Python's threading module,
where each thread performs the forward pass on one GPU. I submitted a PR
for such a utility in gluon-nlp:
https://github.com/dmlc/gluon-nlp/pull/387/files
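
Roughly, the idea looks like this. This is just a sketch of the threading
workaround, not the gluon-nlp utility itself; the LSTM, shapes, and random
data are placeholders standing in for a real model whose graph contains a
foreach subgraph:

    import threading
    import mxnet as mx
    from mxnet import autograd, gluon

    ctxs = [mx.gpu(0), mx.gpu(1)]
    # Placeholder model: stands in for a network containing a foreach
    # subgraph. Parameters are replicated on both GPUs.
    net = gluon.rnn.LSTM(256)
    net.initialize(ctx=ctxs)
    loss_fn = gluon.loss.L2Loss()

    # Placeholder data: one (data, label) shard per GPU, as produced by
    # gluon.utils.split_and_load in a real training loop.
    data_shards = [mx.nd.random.uniform(shape=(10, 32, 256), ctx=c) for c in ctxs]
    label_shards = [mx.nd.random.uniform(shape=(10, 32, 256), ctx=c) for c in ctxs]

    def forward_backward(data, label, losses, i):
        # Each thread issues forward/backward for one device, so a
        # blocking Forward on one GPU does not delay issuing work to
        # the other GPUs.
        with autograd.record():
            out = net(data)
            loss = loss_fn(out, label)
        loss.backward()
        losses[i] = loss

    losses = [None] * len(ctxs)
    threads = [threading.Thread(target=forward_backward, args=(d, l, losses, i))
               for i, (d, l) in enumerate(zip(data_shards, label_shards))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # join() only means each thread has issued its device's forward and
    # backward; call losses[i].asnumpy() or mx.nd.waitall() to wait for
    # the actual computed values.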

@Da, feel free to chime in.

Best,
Haibin

On Wed, Nov 28, 2018 at 9:18 AM Taliesin Beynon
<talies...@wolfram.com.invalid> wrote:

> Hello fellow MXNetters
>
> We've seen that the subgraph execution mechanism that is used to run
> things like the foreach operator causes MXExecutorForward to block, instead
> of just issuing the ops in the normal asynchronous way (
> https://github.com/apache/incubator-mxnet/blob/212364b0cba28aeda989378f6e630f7a61749bf3/src/executor/graph_executor.cc#L1352).
> On its own this is surprising and can lead to issues if you're not
> expecting it, such as your time being spent in MXExecutorForward instead
> of in WaitAll / WaitRead. Is there a reason this process isn't just done
> on a separate thread for you automatically? Is it to ensure that
> subsequent ops on the original thread are correctly serialized with
> respect to the ops produced by the foreach?
>
> More importantly, this has the unfortunate implication that if you are
> using multi-device parallelism with foreach, by just looping over your
> executors and calling Forward on them, you will inadvertently serialize
> much of the computation: you can't call Forward on the second executor
> until Forward on the first executor has returned, and the foreach causes
> that first Forward call to block until the forward pass is (mostly) done!
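>
> For concreteness, the pattern we mean is just a loop like this (a
> sketch; exec_list stands for one bound executor per device):
>
>     for ex in exec_list:
>         ex.forward(is_train=True)  # blocks while the foreach subgraph
>                                    # runs, so device N+1 can't start
>                                    # until device N's forward pass is
>                                    # nearly done
>     for ex in exec_list:
>         ex.backward()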
>
> So it kills multi-device parallelism unless one starts making thread
> pools so that one can 'unblock' Forward (and probably the subsequent
> Backward) and run each device's Forward in a separate thread.
>
> Is this intended? Are we missing something about how you are supposed to
> use subgraphs in conjunction with multi-device parallelism? It seems like
> a weakness in the current design of subgraph execution. It also appears
> that the Python API doesn't have a strategy for dealing with this issue:
> as you can see at
> https://github.com/apache/incubator-mxnet/blob/2276bb0e30b1fe601eb288cb4f1b673484892d4b/python/mxnet/executor_manager.py#L281,
> it doesn't create separate threads or anything there.
>
> Thanks!
> Tali + Sebastian
