eric-haibin-lin commented on a change in pull request #15124: [MXNET-1294]
Priority-based parameter propagation for improved data parallel training
throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#discussion_r368813200
##########
File path: python/mxnet/kvstore/base.py
##########
@@ -452,4 +457,13 @@ def create(name='local'):
from .kvstore import KVStore
kv = KVStore(handle)
set_kvstore_handle(kv.handle)
- return kv
+
Review comment:
This means the server node must have access to the training script and be
able to run it up to the line that calls `kv.create()`. Previously, in theory,
users only needed to install mxnet on the server nodes, without worrying about
the other dependencies required to run the training script.
I wonder if we can still start the server process when importing mxnet. We
could add an env var `DMLC_PS_TYPE`, set by `launch.py`, together with an
extra `--p3` flag on `launch.py`, so users would run `launch.py --p3 -n 2
train.py`. The server process would then read `DMLC_PS_TYPE` at startup.
We could also check the value of `DMLC_PS_TYPE` on the worker side for
consistency when a p3 store is created by `kv.create()`. What do you think?
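To make the suggestion concrete, here is a rough sketch of the three pieces
(launcher, server startup, worker-side check). Only the `DMLC_PS_TYPE` name and
the `--p3` flag come from the proposal above; the values `"p3"` and
`"default"` and all function names are hypothetical, not existing mxnet or
`launch.py` APIs.

```python
import os

# DMLC_PS_TYPE is the env var proposed above. The values "p3"/"default" and
# every function below are illustrative assumptions, not existing mxnet code.
PS_TYPE_ENV = "DMLC_PS_TYPE"


def launcher_env(use_p3, base_env=None):
    """Build the environment a launcher could export when --p3 is passed."""
    env = dict(os.environ if base_env is None else base_env)
    env[PS_TYPE_ENV] = "p3" if use_p3 else "default"
    return env


def server_type(env=None):
    """What a server process could read at import time to pick its type."""
    env = os.environ if env is None else env
    return env.get(PS_TYPE_ENV, "default")


def check_worker_consistency(wants_p3, env=None):
    """Fail early if the worker asks for a p3 store but the job was not
    launched with the matching server type (or the other way around)."""
    launched_p3 = server_type(env) == "p3"
    if wants_p3 != launched_p3:
        raise RuntimeError(
            "kvstore type requested by the worker does not match %s=%r"
            % (PS_TYPE_ENV, server_type(env))
        )
```

With something like this, the server only needs mxnet installed and the env
var set, and a mismatch between the launch flag and the store type requested
by the worker is reported as soon as the store is created.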