eric-haibin-lin commented on a change in pull request #15124: [MXNET-1294]
Priority-based parameter propagation for improved data parallel training
throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#discussion_r291460950
##########
File path: src/kvstore/kvstore_dist.h
##########
@@ -412,17 +439,38 @@ class KVStoreDist : public KVStoreLocal {
   void PushDefault(int key, const NDArray &send_buf, const PSKV& pskv, int priority) {
     auto push_to_servers =
-        [this, key, pskv, send_buf](RunContext rctx, Engine::CallbackOnComplete cb) {
+        [this, key, pskv, send_buf, priority](RunContext rctx, Engine::CallbackOnComplete cb) {
           const int dtype = send_buf.dtype();
           // convert to ps keys
           const size_t size = send_buf.shape().Size() * mshadow::mshadow_sizeof(dtype);
           char* data = static_cast<char *>(send_buf.data().dptr_);
           // do push. false means no delete
           ps::SArray<char> vals(data, size, false);
           int cmd = GetCommandType(RequestType::kDefaultPushPull, dtype);
-          CHECK_NOTNULL(ps_worker_)->ZPush(
-              pskv.keys, vals, pskv.lens,
-              cmd, [cb]() { cb(); });
+
+          auto *counter = new std::atomic<int>(pskv.keys.size());
+          int len = 0;
+          for (size_t i = 0; i < pskv.keys.size(); i++) {
+            auto ks = new ps::SArray<ps::Key>(std::move(pskv.keys.segment(i, i+1)));
+            auto vs = new ps::SArray<char>(std::move(vals.segment(len, len+pskv.lens[i])));
+            auto ls = new ps::SArray<int>(std::move(pskv.lens.segment(i, i+1)));
+            CHECK_NOTNULL(ps_worker_)->ZPush(*ks, *vs, *ls,
+                static_cast<int>(RequestType::kDefaultPushPull),
Review comment:
One way is to introduce a kv.pushpull API that does these two things in one
shot, although we are trying to avoid adding new APIs to keep the interface
simple and clean.
Another potential solution is to leverage the priority in the Engine, instead
of hard-coding ZPull in the callback function. The default engine in MXNet can
put a task onto a priority queue with the kCPUPrioritized property:
https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine_perdevice.cc#L106-L108
What we need is to schedule/invoke the ZPull for a chunk of the parameter as
soon as its ZPush is done. If the engine prioritizes the pull operations of
shallower layers (layers close to the input data) over the push operations of
deeper layers (layers far from the input data), then we can use Engine::Push
for `ZPull` and `ZPush` separately.
To schedule small NDArray chunks instead of one huge NDArray, the send/receive
buffers have to be divided into multiple small NDArrays. The only concern is
that maintaining lots of small NDArrays **may** cause some overhead. For
example, an embedding layer of size 30K * 1K may lead to thousands of chunks.
@junrushao1994 do you have any benchmarks showing the overhead of
`Engine::Push`, especially in the case of pushing a thousand operations? How
much overhead would that be?