eric-haibin-lin commented on a change in pull request #15124: [MXNET-1294]
Priority-based parameter propagation for improved data parallel training
throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#discussion_r291460950
##########
File path: src/kvstore/kvstore_dist.h
##########
@@ -412,17 +439,38 @@ class KVStoreDist : public KVStoreLocal {
   void PushDefault(int key, const NDArray &send_buf, const PSKV& pskv, int priority) {
     auto push_to_servers =
-        [this, key, pskv, send_buf](RunContext rctx, Engine::CallbackOnComplete cb) {
+        [this, key, pskv, send_buf, priority](RunContext rctx, Engine::CallbackOnComplete cb) {
           const int dtype = send_buf.dtype();
           // convert to ps keys
           const size_t size = send_buf.shape().Size() * mshadow::mshadow_sizeof(dtype);
           char* data = static_cast<char *>(send_buf.data().dptr_);
           // do push. false means no delete
           ps::SArray<char> vals(data, size, false);
           int cmd = GetCommandType(RequestType::kDefaultPushPull, dtype);
-          CHECK_NOTNULL(ps_worker_)->ZPush(
-              pskv.keys, vals, pskv.lens,
-              cmd, [cb]() { cb(); });
+
+          auto *counter = new std::atomic<int>(pskv.keys.size());
+          int len = 0;
+          for (size_t i = 0; i < pskv.keys.size(); i++) {
+            auto ks = new ps::SArray<ps::Key>(std::move(pskv.keys.segment(i, i+1)));
+            auto vs = new ps::SArray<char>(std::move(vals.segment(len, len+pskv.lens[i])));
+            auto ls = new ps::SArray<int>(std::move(pskv.lens.segment(i, i+1)));
+            CHECK_NOTNULL(ps_worker_)->ZPush(*ks, *vs, *ls,
+                static_cast<int>(RequestType::kDefaultPushPull),
Review comment:
One way is to introduce a kv.pushpull API that does these two things in one
shot, although we are trying to avoid adding new APIs to keep the interface
simple and clean.
Another potential solution is to leverage the priority in the Engine, instead
of hard-coding ZPull in the callback function. The default engine in MXNet can
put a task onto a priority queue with the kCPUPrioritized property:
https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine_perdevice.cc#L106-L108
What we need is to schedule/invoke the ZPull for a chunk of the parameter as
soon as its ZPush is done. If the engine prioritizes the pull operations of
shallower layers (layers close to the input data) over the push operations of
deeper layers (layers far from the input data), then we can use Engine::Push
for `ZPull` and `ZPush` separately.
To schedule small NDArray chunks instead of one huge NDArray, the send/receive
buffers have to be divided into multiple small NDArrays. The only concern is
that maintaining lots of small NDArrays **may** cause some overhead. For
example, an embedding layer of size 30K * 1K may lead to thousands of chunks.
@junrushao1994 do you have any benchmarks showing the overhead of
`Engine::Push`, especially in the case of pushing a thousand operations? How
much overhead would that be?