anandj91 commented on issue #15124: [MXNET-1294] Priority-based parameter 
propagation for improved data parallel training throughput
URL: https://github.com/apache/incubator-mxnet/pull/15124#issuecomment-542440290
 
 
   I'm facing some design-level challenges in properly implementing priority-based 
update (P3) on top of the PushPull API. MXNet does simple load balancing before 
pushing or pulling key-values by splitting each NDArray equally across the 
parameter servers. P3 requires a round-robin style parameter distribution, which 
means slicing a large NDArray into thousands of smaller slices. This is much more 
granular than the current default distribution strategy, and each PS would get 
more than one slice.
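
   To make the difference between the two distribution strategies concrete, here 
is a minimal sketch in plain Python. It does not use MXNet or ps-lite; the 
function names `equal_split` and `round_robin_split` and the `slice_size` 
parameter are hypothetical and chosen only for illustration. Each tuple is 
(server, start, end) over the flattened array.

```python
def equal_split(num_elements, num_servers):
    """Default-style split: one contiguous range per server (illustrative only)."""
    base, extra = divmod(num_elements, num_servers)
    ranges, start = [], 0
    for server in range(num_servers):
        # Spread any remainder over the first `extra` servers.
        size = base + (1 if server < extra else 0)
        ranges.append((server, start, start + size))
        start += size
    return ranges


def round_robin_split(num_elements, num_servers, slice_size):
    """RR-style split: fixed-size slices dealt to servers in turn (illustrative only)."""
    assignments = []
    for index, start in enumerate(range(0, num_elements, slice_size)):
        end = min(start + slice_size, num_elements)
        # Slice i goes to server i mod num_servers, so a server can own many slices.
        assignments.append((index % num_servers, start, end))
    return assignments


if __name__ == "__main__":
    print(equal_split(10, 3))
    print(round_robin_split(10, 3, 2))
```

   With 10 elements and 3 servers, the equal split yields three ranges, while the 
round-robin split with a slice size of 2 yields five slices, so servers 0 and 1 
each own two non-adjacent slices.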
   
   With the way MXNet and ps-lite are designed right now, ps-lite assumes that a 
single ZPush/ZPull/ZPushPull belongs to a single layer/NDArray. It also assumes 
that one slice belongs to only one PS. These assumptions need to be broken to 
implement P3. What I have done so far is add a round-robin (RR) distribution 
strategy alongside the default one, with a boolean flag to switch between the 
two. When the user chooses RR, KVStore treats each slice as a separate 
key-value pair. Otherwise it falls back to the default mode.
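
   The flag-based switch described above could look roughly like the following 
sketch. It is plain Python and only models the idea; `plan_push`, 
`use_round_robin`, `slice_size`, and the `(key, slice_index)` sub-key scheme are 
all assumptions for illustration, not the names or key layout used in the PR, 
KVStore, or ps-lite.

```python
def plan_push(key, values, num_servers, use_round_robin=False, slice_size=2):
    """Return (key, server, chunk) tuples describing what would be pushed.

    Hypothetical helper: it does not call MXNet or ps-lite.
    """
    n = len(values)
    plan = []
    if use_round_robin:
        # RR mode: every slice becomes its own key-value pair,
        # identified here by a (key, slice_index) sub-key.
        for i, start in enumerate(range(0, n, slice_size)):
            chunk = values[start:start + slice_size]
            plan.append(((key, i), i % num_servers, chunk))
    else:
        # Default mode: a single key, one contiguous chunk per server.
        base, extra = divmod(n, num_servers)
        start = 0
        for server in range(num_servers):
            size = base + (1 if server < extra else 0)
            plan.append((key, server, values[start:start + size]))
            start += size
    return plan


if __name__ == "__main__":
    weights = [1, 2, 3, 4, 5, 6]
    print(plan_push("w", weights, num_servers=2))
    print(plan_push("w", weights, num_servers=2, use_round_robin=True))
```

   In the default branch the original key is preserved and each server sees one 
chunk. In the RR branch the same array turns into several independent pairs, 
which is what allows one server to hold more than one slice of a layer.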

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
