solin319 commented on issue #7893: Add barriers in kvstore init
URL: https://github.com/apache/incubator-mxnet/pull/7893#issuecomment-330719121
 
 
   1. I am trying to solve the problem with distributed training in fp16 (#7554).
   In the function Push_, the data is converted to real_t before it is pushed to
   ps-lite (lines 192 and 194). But the static_cast there cannot convert fp16 to
   fp32 correctly, so it may cause a memory access error (see the standalone
   sketch at the end of this comment).
   2. In my test, I use CopyFromTo() to convert fp16 to fp32 before send_buf is
   pushed to ps-lite, and to convert fp32 back to fp16 after recv_buf is received
   from ps-lite.
   
   The changed code is shown below.
   
   ```
   void Push_(...) {
     ...
     if (merged.ctx().dev_mask() == cpu::kDevMask) {
       // make sure the previous push/pull is completed
       send_buf.WaitToWrite();
       // allocate a real_t (fp32) send buffer on first use
       if (send_buf.is_none()) {
         if (storage_type == kDefaultStorage) {
           send_buf = NDArray(merged.shape(), pinned_ctx_, true,
                              mshadow::DataType<real_t>::kFlag);
         } else {
           send_buf = NDArray(storage_type, merged.shape(), pinned_ctx_, true,
                              mshadow::DataType<real_t>::kFlag);
         }
       }
       // copy into the fp32 send buffer, converting the dtype if needed
       CopyFromTo(merged, &send_buf);
       // send_buf = merged;  // original code: avoid memory copy
     } else {
       if (send_buf.is_none()) {
         if (storage_type == kDefaultStorage) {
           send_buf = NDArray(merged.shape(), pinned_ctx_, true,
                              mshadow::DataType<real_t>::kFlag);
         } else {
           send_buf = NDArray(storage_type, merged.shape(), pinned_ctx_, true,
                              mshadow::DataType<real_t>::kFlag);
         }
       }
       if (merged.dtype() == mshadow::DataType<real_t>::kFlag) {
         CopyFromTo(merged, &send_buf);
       } else {
         // CopyFromTo() can't convert fp16 to fp32 across devices in one call,
         // so convert in two steps: copy to a pinned buffer with the same dtype
         // first, then convert to fp32 on the CPU.
         NDArray tmp = NDArray(merged.shape(), pinned_ctx_, true, merged.dtype());
         CopyFromTo(merged, &tmp);
         CopyFromTo(tmp, &send_buf);
       }
       // CopyFromTo(merged, &send_buf);  // original single-step copy
     }
     // push to servers
     if (storage_type == kDefaultStorage) {
       auto push_to_servers =
     ...
   }
   ```
   
   ```
   void PullImpl(...) {
     ...
     CHECK_NOTNULL(Engine::Get())->PushAsync(
         pull_from_servers,
         pinned_ctx_,
         {},
         {recv_buf.var()},
         FnProperty::kNormal,
         priority,
         PROFILER_MESSAGE("KVStoreDistDefaultPull"));
     if (grouped_vals[i][0]->dtype() != mshadow::DataType<real_t>::kFlag) {
       // convert the fp32 recv buffer back to the requested dtype (e.g. fp16)
       // before broadcasting it to the output arrays
       NDArray tmp = NDArray(grouped_vals[i][0]->shape(), pinned_ctx_, true,
                             grouped_vals[i][0]->dtype());
       CopyFromTo(recv_buf, &tmp, 0);
       comm_->Broadcast(key, tmp, grouped_vals[i], priority);
     } else {
       comm_->Broadcast(key, recv_buf, grouped_vals[i], priority);
     }
     ...
   }
   ```
   
   3. This solves the problem in #7554, and it also makes the init problem
   mentioned above show up more clearly.
   4. With this change, I can run distributed training with fp16 correctly. Does
   MXNet need this feature?
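
   For illustration, here is a minimal standalone sketch of the dtype issue in
   point 1. It is plain C++, not MXNet code; the buffer contents and the
   half_to_float helper are made up for the example. Reinterpreting an fp16
   buffer as fp32 does not convert the values and would read past the end of the
   buffer, while an element-wise conversion gives the expected result.

   ```
   #include <cstdint>
   #include <cstdio>
   #include <cstring>
   #include <vector>

   // Hypothetical standalone example, not MXNet code.
   int main() {
     // Four fp16 values of 1.0, stored as raw 16-bit patterns (1.0 == 0x3C00).
     std::vector<uint16_t> fp16_buf = {0x3C00, 0x3C00, 0x3C00, 0x3C00};

     // A raw pointer cast does NOT convert the values: it reinterprets the same
     // bytes as fp32, and because fp32 is twice as wide, reading
     // fp16_buf.size() floats would walk past the end of the buffer.
     float reinterpreted;
     std::memcpy(&reinterpreted, fp16_buf.data(), sizeof(float));
     std::printf("reinterpreted bits: %g (expected 1.0)\n", reinterpreted);

     // An element-wise conversion gives the expected result. This decoder only
     // handles normal numbers; subnormals/inf/NaN are omitted for brevity.
     auto half_to_float = [](uint16_t h) -> float {
       uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
       uint32_t exp  = (h >> 10) & 0x1Fu;
       uint32_t man  = h & 0x3FFu;
       uint32_t bits = sign | ((exp + 112u) << 23) | (man << 13);
       float f;
       std::memcpy(&f, &bits, sizeof(f));
       return f;
     };
     std::printf("converted element-wise: %g\n", half_to_float(fp16_buf[0]));
     return 0;
   }
   ```

   This element-wise conversion is what the change above relies on, by letting
   CopyFromTo() copy between NDArrays of different dtypes instead of using a raw
   cast on the buffer.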
 