solin319 commented on issue #7893: Add barriers in kvstore init
URL: https://github.com/apache/incubator-mxnet/pull/7893#issuecomment-330719121

1. I am trying to solve the problem with fp16 in distributed training (#7554). In the function `Push_`, the data is converted to `real_t` before being pushed to ps-lite (lines 192 and 194). But `static_cast` cannot convert fp16 to fp32 correctly there, and it may cause a memory access error.

2. In my test, I use `CopyFromTo()` to convert fp16 to fp32 before `send_buf` is pushed to ps-lite, and to convert fp32 back to fp16 after `recv_buf` is received from ps-lite. The code is as below.

```
void Push_(...) {
  ...
  if (merged.ctx().dev_mask() == cpu::kDevMask) {
    // make sure the previous push/pull is completed
    send_buf.WaitToWrite();
    if (send_buf.is_none()) {
      if (storage_type == kDefaultStorage) {
        send_buf = NDArray(merged.shape(), pinned_ctx_, true,
                           mshadow::DataType<real_t>::kFlag);
      } else {
        send_buf = NDArray(storage_type, merged.shape(), pinned_ctx_, true,
                           mshadow::DataType<real_t>::kFlag);
      }
    }
    // copy and convert; `send_buf = merged;` would avoid the memory copy
    // but would keep the fp16 dtype
    CopyFromTo(merged, &send_buf);
  } else {
    if (send_buf.is_none()) {
      if (storage_type == kDefaultStorage) {
        send_buf = NDArray(merged.shape(), pinned_ctx_, true,
                           mshadow::DataType<real_t>::kFlag);
      } else {
        send_buf = NDArray(storage_type, merged.shape(), pinned_ctx_, true,
                           mshadow::DataType<real_t>::kFlag);
      }
    }
    if (merged.dtype() == mshadow::DataType<real_t>::kFlag) {
      CopyFromTo(merged, &send_buf);
    } else {
      // CopyFromTo() cannot convert fp16 to fp32 across devices in one step,
      // so copy to a pinned buffer of the original dtype first, then convert.
      NDArray tmp = NDArray(merged.shape(), pinned_ctx_, true, merged.dtype());
      CopyFromTo(merged, &tmp);
      CopyFromTo(tmp, &send_buf);
    }
  }
  // push to servers
  if (storage_type == kDefaultStorage) {
    auto push_to_servers = ...
}
```

```
void PullImpl(...) {
  ...
  CHECK_NOTNULL(Engine::Get())->PushAsync(
      pull_from_servers, pinned_ctx_, {}, {recv_buf.var()},
      FnProperty::kNormal, priority,
      PROFILER_MESSAGE("KVStoreDistDefaultPull"));
  if (grouped_vals[i][0]->dtype() != mshadow::DataType<real_t>::kFlag) {
    // convert fp32 back to the original dtype before broadcasting
    NDArray tmp = NDArray(grouped_vals[i][0]->shape(), pinned_ctx_, true,
                          grouped_vals[i][0]->dtype());
    CopyFromTo(recv_buf, &tmp, 0);
    comm_->Broadcast(key, tmp, grouped_vals[i], priority);
  } else {
    comm_->Broadcast(key, recv_buf, grouped_vals[i], priority);
  }
  ...
}
```

3. This solves the problem in #7554, and it makes the init problem mentioned before show up more clearly.

4. With this change, I can run distributed training with fp16 correctly. Does MXNet need this feature?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards, Apache Git Services