[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-09 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r180296527
 
 

 ##
 File path: amalgamation/Makefile
 ##
 @@ -23,6 +23,11 @@ ifndef OPENBLAS_ROOT
 export OPENBLAS_ROOT=/usr/local/opt/openblas
 endif
 
+# use F16C if the architecture supports it, turned off by default
+ifndef USE_F16C
 
 Review comment:
   what does 16C stand for?
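  For context: F16C is the x86 instruction-set extension for hardware conversion between single precision (float32) and half precision (float16). A minimal standalone sketch of what it exposes, assuming a CPU and compiler with F16C support (build with `-mf16c`; the sample value is arbitrary):

```cpp
#include <immintrin.h>  // F16C intrinsics: _cvtss_sh / _cvtsh_ss
#include <cstdint>
#include <cstdio>

int main() {
  float value = 3.14159f;
  // float32 -> float16 (packed into a 16-bit integer), then back to float32.
  uint16_t half = _cvtss_sh(value, _MM_FROUND_TO_NEAREST_INT);
  float roundtrip = _cvtsh_ss(half);
  std::printf("%f -> 0x%04x -> %f\n", value, static_cast<unsigned>(half), roundtrip);
  return 0;
}
```

  The Makefile flag above only controls whether such hardware conversions are compiled in; with the flag off, float16 conversion presumably falls back to a software path.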


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-02 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r178665723
 
 

 ##
 File path: src/kvstore/kvstore_dist_server.h
 ##
 @@ -170,43 +216,90 @@ class KVStoreDistServer {
     app->Response(recved);
   }
 
+  /*
+   * For keys already initialized, if necessary create stored_realt.
+   * This will only be used if, by some wrong usage of kvstore,
+   * some keys are initialized before the optimizer is set.
+   */
+  void CreateMultiPrecisionCopies() {
+    for (auto const& stored_entry : store_) {
+      const int key = stored_entry.first;
+      const NDArray& stored = stored_entry.second;
+      if (stored.dtype() != mshadow::kFloat32) {
+        auto& stored_realt = store_realt_[key];
+        if (stored.storage_type() == kRowSparseStorage) {
+          stored_realt = NDArray(kRowSparseStorage, stored.shape(), stored.ctx(),
+                                  true, mshadow::kFloat32);
+        } else {
+          stored_realt = NDArray(stored.shape(), stored.ctx(), false, mshadow::kFloat32);
+        }
+
+        auto& update = update_buf_[key];
+        if (!update.merged.is_none()) {
+          update.merged = NDArray(update.merged.shape(), update.merged.ctx(), false,
+                                  mshadow::kFloat32);
+        }
+        CHECK(update.request.size() == 0)
+          << ps::MyRank() << "Multiprecision mode can not be set while pushes are underway."
+          << "Please set optimizer before pushing keys." << key << " " << update.request.size();
+
+        CopyFromTo(stored, stored_realt);
+      }
+    }
+    for (auto const& stored_realt_entry : store_realt_) {
+      stored_realt_entry.second.WaitToRead();
+    }
+  }
+
   void DataHandleEx(const ps::KVMeta& req_meta,
-                    const ps::KVPairs& req_data,
-                    ps::KVServer* server) {
-    DataHandleType recved_type = static_cast(req_meta.cmd);
-    if (recved_type == DataHandleType::kRowSparsePushPull) {
-      DataHandleRowSparse(req_meta, req_data, server);
-    } else if (recved_type == DataHandleType::kCompressedPushPull) {
-      DataHandleCompressed(req_meta, req_data, server);
-    } else {
-      DataHandleDefault(req_meta, req_data, server);
+                    const ps::KVPairs& req_data,
+                    ps::KVServer* server) {
+    DataHandleType type = DepairDataHandleType(req_meta.cmd);
+    switch (type.requestType) {
+      case RequestType::kRowSparsePushPull:
+        DataHandleRowSparse(type, req_meta, req_data, server);
+        break;
+      case RequestType::kCompressedPushPull:
+        DataHandleCompressed(type, req_meta, req_data, server);
+        break;
+      case RequestType::kDefaultPushPull:
+        DataHandleDefault(type, req_meta, req_data, server);
+        break;
     }
-    return;
   }
 
-  inline void ApplyUpdates(const int key, MergeBuf *merged, NDArray *stored,
-                           ps::KVServer* server) {
-    if (merged->request.size() == (size_t) ps::NumWorkers()) {
+  inline bool has_multi_precision_copy(const DataHandleType type) {
+    return multi_precision_ && type.dtype != mshadow::kFloat32;
+  }
+
+  inline void ApplyUpdates(const DataHandleType type, const int key,
+                           UpdateBuf *updateBuf, ps::KVServer* server) {
+    if (!sync_mode_ || updateBuf->request.size() == (size_t) ps::NumWorkers()) {
 
 Review comment:
  nit: `updateBuf` -> `update_buf`?
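  For readers following the dispatch change above: `DepairDataHandleType(req_meta.cmd)` recovers both the request type and the data type from the single integer command a worker sends, which is what lets the server route float16 and float32 keys through the same handlers. Below is a minimal sketch of one way such an encode/decode pair can work; the names `PairDataHandleType` and `kMaxNumDTypes` and the exact packing scheme are assumptions for illustration, not necessarily what the PR implements:

```cpp
#include <cassert>

enum class RequestType { kDefaultPushPull = 0, kRowSparsePushPull, kCompressedPushPull };

struct DataHandleType {
  RequestType requestType;
  int dtype;  // e.g. an mshadow type flag such as kFloat16 or kFloat32
};

constexpr int kMaxNumDTypes = 16;  // upper bound on distinct dtype flags (assumed)

// Pack (request type, dtype) into the single command integer sent with a request.
inline int PairDataHandleType(RequestType req, int dtype) {
  return static_cast<int>(req) * kMaxNumDTypes + dtype;
}

// Recover both fields on the server side.
inline DataHandleType DepairDataHandleType(int cmd) {
  return {static_cast<RequestType>(cmd / kMaxNumDTypes), cmd % kMaxNumDTypes};
}

int main() {
  const int cmd = PairDataHandleType(RequestType::kRowSparsePushPull, 2 /* say, kFloat16 */);
  const DataHandleType type = DepairDataHandleType(cmd);
  assert(type.requestType == RequestType::kRowSparsePushPull && type.dtype == 2);
  return 0;
}
```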




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-02 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r178665130
 
 

 ##
 File path: src/kvstore/kvstore_dist_server.h
 ##
 @@ -170,43 +216,90 @@ class KVStoreDistServer {
     app->Response(recved);
   }
 
+  /*
+   * For keys already initialized, if necessary create stored_realt.
+   * This will only be used if, by some wrong usage of kvstore,
+   * some keys are initialized before the optimizer is set.
+   */
+  void CreateMultiPrecisionCopies() {
+    for (auto const& stored_entry : store_) {
+      const int key = stored_entry.first;
+      const NDArray& stored = stored_entry.second;
+      if (stored.dtype() != mshadow::kFloat32) {
+        auto& stored_realt = store_realt_[key];
+        if (stored.storage_type() == kRowSparseStorage) {
+          stored_realt = NDArray(kRowSparseStorage, stored.shape(), stored.ctx(),
+                                  true, mshadow::kFloat32);
 
 Review comment:
   nit: alignment 




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-02 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r178665347
 
 

 ##
 File path: src/kvstore/kvstore_dist_server.h
 ##
 @@ -170,43 +216,90 @@ class KVStoreDistServer {
     app->Response(recved);
   }
 
+  /*
+   * For keys already initialized, if necessary create stored_realt.
+   * This will only be used if, by some wrong usage of kvstore,
+   * some keys are initialized before the optimizer is set.
+   */
+  void CreateMultiPrecisionCopies() {
+    for (auto const& stored_entry : store_) {
+      const int key = stored_entry.first;
+      const NDArray& stored = stored_entry.second;
+      if (stored.dtype() != mshadow::kFloat32) {
+        auto& stored_realt = store_realt_[key];
+        if (stored.storage_type() == kRowSparseStorage) {
+          stored_realt = NDArray(kRowSparseStorage, stored.shape(), stored.ctx(),
+                                  true, mshadow::kFloat32);
+        } else {
+          stored_realt = NDArray(stored.shape(), stored.ctx(), false, mshadow::kFloat32);
+        }
+
+        auto& update = update_buf_[key];
+        if (!update.merged.is_none()) {
+          update.merged = NDArray(update.merged.shape(), update.merged.ctx(), false,
 Review comment:
   could `update.merged` also be sparse?
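  To make the question concrete: if `update.merged` can itself be row-sparse, re-creating it with the dense constructor would silently change its storage type when switching to float32. A hedged fragment (not standalone code) showing what a storage-type-preserving re-allocation might look like, reusing only the constructors already visible in the diff above; this is illustrative, not the PR's actual fix:

```cpp
// Illustrative sketch: preserve merged's storage type when moving it to float32.
if (!update.merged.is_none()) {
  if (update.merged.storage_type() == kRowSparseStorage) {
    update.merged = NDArray(kRowSparseStorage, update.merged.shape(), update.merged.ctx(),
                            true, mshadow::kFloat32);
  } else {
    update.merged = NDArray(update.merged.shape(), update.merged.ctx(), false,
                            mshadow::kFloat32);
  }
}
```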




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-02 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r178666240
 
 

 ##
 File path: src/kvstore/kvstore_dist_server.h
 ##
 @@ -220,175 +313,229 @@ class KVStoreDistServer {
     }
   }
 
-  void DataHandleRowSparse(const ps::KVMeta& req_meta,
-                           const ps::KVPairs& req_data,
-                           ps::KVServer* server) {
+  void AccumulateRowSparseGrads(const DataHandleType type,
+                                const NDArray& recved,
+                                UpdateBuf* updateBuf) {
+    NDArray out(kRowSparseStorage, updateBuf->merged.shape(), Context(), true,
+                has_multi_precision_copy(type) ? mshadow::kFloat32 : type.dtype);
+    if (has_multi_precision_copy(type)) CopyFromTo(recved, updateBuf->temp_array);
+    const NDArray& to_merge = has_multi_precision_copy(type) ? updateBuf->temp_array : recved;
+    // accumulate row_sparse gradients
+    // TODO(haibin) override + operator for row_sparse NDArray
+    // instead of calling BinaryComputeRspRsp directly
+    using namespace mshadow;
+    Engine::Get()->PushAsync(
+      [to_merge, updateBuf, out](RunContext ctx, Engine::CallbackOnComplete on_complete) {
+        op::ElemwiseBinaryOp::ComputeEx(
+            {}, {}, {to_merge, updateBuf->merged}, {kWriteTo}, {out});
+        on_complete();
+      }, to_merge.ctx(), {to_merge.var(), updateBuf->merged.var()}, {out.var()},
+      FnProperty::kNormal, 0, PROFILER_MESSAGE_FUNCNAME);
+    CopyFromTo(out, &(updateBuf->merged), 0);
+    updateBuf->merged.WaitToRead();
+  }
+
+  void RowSparsePullResponse(const DataHandleType type,
+                             const int master_key,
+                             const size_t num_rows,
+                             const ps::KVMeta& req_meta,
+                             const ps::KVPairs& req_data,
+                             ps::KVServer* server) {
+    if (log_verbose_) LOG(INFO) << "pull: " << master_key;
+    ps::KVPairs response;
+    if (num_rows == 0) {
+      std::vector lens(req_data.keys.size(), 0);
+      response.keys = req_data.keys;
+      response.lens.CopyFrom(lens.begin(), lens.end());
+      server->Response(req_meta, response);
+      return;
+    }
+    const NDArray& stored = store_[master_key];
+    if (has_multi_precision_copy(type)) stored.WaitToRead();
 
 Review comment:
  Why do we need to wait on `stored`?
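  Background that the question touches on: `Engine::PushAsync` only schedules work, with ordering enforced through the variables listed as read and write dependencies, while `WaitToRead()` blocks the calling thread until all pending writes to that array are done. An annotated restatement of the call pattern from the diff above (a fragment against MXNet's engine API, not standalone code):

```cpp
// Schedule the row-sparse accumulation asynchronously on the dependency engine.
Engine::Get()->PushAsync(
    [to_merge, updateBuf, out](RunContext ctx, Engine::CallbackOnComplete on_complete) {
      // ... element-wise addition of to_merge and updateBuf->merged into out ...
      on_complete();                               // tell the engine this op finished
    },
    to_merge.ctx(),                                // execution context
    {to_merge.var(), updateBuf->merged.var()},     // read dependencies (const vars)
    {out.var()},                                   // write dependency (mutable var)
    FnProperty::kNormal, 0, PROFILER_MESSAGE_FUNCNAME);

// Block until every operation writing `merged` has completed, so that later
// synchronous reads of the buffer observe the accumulated values.
updateBuf->merged.WaitToRead();
```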




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-04-02 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r178646489
 
 

 ##
 File path: python/mxnet/kvstore.py
 ##
 @@ -474,6 +474,8 @@ def set_optimizer(self, optimizer):
         except:
             raise
         self._send_command_to_servers(0, optim_str)
+        if optimizer.multi_precision:
+            self._send_command_to_servers(1, '')
 
 Review comment:
  Can we use meaningful named constants instead of magic numbers?
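  To illustrate the suggestion: the `0` and `1` passed to `_send_command_to_servers` are command heads that the server dispatches on, so both sides could refer to shared named constants instead of literals. A small C++ sketch of what server-side constants might look like; the enum and value names are assumptions for illustration, not the PR's actual definitions (only the meanings of `0` and `1` are taken from the Python diff above):

```cpp
// Hypothetical command identifiers shared by kvstore.py and the dist server.
enum class CommandType : int {
  kController = 0,         // head 0: deliver the serialized optimizer string
  kSetMultiPrecision = 1,  // head 1: ask servers to keep float32 master copies
};

// Server-side dispatch then documents itself, e.g.:
//   switch (static_cast<CommandType>(recved.head)) {
//     case CommandType::kController:        /* set optimizer */        break;
//     case CommandType::kSetMultiPrecision: /* enable fp32 copies */   break;
//   }
```

  The Python side could then mirror the same names rather than passing raw integers.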




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r176900420
 
 

 ##
 File path: src/kvstore/kvstore_dist.h
 ##
 @@ -506,41 +514,46 @@ class KVStoreDist : public KVStoreLocal {
   /**
    * \brief convert to keys in ps
    */
-  inline PSKV& EncodeDefaultKey(int key, size_t size, bool is_push) {
+  inline PSKV& EncodeDefaultKey(const int key, const size_t num_arr_elems,
 
 Review comment:
   Do you mind completing the param documentation for this function?
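  A hedged sketch of what the requested documentation could look like, with meanings inferred from the visible parameter names; the quoted diff truncates the new signature, so only the visible parameters are covered and the wording is illustrative rather than authoritative:

```cpp
  /**
   * \brief Convert a kvstore key into the ps-lite key layout used by the servers.
   * \param key the kvstore key of the array being pushed or pulled
   * \param num_arr_elems number of elements in that array (an element count, not bytes)
   *        (remaining parameters omitted here because the quoted diff truncates the signature)
   * \return the cached PSKV entry holding the ps-lite keys and per-server lengths
   */
```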




[GitHub] eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
eric-haibin-lin commented on a change in pull request #10183: [MXNET-120] 
Float16 support for distributed training
URL: https://github.com/apache/incubator-mxnet/pull/10183#discussion_r176900404
 
 

 ##
 File path: src/kvstore/kvstore_dist.h
 ##
 @@ -228,17 +228,18 @@ class KVStoreDist : public KVStoreLocal {
           RunContext rctx, Engine::CallbackOnComplete cb) {
         // convert to ps keys
         size_t size = recv_buf.shape().Size();
-
+        int dtype = recv_buf.dtype();
+        int elem_size = mshadow::mshadow_sizeof(dtype);
 
 Review comment:
   would `num_bytes` be a better name?
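  To make the rename concrete: the value is the width of one element in bytes, which matters in this PR because with float16 buffers the byte length sent to ps-lite no longer equals the element count. A tiny illustrative fragment (the `total_bytes` line is an assumed usage, not the PR's exact code):

```cpp
// Width of one element in bytes: 2 for float16, 4 for float32, 8 for float64.
const int num_bytes = mshadow::mshadow_sizeof(dtype);
// Convert the element count computed above into the byte count ps-lite transfers.
const size_t total_bytes = size * num_bytes;
```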

