[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-29 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164556307
 
 

 ##########
 File path: python/mxnet/optimizer.py
 ##########
 @@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision : bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy : string ('linear', 'power2', 'sqrt', 'lars'), default: 'linear'
+    warmup_epochs : unsigned, default: 5
+    batch_scale : unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch : unsigned, default: 32 (the default might not reflect the true number of batches per epoch; used for warmup)
+    begin_epoch : unsigned, default: 0, starting epoch.
 
 Review comment:
   @ashokei please add more details describing the strategy.
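  For context, a minimal sketch of what the listed warmup strategies could compute, assuming each ramps the learning-rate multiplier from 1.0 up to batch_scale over the warmup window (the helper name and the exact 'sqrt'/'lars' handling are illustrative assumptions, not the PR's code):

    import math

    def warmup_multiplier(nup, nwup, maxmult, strategy):
        # nup: updates done so far, nwup: total warmup updates,
        # maxmult: target multiplier (batch_scale) reached after warmup.
        if nup >= nwup:
            return maxmult
        if nwup <= 1:
            return 1.0
        frac = float(nup) / float(nwup)
        if strategy == 'linear':
            return 1.0 + (maxmult - 1.0) * frac
        if strategy == 'power2':
            return 1.0 + (maxmult - 1.0) * frac ** 2
        if strategy == 'sqrt':
            return 1.0 + (maxmult - 1.0) * math.sqrt(frac)
        return 1.0  # 'lars' rescales per layer inside the update instead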




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-29 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164555827
 
 

 ##########
 File path: python/mxnet/optimizer.py
 ##########
 @@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision : bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy : string ('linear', 'power2', 'sqrt', 'lars'), default: 'linear'
+    warmup_epochs : unsigned, default: 5
+    batch_scale : unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch : unsigned, default: 32 (the default might not reflect the true number of batches per epoch; used for warmup)
 
 Review comment:
  I guess it requires the epoch number at which to stop warming up, which does not depend on the number of updates.
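  A minimal sketch of the coupling being pointed out, assuming the warmup end is derived as warmup_epochs * updates_per_epoch as in the diff above (the helper name is hypothetical):

    def warmup_finished(num_update, warmup_epochs, updates_per_epoch):
        # The stopping point is really an epoch count; updates_per_epoch is
        # only used to approximate that epoch boundary in units of updates.
        return num_update >= warmup_epochs * updates_per_epoch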




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-29 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164555449
 
 

 ##########
 File path: python/mxnet/optimizer.py
 ##########
 @@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision : bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy : string ('linear', 'power2', 'sqrt', 'lars'), default: 'linear'
+    warmup_epochs : unsigned, default: 5
+    batch_scale : unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch : unsigned, default: 32 (the default might not reflect the true number of batches per epoch; used for warmup)
+    begin_epoch : unsigned, default: 0, starting epoch.
+    """
+
+    def __init__(self, momentum=0.0, multi_precision=False, warmup_strategy='linear',
+                 warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60,
+                 **kwargs):
+        super(LBSGD, self).__init__(**kwargs)
+        logging.info('Running Large-Batch SGD Algorithm')
+        logging.info('(Batch_scale=%f, warmup_epochs=%d, warmup_strategy=%s, updates_per_epoch=%d)',
+                     batch_scale, warmup_epochs, warmup_strategy, updates_per_epoch)
+        self.momentum = momentum
+        self.multi_precision = multi_precision
+        # new user parameters for large batch
+        self.warmup_strategy = warmup_strategy
+        self.warmup_epochs = warmup_epochs
+        self.batch_scale = batch_scale
+        self.updates_per_epoch = updates_per_epoch
+        self.init_updates = begin_epoch * updates_per_epoch
+        self.num_epochs = num_epochs
+        # addl internal usage parameters and storage
+        self.lbmult = 1
+        self.cumgrads = {}
+        # for adaptive lr
+        self.adaptive = False
+        self.admult = 1  # adaptation constant
+
+    def create_state(self, index, weight):
 
 Review comment:
  @ashokei As suggested, could you change it to inherit from SGD and override `create_state_multi_precision`, `create_state`, `update`, and `update_multi_precision` only where necessary? It seems like you are mixing the multi_precision path into the normal one.
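  A rough sketch of the suggested structure, assuming the only behavioural change LBSGD needs over SGD is scaling the effective learning rate during warmup (the linear ramp and the use of `_get_lr`/`num_update` from the base Optimizer are assumptions for illustration, not the PR's final design):

    from mxnet.optimizer import SGD, register

    @register
    class LBSGD(SGD):
        # Sketch: create_state, create_state_multi_precision, update and
        # update_multi_precision are inherited unchanged from SGD.

        def __init__(self, warmup_epochs=5, batch_scale=1,
                     updates_per_epoch=32, begin_epoch=0, **kwargs):
            super(LBSGD, self).__init__(**kwargs)
            self.warmup_updates = warmup_epochs * updates_per_epoch
            self.batch_scale = float(batch_scale)
            self.init_updates = begin_epoch * updates_per_epoch

        def _get_lr(self, index):
            # Scale the base learning rate by a (here: linear) warmup multiplier.
            nup = self.num_update + self.init_updates
            if nup >= self.warmup_updates:
                mult = self.batch_scale
            elif self.warmup_updates <= 1:
                mult = 1.0
            else:
                mult = 1.0 + (self.batch_scale - 1.0) * nup / self.warmup_updates
            return super(LBSGD, self)._get_lr(index) * mult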




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-26 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265033
 
 

 ##########
 File path: python/mxnet/lr_scheduler.py
 ##########
 @@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """Reduce the learning rate according to a polynomial of the number of updates.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr   if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+    num_update : current number of updates
+    max_update : maximum number of updates before the decay reaches 0.
+    base_lr : base learning rate
+    pwr : power of the decay term as a function of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
 
 Review comment:
  `num_update` is not needed here.
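  For context, a usage sketch of how this scheduler would typically be constructed, which also shows why the constructor argument adds little: in normal training you would just pass 0. It assumes the PolyScheduler from the diff above is importable; the numbers are made up.

    import mxnet as mx

    # polynomial decay over roughly 100 epochs of 5000 updates each
    sched = PolyScheduler(num_update=0, max_update=100 * 5000, base_lr=0.1, pwr=2)
    opt = mx.optimizer.SGD(learning_rate=0.1, momentum=0.9, lr_scheduler=sched)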




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-26 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265043
 
 

 ##########
 File path: python/mxnet/lr_scheduler.py
 ##########
 @@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """Reduce the learning rate according to a polynomial of the number of updates.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr   if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+    num_update : current number of updates
+    max_update : maximum number of updates before the decay reaches 0.
+    base_lr : base learning rate
+    pwr : power of the decay term as a function of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update
+        self.base_lr = self.base_lr_orig
+
+    def __call__(self, num_update):
+        if num_update <= self.max_update:
+            self.base_lr = self.base_lr_orig * pow(1.0 - float(num_update) / float(self.max_update),
+                                                   self.power)
+        self.count += 1
 
 Review comment:
   and here




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-26 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265038
 
 

 ##########
 File path: python/mxnet/lr_scheduler.py
 ##########
 @@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """Reduce the learning rate according to a polynomial of the number of updates.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr   if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+    num_update : current number of updates
+    max_update : maximum number of updates before the decay reaches 0.
+    base_lr : base learning rate
+    pwr : power of the decay term as a function of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update
 
 Review comment:
  Same for `self.count`.




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-26 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164242034
 
 

 ##########
 File path: python/mxnet/lr_scheduler.py
 ##########
 @@ -136,3 +136,42 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """Reduce the learning rate according to a polynomial of the number of updates.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr   if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+    num_update : current number of updates
+    max_update : maximum number of updates before the decay reaches 0.
+    base_lr : base learning rate
+    pwr : power of the decay term as a function of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update
+        if num_update <= max_update:
 
 Review comment:
  This duplicates line 173. I understand it is for resuming training, but that should be handled in `__call__`; see the example in MultiFactorScheduler. Therefore, `num_update` is not necessary in `__init__`.
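  A minimal sketch of the suggested shape, with resumption handled entirely in `__call__` (as MultiFactorScheduler does) and no `num_update` argument in `__init__`; this sketches the reviewer's suggestion, not the PR's final code:

    from mxnet.lr_scheduler import LRScheduler

    class PolyScheduler(LRScheduler):
        # lr = base_lr * (1 - num_update / max_update) ** pwr, frozen after max_update

        def __init__(self, max_update, base_lr=0.01, pwr=2):
            super(PolyScheduler, self).__init__(base_lr)
            if not isinstance(max_update, int) or max_update < 1:
                raise ValueError("maximum number of updates must be a positive integer")
            self.base_lr_orig = base_lr
            self.max_update = max_update
            self.power = pwr

        def __call__(self, num_update):
            # num_update already reflects resumed training (the optimizer tracks
            # it), so no extra state is needed in the constructor.
            if num_update <= self.max_update:
                self.base_lr = self.base_lr_orig * (
                    1.0 - float(num_update) / self.max_update) ** self.power
            return self.base_lr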




[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?

2018-01-12 Thread GitBox
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add?
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r161338805
 
 

 ##########
 File path: python/mxnet/optimizer.py
 ##########
 @@ -531,6 +531,197 @@ def update_multi_precision(self, index, weight, grad, state):
         self._update_impl(index, weight, grad, state,
                           multi_precision=use_multi_precision)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision : bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy : string ('linear', 'power', 'sqrt', 'lars'), default: 'linear'
+    warmup_epochs : unsigned, default: 5
+    batch_scale : unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch : unsigned, default: 32 (the default might not reflect the true number of batches per epoch; used for warmup)
+    begin_epoch : unsigned, default: 0, starting epoch.
+    """
+
+    def __init__(self, momentum=0.0, multi_precision=False, warmup_strategy='linear',
+                 warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60,
+                 **kwargs):
+        super(LBSGD, self).__init__(**kwargs)
+        logging.info('Running Large-Batch SGD Algorithm')
+        logging.info('(Batch_scale=%f, warmup_epochs=%d, warmup_strategy=%s, updates_per_epoch=%d)',
+                     batch_scale, warmup_epochs, warmup_strategy, updates_per_epoch)
+        self.momentum = momentum
+        self.multi_precision = multi_precision
+        # new user parameters for large batch
+        self.warmup_strategy = warmup_strategy
+        self.warmup_epochs = warmup_epochs
+        self.batch_scale = batch_scale
+        self.updates_per_epoch = updates_per_epoch
+        self.init_updates = begin_epoch * updates_per_epoch
+        self.num_epochs = num_epochs
+        # addl internal usage parameters and storage
+        self.lbmult = 1
+        self.cumgrads = {}
+        # for adaptive lr
+        self.adaptive = False
+        self.admult = 1  # adaptation constant
+
+    def create_state(self, index, weight):
+        momentum = None
+        weight_master_copy = None
+        if self.multi_precision and weight.dtype == numpy.float16:
+            weight_master_copy = array(weight, ctx=weight.context, dtype=numpy.float32)
+            if self.momentum != 0.0:
+                momentum = zeros(weight.shape, weight.context, dtype=numpy.float32,
+                                 stype=weight.stype)
+            return (momentum, weight_master_copy)
+        if weight.dtype == numpy.float16 and not self.multi_precision:
+            warnings.warn("Accumulating with float16 in optimizer can lead to "
+                          "poor accuracy or slow convergence. "
+                          "Consider using multi_precision=True option of the "
+                          "SGD optimizer")
+        if self.momentum != 0.0:
+            momentum = zeros(weight.shape, weight.context, dtype=weight.dtype, stype=weight.stype)
+        return momentum
+
+    def _get_lbmult(self, nup):
+        """Returns lr scaling factor for large batch according to warmup schedule
+        (to be implemented)
+        """
+        nwup = self.warmup_epochs * self.updates_per_epoch
+        strategy = self.warmup_strategy
+        maxmult = float(self.batch_scale)
+        if nup >= nwup:
+            mult = maxmult
+        elif nwup <= 1:
+            mult = 1.0
+        else:
+            if (strategy == 'linear'):
+                mult = 1.0 + (maxmult - 1) * nup / nwup
+            elif (strategy == 'power2'):
+                mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup)
+            elif (strategy == 'power3'):
+                mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup)
 
 Review comment:
  Power3 is wrong: it computes the same quadratic term as `power2` instead of cubing it.
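  For comparison, a cubic ramp would cube the progress ratio rather than repeat the power2 term; a minimal standalone sketch (variable names follow `_get_lbmult` above, and this is illustrative, not the PR's eventual fix):

    def power3_mult(nup, nwup, maxmult):
        # Cubic warmup: multiplier ramps from 1.0 to maxmult over nwup updates.
        if nup >= nwup:
            return maxmult
        if nwup <= 1:
            return 1.0
        return 1.0 + (maxmult - 1) * (nup ** 3) / float(nwup ** 3)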

