[GitHub] zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164556307

## File path: python/mxnet/optimizer.py
##
@@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision: bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy: string ('linear', 'power2', 'sqrt', 'lars'; default: 'linear')
+    warmup_epochs: unsigned, default: 5
+    batch_scale: unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch: updates_per_epoch (default: 32; the default might not reflect the
+        true number of batches per epoch. Used for warmup.)
+    begin_epoch: unsigned, default 0, starting epoch.

Review comment:
   @ashokei please add more details describing the strategy.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164555827

## File path: python/mxnet/optimizer.py
##
@@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision: bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy: string ('linear', 'power2', 'sqrt', 'lars'; default: 'linear')
+    warmup_epochs: unsigned, default: 5
+    batch_scale: unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch: updates_per_epoch (default: 32; the default might not reflect the
+        true number of batches per epoch. Used for warmup.)

Review comment:
   I guess it requires the epoch number to stop warming up, which does not depend on the number of updates.
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164555449

## File path: python/mxnet/optimizer.py
##
@@ -645,6 +645,195 @@ def update(self, index, weight, grad, state):
         ftml_update(weight, grad, prev_d, prev_v, prev_z, out=weight,
                     lr=lr, wd=wd, **kwargs)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision: bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy: string ('linear', 'power2', 'sqrt', 'lars'; default: 'linear')
+    warmup_epochs: unsigned, default: 5
+    batch_scale: unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch: updates_per_epoch (default: 32; the default might not reflect the
+        true number of batches per epoch. Used for warmup.)
+    begin_epoch: unsigned, default 0, starting epoch.
+""" + +def __init__(self, momentum=0.0, multi_precision=False, warmup_strategy='linear', + warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60, + **kwargs): +super(LBSGD, self).__init__(**kwargs) +logging.info('Running Large-Batch SGD Algorithm') +logging.info('(Batch_scale=%f, warmup_epochs=%d, warmup_strategy=%s, updates_per_epoch=%d)', + batch_scale, warmup_epochs, warmup_strategy, updates_per_epoch) +self.momentum = momentum +self.multi_precision = multi_precision +# new user parameters for large batch +self.warmup_strategy = warmup_strategy +self.warmup_epochs = warmup_epochs +self.batch_scale = batch_scale +self.updates_per_epoch = updates_per_epoch +self.init_updates = begin_epoch * updates_per_epoch +self.num_epochs = num_epochs +# addl internal usage parameters and storage +self.lbmult = 1 +self.cumgrads = {} +# for adaptive lr +self.adaptive = False +self.admult = 1 # adaptation constant + +def create_state(self, index, weight): Review comment: @ashokei As suggested, could you change to inherit SGD and override `create_state_multi_precision`, `create_state`, `update`, `update_multi_precision` only if necessary. Seems like you are mixing multi_precision part into the normal one. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265033

## File path: python/mxnet/lr_scheduler.py
##
@@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """ Reduce the learning rate by given a list of steps.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr
+        if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+        num_update: current number of updates
+        max_update: maximum number of updates before the decay reaches 0.
+        base_lr: base learning rate
+        pwr: power of the decay term as a funtion of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):

Review comment:
   `num_update` useless here
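The polynomial decay formula quoted above, with the reviewer's suggestion applied, can be sketched in a few lines of framework-free Python: `__call__` receives `num_update`, so no update counter needs to live in `__init__`, and resuming training needs no extra bookkeeping. The class name and exact clamping behavior below are illustrative assumptions, not the PR's final code.

```python
# Sketch of the poly schedule: lr = base_lr * (1 - nup/max_nup)**pwr
# while nup < max_nup, and 0 afterwards. All call-time state is num_update.

class PolySchedulerSketch(object):
    def __init__(self, max_update, base_lr=0.01, pwr=2):
        if max_update < 1:
            raise ValueError("maximum number of updates must be strictly positive")
        self.base_lr = base_lr
        self.max_update = max_update
        self.power = pwr

    def __call__(self, num_update):
        if num_update >= self.max_update:
            return 0.0  # decay has reached zero
        frac = 1.0 - float(num_update) / float(self.max_update)
        return self.base_lr * frac ** self.power


sched = PolySchedulerSketch(max_update=100, base_lr=0.1, pwr=2)
print(sched(0))    # 0.1 at the start
print(sched(50))   # 0.1 * 0.5**2 = 0.025 halfway through
print(sched(100))  # 0.0 once max_update is reached
```

Because the returned rate is a pure function of `num_update`, calling it with a restored update count after a checkpoint reload gives the right rate immediately, which is the reviewer's point about handling resume in `__call__` rather than `__init__`.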
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265043

## File path: python/mxnet/lr_scheduler.py
##
@@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """ Reduce the learning rate by given a list of steps.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr
+        if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+        num_update: current number of updates
+        max_update: maximum number of updates before the decay reaches 0.
+        base_lr: base learning rate
+        pwr: power of the decay term as a funtion of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update
+        self.base_lr = self.base_lr_orig
+
+    def __call__(self, num_update):
+        if num_update <= self.max_update:
+            self.base_lr = self.base_lr_orig * pow(1.0 - float(num_update) / float(self.max_update),
+                                                   self.power)
+        self.count += 1

Review comment:
   and here
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164265038

## File path: python/mxnet/lr_scheduler.py
##
@@ -136,3 +136,38 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """ Reduce the learning rate by given a list of steps.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr
+        if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+        num_update: current number of updates
+        max_update: maximum number of updates before the decay reaches 0.
+        base_lr: base learning rate
+        pwr: power of the decay term as a funtion of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update

Review comment:
   same for self.count
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r164242034

## File path: python/mxnet/lr_scheduler.py
##
@@ -136,3 +136,42 @@ def __call__(self, num_update):
             else:
                 return self.base_lr
         return self.base_lr
+
+class PolyScheduler(LRScheduler):
+    """ Reduce the learning rate by given a list of steps.
+
+    Calculate the new learning rate by::
+
+        base_lr * (1 - nup/max_nup)^pwr
+        if nup < max_nup, 0 otherwise.
+
+    Parameters
+    ----------
+        num_update: current number of updates
+        max_update: maximum number of updates before the decay reaches 0.
+        base_lr: base learning rate
+        pwr: power of the decay term as a funtion of the current number of updates.
+
+    """
+
+    def __init__(self, num_update, max_update, base_lr=0.01, pwr=2):
+        super(PolyScheduler, self).__init__(base_lr)
+        assert isinstance(max_update, int)
+        if max_update < 1:
+            raise ValueError("maximum number of updates must be strictly positive")
+        self.base_lr_orig = self.base_lr
+        self.max_update = max_update
+        self.power = pwr
+        self.count = num_update
+        if num_update <= max_update:

Review comment:
   This is duplicated with line 173. I understand it is for resume training, but that should be handled in `__call__`, see the example in MultiFactorScheduler. Therefore, num_update is not necessary in `__init__`.
zhreshold commented on a change in pull request #8918: Added in Large-Batch SGD with a warmup, and a LARS startegy. Also add…
URL: https://github.com/apache/incubator-mxnet/pull/8918#discussion_r161338805

## File path: python/mxnet/optimizer.py
##
@@ -531,6 +531,197 @@ def update_multi_precision(self, index, weight, grad, state):
         self._update_impl(index, weight, grad, state,
                           multi_precision=use_multi_precision)
 
+@register
+class LBSGD(Optimizer):
+    """The Large Batch SGD optimizer with momentum and weight decay.
+
+    The optimizer updates the weight by::
+
+        state = momentum * state + lr * rescale_grad * clip(grad, clip_gradient) + wd * weight
+        weight = weight - state
+
+    For details of the update algorithm see :class:`~mxnet.ndarray.lbsgd_update` and
+    :class:`~mxnet.ndarray.lbsgd_mom_update`.
+
+    This optimizer accepts the following parameters in addition to those accepted
+    by :class:`.Optimizer`.
+
+    Parameters
+    ----------
+    momentum : float, optional
+        The momentum value.
+    multi_precision: bool, optional
+        Flag to control the internal precision of the optimizer.
+        ``False`` results in using the same precision as the weights (default),
+        ``True`` makes internal 32-bit copy of the weights and applies gradients
+        in 32-bit precision even if actual weights used in the model have lower precision.
+        Turning this on can improve convergence and accuracy when training with float16.
+    warmup_strategy: string ('linear', 'power', 'sqrt', 'lars'; default: 'linear')
+    warmup_epochs: unsigned, default: 5
+    batch_scale: unsigned, default: 1 (same as batch size * numworkers)
+    updates_per_epoch: updates_per_epoch (default: 32; the default might not reflect the
+        true number of batches per epoch. Used for warmup.)
+    begin_epoch: unsigned, default 0, starting epoch.
+""" + +def __init__(self, momentum=0.0, multi_precision=False, warmup_strategy='linear', + warmup_epochs=5, batch_scale=1, updates_per_epoch=32, begin_epoch=0, num_epochs=60, + **kwargs): +super(LBSGD, self).__init__(**kwargs) +logging.info('Running Large-Batch SGD Algorithm') +logging.info('(Batch_scale=%f, warmup_epochs=%d, warmup_strategy=%s, updates_per_epoch=%d)', + batch_scale, warmup_epochs, warmup_strategy, updates_per_epoch) +self.momentum = momentum +self.multi_precision = multi_precision +# new user parameters for large batch +self.warmup_strategy = warmup_strategy +self.warmup_epochs = warmup_epochs +self.batch_scale = batch_scale +self.updates_per_epoch = updates_per_epoch +self.init_updates = begin_epoch * updates_per_epoch +self.num_epochs = num_epochs +# addl internal usage parameters and storage +self.lbmult = 1 +self.cumgrads = {} +# for adaptive lr +self.adaptive = False +self.admult = 1 # adaptation constant + +def create_state(self, index, weight): +momentum = None +weight_master_copy = None +if self.multi_precision and weight.dtype == numpy.float16: +weight_master_copy = array(weight, ctx=weight.context, dtype=numpy.float32) +if self.momentum != 0.0: +momentum = zeros(weight.shape, weight.context, dtype=numpy.float32, + stype=weight.stype) +return (momentum, weight_master_copy) +if weight.dtype == numpy.float16 and not self.multi_precision: +warnings.warn("Accumulating with float16 in optimizer can lead to " + "poor accuracy or slow convergence. 
" + "Consider using multi_precision=True option of the " + "SGD optimizer") +if self.momentum != 0.0: +momentum = zeros(weight.shape, weight.context, dtype=weight.dtype, stype=weight.stype) +return momentum + +def _get_lbmult(self, nup): +"""Returns lr scaling factor for large batch according to warmup schedule +(to be implemented) +""" +nwup = self.warmup_epochs * self.updates_per_epoch +strategy = self.warmup_strategy +maxmult = float(self.batch_scale) +if nup >= nwup: +mult = maxmult +elif nwup <= 1: +mult = 1.0 +else: +if (strategy == 'linear'): +mult = 1.0 + (maxmult - 1) * nup / nwup +elif (strategy == 'power2'): +mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup) +elif (strategy == 'power3'): +mult = 1.0 + (maxmult - 1) * (nup * nup) / (nwup * nwup) Review comment: Power3 is wrong This is an automated message from