[
https://issues.apache.org/jira/browse/SINGA-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
wangwei updated SINGA-122:
--------------------------
Summary: Optimize memory space for Param sharing (was: Update Param class
on memory space controlling)
> Optimize memory space for Param sharing
> ---------------------------------------
>
> Key: SINGA-122
> URL: https://issues.apache.org/jira/browse/SINGA-122
> Project: Singa
> Issue Type: Improvement
> Reporter: wangwei
>
> In deep learning models, some layers share one or more Param objects,
> e.g., the encoder and decoder in an auto-encoder model share the weight
> matrix, and RNN units (layers) share all their parameters. Similarly, for
> parallel training with replicated layers, the replicas share parameters.
> It is necessary to optimize the memory space of these shared parameters.
> Otherwise, we would have to allocate memory for both the parameter values
> and the gradients of each share, which consumes a lot of memory, especially
> in the RNN case and the parallel training case (there may be more than
> 10 shares of one Param object).
> To minimize the memory footprint, there are three sharing levels (a sketch
> follows the list):
> 1. share memory space for CPU values only, e.g., when one Param object is
> replicated on different GPU cards;
> 2. share memory space for both CPU values and GPU values;
> 3. share memory space for both values and gradients.
> In terms of memory footprint, level 1 > level 2 > level 3.
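> A minimal C++ sketch of the three levels (hypothetical names only; Blob,
> ShareLevel and ShareFrom are illustrative, not the actual SINGA API):
>
>   #include <memory>
>   #include <vector>
>
>   enum class ShareLevel { kCpuValue = 1, kCpuGpuValue = 2, kValueAndGrad = 3 };
>
>   struct Blob { std::vector<float> data; };  // stand-in for a value/gradient buffer
>
>   struct Param {
>     std::shared_ptr<Blob> cpu_value, gpu_value, grad;
>
>     // Share memory with the owner Param according to the requested level.
>     // Assumes the owner's blobs have already been allocated.
>     void ShareFrom(Param& owner, ShareLevel level) {
>       cpu_value = owner.cpu_value;                  // levels 1-3: share CPU values
>       if (level >= ShareLevel::kCpuGpuValue)
>         gpu_value = owner.gpu_value;                // levels 2-3: share GPU values too
>       if (level == ShareLevel::kValueAndGrad) {
>         grad = owner.grad;                          // level 3: share the gradient blob
>       } else {
>         grad = std::make_shared<Blob>();            // levels 1-2: keep a private gradient
>         grad->data.resize(owner.grad->data.size());
>       }
>     }
>   };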
> For levels 1 and 2, the code for computing gradients is transparent to
> parameter sharing. However, for level 3, the code must handle gradient
> aggregation correctly; otherwise, the gradients computed for one share would
> be overwritten by the others.
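> A hedged sketch of the aggregation rule (hypothetical helper, not existing
> SINGA code): when the gradient blob is shared, each share must accumulate
> into it instead of assigning, otherwise later shares erase earlier ones.
>
>   #include <cstddef>
>   #include <vector>
>
>   // Write the locally computed gradient into the (possibly shared) buffer.
>   void StoreGradient(std::vector<float>& grad,
>                      const std::vector<float>& local_grad,
>                      bool grad_is_shared) {
>     for (std::size_t i = 0; i < grad.size(); ++i) {
>       if (grad_is_shared)
>         grad[i] += local_grad[i];   // level 3: aggregate contributions from all shares
>       else
>         grad[i] = local_grad[i];    // private buffer: the sole owner may overwrite
>     }
>   }
>
> With sharing enabled, the shared gradient buffer also has to be zeroed once
> per iteration before any share accumulates into it.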
> We need to update both the Param class and the NeuralNet class (to decide
> the sharing level for Param objects). Generally, the NeuralNet class creates
> Param objects and determines the sharing level (together with the user
> configuration). Layer::ComputeGradient assigns or aggregates gradients based
> on the sharing level (e.g., via a flag).
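> A hedged sketch of the wiring step (hypothetical function, reusing the types
> from the sketch above): the first Param registered under a shared name owns
> the memory, and later ones share it at the level chosen from the user
> configuration.
>
>   #include <map>
>   #include <string>
>
>   // Register a Param under its share name; later occurrences share the
>   // owner's memory at the configured level.
>   void RegisterParam(std::map<std::string, Param*>* owners,
>                      const std::string& share_name, Param* param,
>                      ShareLevel level) {
>     auto it = owners->find(share_name);
>     if (it == owners->end()) {
>       (*owners)[share_name] = param;         // first Param becomes the owner
>     } else {
>       param->ShareFrom(*it->second, level);  // subsequent Params share its memory
>     }
>   }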
> Details will be updated later.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)