[
https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1268:
------------------------------------
Description:
Story
`As a MADlib developer`
I want to investigate convergence behaviour when running a single distributed CNN
model across the Greenplum cluster using Keras with a TensorFlow backend
`so that`
I can see if it converges in a predictable and expected way.
Details
* By "single distributed CNN model" I mean data parallel with merge (not model
parallel).
* This spike does not need to use an aggregate if that is too inconvenient,
since the focus of this story is convergence rather than performance.
* In defining the merge function, review [2] for the single-server, multi-GPU
merge function. Perhaps we can do the same thing for multi-server? (See the
weight-averaging sketch after this list.)
* For the dataset, consider MNIST and/or CIFAR-10.
* See page 11 of [8] regarding synchronous data parallelism in TF.
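As a starting point for the merge discussion, here is a minimal sketch assuming
the merge operates on weights only and simply averages the per-segment model
replicas element-wise. Note that [2] merges replica *outputs* (the replicas
share one set of weights), so multi-server training will likely need something
along these lines instead; the names below are illustrative, not an agreed
design.
{code:python}
import numpy as np

def merge_weights(replica_weight_lists):
    """Element-wise average of per-segment CNN weights.

    replica_weight_lists: one `model.get_weights()` result per segment.
    Returns a weight list suitable for `model.set_weights()`.
    """
    # zip(*...) groups the k-th weight array of every replica together.
    return [np.mean(np.stack(arrays), axis=0)
            for arrays in zip(*replica_weight_lists)]

# Hypothetical usage on the master, given one trained Keras model per segment:
#   merged = merge_weights([m.get_weights() for m in segment_models])
#   global_model.set_weights(merged)
{code}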
Acceptance
1) Plot characteristic curves of loss vs. iteration number, comparing training
with the MADlib merge (this story) vs. without the merge (a minimal iteration
sketch follows this list).
2) Define what the merge function is for CNN. Is it the same as [2] or
something else? Does it operate on weights only or does it need gradients?
3) What does the architecture look like? Draw a diagram showing the sync/merge
step for distributed model training.
4) What tests do we need to do to convince ourselves that the architecture is
valid?
5) Do we need to write different merge functions, or take a different approach,
for each type of neural net algorithm? Or is there a general approach that
applies to this whole class of algorithms?
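To make the acceptance items concrete, here is a minimal single-process sketch
of the iteration structure the spike would exercise: each "segment" does one
local pass over its shard, the per-segment weights are averaged (weights only,
no gradients), and the loss is logged per iteration for the curves in item 1.
The CNN, shard layout, and hyperparameters are hypothetical placeholders, not
the proposed MADlib architecture.
{code:python}
import numpy as np
from tensorflow import keras

def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    # Small MNIST-style CNN; the architecture is illustrative only.
    return keras.Sequential([
        keras.Input(shape=input_shape),
        keras.layers.Conv2D(32, 3, activation="relu"),
        keras.layers.MaxPooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])

def train_with_merge(segment_shards, iterations=10):
    """segment_shards: list of (x, y) numpy pairs, one shard per 'segment'."""
    global_model = build_cnn()
    global_model.compile(optimizer="sgd",
                         loss="sparse_categorical_crossentropy")
    losses = []  # loss vs. iteration number, for the acceptance curves
    for _ in range(iterations):
        replica_weights = []
        for x, y in segment_shards:
            # "Transition" step: one local pass over this segment's rows,
            # starting from the current global weights.
            replica = build_cnn()
            replica.compile(optimizer="sgd",
                            loss="sparse_categorical_crossentropy")
            replica.set_weights(global_model.get_weights())
            replica.fit(x, y, epochs=1, batch_size=128, verbose=0)
            replica_weights.append(replica.get_weights())
        # "Merge" step: element-wise weight average (see the sketch above).
        global_model.set_weights(
            [np.mean(np.stack(w), axis=0) for w in zip(*replica_weights)])
        x_all = np.concatenate([x for x, _ in segment_shards])
        y_all = np.concatenate([y for _, y in segment_shards])
        losses.append(global_model.evaluate(x_all, y_all, verbose=0))
    return global_model, losses
{code}
Running the same loop with the merge step skipped (each replica keeps training
on its own shard) would give one possible "without merge" baseline for item 1.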
References
[2] See the “# Merge outputs under expected scope” section in the Python
program
https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
[5] Single Machine Data Parallel multi GPU Training
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
[6] Why are GPUs necessary for training Deep Learning models?
https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
[7] Deep Learning vs Classical Machine Learning
https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
[8] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
Systems
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
was:
Story
`As a MADlib developer`
I want to investigate convergence behaviour when running a single distributed CNN
model across the Greenplum cluster using Keras with a TensorFlow backend
`so that`
I can see if it converges in a predictable and expected way.
Details
* By "single distributed CNN model" I mean data parallel with merge (not model
parallel).
* Does not need to use an aggregate for this spike, if that is too
inconvenient, since performance is not the focus of this story. It's about
convergence.
* In defining the merge function, review [2] for single-server, multi-GPU merge
function. Perhaps we can do the exact same thing for multi-server?
* For dataset, consider using Pavan's CNN code and data set [3]. Another
option is MNIST and/or CIFAR-10.
* See page 11 of [8] re synchronous data parallel in TF
Acceptance
1) Plot characteristic curves of loss vs. iteration number. Compare with
MADlib merge (this story) vs. without merge.
2) Define what the merge function is for CNN. Is it the same as [3] or
something else? Does it operate on weights only or does it need gradients?
3) What does the architecture look like? Draw a diagram showing sync/merge
step for distributed model training.
4) What tests do we need to do to convince ourselves that the architecture is
valid?
5) Do we need to write different merge functions, or have a different approach,
for each different neural net type algorithm? Or is there a general approach
that we can use that will apply to this class of algorithms?
6) Anything to learn from pg-strom [2]?
7) Anything to learn from H2O [3]? I don’t think they are doing distributed
training, rather grid search and such.
References
[2] Check for “# Merge outputs under expected scope” section in the python
program
https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
[3] Deep learning example for image segmentation
https://drive.google.com/drive/folders/1mgZPGuDP1JI1TUVaRndexDZTlSABLci9?usp=sharing
[4] Deep Learning & Greenplum Discussion (Pavan/Pivotal DE)
https://drive.google.com/file/d/1U808PAwMetNL38mrboPHdn8RKOrpQyBz/view?usp=sharing
[5] Single Machine Data Parallel multi GPU Training
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
[6] Why are GPUs necessary for training Deep Learning models?
https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
[7] Deep Learning vs Classical Machine Learning
https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
[8] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
Systems
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
> Key: MADLIB-1268
> URL: https://issues.apache.org/jira/browse/MADLIB-1268
> Project: Apache MADlib
> Issue Type: New Feature
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v2.0
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)