[
https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091975#comment-16091975
]
Fei Hu edited comment on SYSTEMML-1774 at 7/18/17 6:44 PM:
-----------------------------------------------------------
Our experiment plan to test the performance of the distributed MNIST LeNet
example is:
# HYBRID_SPARK + plain parfor: ~1.5 hours
# SPARK + plain parfor: ~30 mins
# HYBRID_SPARK + REMOTE_SPARK parfor: failed
# SPARK + REMOTE_SPARK parfor: failed
Once we have run times for all four scenarios, we should have some hints about
how to improve the performance of distributed SGD training.
Some new findings from running experiments on the local machine with
{{HYBRID_SPARK + REMOTE_SPARK parfor}}:
* When using the attached Scala file to run the example, we get the errors
shown in {{Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt}}.
* However, the above errors on the local machine appear to be related to
{{batchSize}} and the memory size available to Spark:
a) {{batchSize}}: after changing it from {{2}} to {{16}} via
{{val batchSize = 16}}, the errors disappeared.
b) Memory size for Spark:
* With {{batchSize}} set to {{4}}, the same
errors occurred.
* But after increasing the memory size for Spark to
{{11.8GB}} via {{conf.set("spark.testing.memory", memSize.toString)}} with
{{batchSize = 4}}, the errors disappeared.
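For reference, the memory workaround above can be sketched in Scala roughly as follows. This is a minimal sketch, assuming the driver builds its own {{SparkConf}}; the variable names ({{memSize}}, the app name) are illustrative and not taken from the attached {{MNIST_Distrib_Sgd.scala}}, while {{spark.testing.memory}} is the actual property referenced in the finding:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative sketch of the workaround; only "spark.testing.memory"
// is taken from the comment above, the rest is assumed setup.
val memSize: Long = 11800000000L  // ~11.8 GB, expressed in bytes
val conf = new SparkConf()
  .setAppName("MNIST_Distrib_Sgd")
  // Override the amount of memory Spark assumes is available (a
  // testing-only property). With this raised to ~11.8 GB, the errors
  // no longer appeared even at the smaller batch size.
  .set("spark.testing.memory", memSize.toString)

val batchSize = 4  // failed with default memory; worked after the override
{code}

Note that {{spark.testing.memory}} is intended for tests; in a real deployment the equivalent knobs would be {{spark.driver.memory}} / {{spark.executor.memory}}.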
> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
> Key: SYSTEMML-1774
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
> Project: SystemML
> Issue Type: Improvement
> Components: Algorithms, Compiler, ParFor
> Affects Versions: SystemML 1.0
> Reporter: Fei Hu
> Labels: deeplearning
> Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt,
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala,
> mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example |
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. We tried
> changing {{parfor (j in 1:parallel_batches)}} at line 137 of
> {{nn/examples/mnist_lenet_distrib_sgd.dml}} to {{parfor (j in
> 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}} to force
> {{REMOTE_SPARK}} mode, but got the error
> {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions
> of type other than CP instructions}} in {{SPARK}} mode, and
> {{java.lang.NullPointerException}} in {{HYBRID_SPARK}} mode. More log
> information can be found in the comments below.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)