[
https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091975#comment-16091975
]
Fei Hu edited comment on SYSTEMML-1774 at 7/18/17 6:44 PM:
-----------------------------------------------------------
Our experiment plan to test the performance of the distributed MNIST LeNet
example is:
# HYBRID_SPARK + plain parfor: ~1.5 hours
# SPARK + plain parfor: ~30 mins
# HYBRID_SPARK + REMOTE_SPARK parfor: failed
# SPARK + REMOTE_SPARK parfor: failed
Once we have run times for all four scenarios, we should have some hints about
how to improve the performance of distributed SGD training.
Some new findings from running experiments on the local machine with
{{HYBRID_SPARK + REMOTE_SPARK parfor}}:
* When using the attached Scala file to run the example, we get the errors
shown in {{Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt}}.
* However, the above errors on the local machine appear to be related to
{{batchSize}} and the memory size available to Spark:
a) {{batchSize}}: after changing it from {{2}} to {{16}} via
{{val batchSize = 16}}, the errors disappeared.
b) Memory size for Spark:
* With {{batchSize}} set to {{4}}, the same
errors occurred.
* But after increasing the memory size for Spark to
{{11.8GB}} via {{conf.set("spark.testing.memory", memSize.toString)}} with
{{batchSize = 4}}, the errors disappeared.
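For reference, the memory workaround above can be sketched in Scala roughly as follows. This is a minimal sketch, assuming the driver builds its own {{SparkConf}}; the variable names ({{memSize}}, the app name) are illustrative and not taken from the attached {{MNIST_Distrib_Sgd.scala}}, while {{spark.testing.memory}} is the actual property referenced in the finding:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative sketch of the workaround; only "spark.testing.memory"
// is taken from the comment above, the rest is assumed setup.
val memSize: Long = 11800000000L  // ~11.8 GB, expressed in bytes
val conf = new SparkConf()
  .setAppName("MNIST_Distrib_Sgd")
  // Override the amount of memory Spark assumes is available (a
  // testing-only property). With this raised to ~11.8 GB, the errors
  // no longer appeared even at the smaller batch size.
  .set("spark.testing.memory", memSize.toString)

val batchSize = 4  // failed with default memory; worked after the override
{code}

Note that {{spark.testing.memory}} is intended for tests; in a real deployment the equivalent knobs would be {{spark.driver.memory}} / {{spark.executor.memory}}.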
> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
> Key: SYSTEMML-1774
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
> Project: SystemML
> Issue Type: Improvement
> Components: Algorithms, Compiler, ParFor
> Affects Versions: SystemML 1.0
> Reporter: Fei Hu
> Labels: deeplearning
> Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt,
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala,
> mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example |
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. We tried
> changing {{parfor (j in 1:parallel_batches)}} at line 137 of
> {{nn/examples/mnist_lenet_distrib_sgd.dml}} to {{parfor (j in
> 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}} to force
> {{REMOTE_SPARK}} mode, but got the error
> {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions
> of type other than CP instructions}} in {{SPARK}} mode, and
> {{java.lang.NullPointerException}} in {{HYBRID_SPARK}} mode. More log
> information can be found in the comments below.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)