[ https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16091874#comment-16091874 ]

Mike Dusenberry edited comment on SYSTEMML-1774 at 7/18/17 5:48 PM:
--------------------------------------------------------------------

Also, of course there *shouldn't* be any need to force the REMOTE_SPARK parfor 
mode, but that is exactly the motivation for this JIRA issue: for distributed 
SGD, a plain parfor isn't performing adequately, and this can serve as a 
scenario for exploring improvements.  Instead of the parfor op executing in a 
distributed manner across Spark, a plain parfor runs on the driver, which 
forces the multithreaded convolution ops into single-threaded mode and is 
counterproductive.  [~Tenma] can comment more, but so far he has found that a 
HYBRID_SPARK + plain parfor setup takes ~1.5 hours, while SPARK + plain parfor 
takes ~30 mins.  HYBRID_SPARK + REMOTE_SPARK parfor currently fails, but I 
imagine it should be faster than 30 mins.
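For context, the forced-mode loop under discussion looks along these lines (a minimal sketch only; the loop body below is illustrative, not the actual contents of {{mnist_lenet_distrib_sgd.dml}}):

{code}
# Force the parfor body onto Spark executors instead of the driver.
# Header as quoted in this issue; body is a hypothetical mini-batch split.
parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED) {
  # each iteration would slice out its own mini-batch and compute
  # gradients independently, with no cross-iteration interaction
  beg = (j - 1) * batch_size + 1
  end = j * batch_size
  X_batch = X[beg:end,]
}
{code}

With {{opt=CONSTRAINED}}, the optimizer is expected to respect the user-specified {{mode=REMOTE_SPARK}} rather than falling back to a local (driver-side) parfor.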



> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, 
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala, 
> mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | 
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
>  each mini-batch could ideally run in parallel without interaction. We tried 
> to force {{parfor (j in 1:parallel_batches)}} at line 137 of 
> {{nn/examples/mnist_lenet_distrib_sgd.dml}} to {{parfor (j in 
> 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}} in order to use 
> {{REMOTE_SPARK}} mode, but got {{org.apache.sysml.runtime.DMLRuntimeException: 
> Not supported: Instructions of type other than CP instructions}} in {{SPARK}} 
> mode, and {{java.lang.NullPointerException}} in {{HYBRID_SPARK}} mode. More 
> log information can be found in the following comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)