[ 
https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092458#comment-16092458
 ] 

Matthias Boehm edited comment on SYSTEMML-1774 at 7/19/17 1:40 AM:
-------------------------------------------------------------------

Here are a couple of guesses: (1) the expensive operations are still run in CP
because distributed operations are globally disabled for any convolution ops
(because they are experimental), (2) running concurrent Spark operations fully
exploits your cluster and not just a single node, and (3) potentially fewer
evictions, given the very small driver and Spark's lazy evaluation.

For your experiments, I would recommend running with reasonable driver sizes
and a cluster of multiple nodes.


was (Author: mboehm7):
Here are a couple of guesses: (1) the expensive operations are still run in CP
because distributed operations are globally disabled for any convolution ops
(because they are experimental), (2) running concurrent Spark operations fully
exploits your cluster and not just a single node, and (3) potentially fewer
evictions, given the very small driver and Spark's lazy evaluation. I can spend
a couple of hours later this week to profile this.

> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, 
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala, 
> mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example | 
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. We tried to 
> force {{parfor (j in 1:parallel_batches)}} at line 137 of 
> {{nn/examples/mnist_lenet_distrib_sgd.dml}} to use {{REMOTE_SPARK}} mode by 
> changing it to {{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, 
> opt=CONSTRAINED)}}, but got the error 
> {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions 
> of type other than CP instructions}} in execution mode {{SPARK}}, and the 
> error {{java.lang.NullPointerException}} in execution mode {{HYBRID_SPARK}}. 
> More log information can be found in the following comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
