[ 
https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090967#comment-16090967
 ] 

Matthias Boehm commented on SYSTEMML-1774:
------------------------------------------

Well, of course I'm happy to help here, but let's separate the individual 
issues first.

1) NPE in ConvolutionCPInstruction: [~niketanpansare], could you please take a 
look at this issue? The compiled -1 parameter is a bit suspicious. In any case, 
it should not throw a NullPointerException. Also, why is there a 
ConvolutionUtils.scalarOperations? These convolution operations should call 
the existing scalar operations.

2) Parfor REMOTE_SPARK: Just to be clear, running in spark execution mode while 
forcing REMOTE_SPARK is an invalid configuration. We have mechanisms to force 
a recompile to CP for all instructions in the parfor body, but they do not 
apply to conflicting configurations.

The real issue here is the need to force spark and/or remote_spark at all. No 
library of DML scripts should force REMOTE_SPARK (other than for testing) 
because it can create many issues, such as unnecessary OOMs. If limitations of 
size propagation prevent us from compiling this automatically when it is 
beneficial, we should fix the underlying root cause. [~Tenma] and 
[~dusenberrymw], could you please provide the configuration of a scenario 
where REMOTE_SPARK was beneficial but not automatically chosen? I'll take 
care of it.


> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
>                 Key: SYSTEMML-1774
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
>             Project: SystemML
>          Issue Type: Improvement
>          Components: Algorithms, Compiler, ParFor
>    Affects Versions: SystemML 1.0
>            Reporter: Fei Hu
>              Labels: deeplearning
>         Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt, 
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala, 
> mnist_lenet_distrib_sgd.dml
>
>
> When running the  [distributed MNIST LeNet example | 
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
>  each mini-batch could ideally run in parallel without interaction. We tried 
> to force {{parfor (j in 1:parallel_batches)}} at line 137 of 
> {{nn/examples/mnist_lenet_distrib_sgd.dml}} to use {{REMOTE_SPARK}} mode by 
> changing it to {{parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, 
> opt=CONSTRAINED)}}, but got the error 
> {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions 
> of type other than CP instructions}} in {{SPARK}} mode, and the error 
> {{java.lang.NullPointerException}} in {{HYBRID_SPARK}} mode. More log 
> information can be found in the following comments.



