[
https://issues.apache.org/jira/browse/SYSTEMML-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090967#comment-16090967
]
Matthias Boehm edited comment on SYSTEMML-1774 at 7/18/17 1:56 AM:
-------------------------------------------------------------------
Well, of course I'm happy to help here, but let's separate the individual issues
first.
1) NPE in ConvolutionCPInstruction: [~niketanpansare] could you please have a
look into this issue? The compiled -1 parameter is a bit suspicious. Anyway, it
should not throw a NullPointerException. Also, why is there a
ConvolutionUtils.scalarOperations? These convolution operations should call
the existing scalar operations.
2) Parfor REMOTE_SPARK: Just to be clear, running in spark execution mode while
forcing REMOTE_SPARK is an invalid configuration. We have mechanisms to force
the recompile to CP for all instructions in the parfor body, but they do not
apply to conflicting configurations.
The real issue here is the need to force spark and/or remote_spark at all. No
library of dml scripts should force REMOTE_SPARK (other than for testing)
because it can create many issues such as unnecessary OOMs or
counter-productive performance (e.g., in your configuration the driver has more
virtual cores than your remote executor). If there are limitations of size
propagation that prevent us from compiling this automatically when beneficial,
we should fix the underlying root cause. [~Tenma] and [~dusenberrymw], could you
please provide the configuration of a scenario where REMOTE_SPARK was
beneficial but not automatically chosen, and I'll take care of it.
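
To illustrate point 2, here is a minimal DML sketch. It is not taken from the
attached scripts: the loop body, the matrix sizes, and the value of
parallel_batches are placeholders chosen for illustration. It contrasts the
forced parfor header quoted in the issue description (shown as a comment) with
a plain parfor that leaves the choice of execution mode to the optimizer.

```
# Hypothetical stand-in for the mini-batch loop; values are placeholders.
parallel_batches = 4
R = matrix(0, rows=parallel_batches, cols=1)

# Forced variant from the issue description (conflicts with spark execution mode):
#   parfor (j in 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED) { ... }

# Unforced variant: no mode/opt arguments, so the parfor optimizer decides.
parfor (j in 1:parallel_batches) {
  X = rand(rows=100, cols=10, seed=j)   # placeholder for one mini-batch
  R[j, 1] = as.matrix(sum(X))           # placeholder for the per-batch result
}

print("total: " + sum(R))
```

If a scenario exists where the unforced variant picks a worse plan than
REMOTE_SPARK would, that configuration is the case worth reporting.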
> Improve Parfor parallelism for deep learning
> --------------------------------------------
>
> Key: SYSTEMML-1774
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1774
> Project: SystemML
> Issue Type: Improvement
> Components: Algorithms, Compiler, ParFor
> Affects Versions: SystemML 1.0
> Reporter: Fei Hu
> Labels: deeplearning
> Attachments: Explain_For_HYBRID_SPARK_Mode_With_ErrorInfo.txt,
> Explain_For_Spark_Mode.txt, MNIST_Distrib_Sgd.scala,
> mnist_lenet_distrib_sgd.dml
>
>
> When running the [distributed MNIST LeNet example |
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml],
> each mini-batch could ideally run in parallel without interaction. We tried to
> force {{REMOTE_SPARK}} mode by changing {{parfor (j in 1:parallel_batches)}} at
> line 137 of {{nn/examples/mnist_lenet_distrib_sgd.dml}} to {{parfor (j in
> 1:parallel_batches, mode=REMOTE_SPARK, opt=CONSTRAINED)}}. This fails with
> {{org.apache.sysml.runtime.DMLRuntimeException: Not supported: Instructions
> of type other than CP instructions}} in execution mode {{SPARK}}, and with
> {{java.lang.NullPointerException}} in execution mode {{HYBRID_SPARK}}. More log
> information can be found in the following comments.