[
https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099275#comment-16099275
]
Fei Hu commented on SYSTEMML-1760:
----------------------------------
The following runtime statistics are from one run of the distributed MNIST
example on the Spark cluster. Note that the ParFor result merge alone accounted
for roughly 66% of the total execution time (1077.574 s of 1624.575 s). Is that
reasonable? A minimal sketch of the parfor write/merge pattern is included after
the statistics for reference. cc [~mboehm7] [~dusenberrymw] [~niketanpansare]
Total elapsed time: 1624.575 sec.
Total compilation time: 0.000 sec.
{color:#d04437}Total execution time: 1624.575 sec.{color}
Number of compiled Spark inst: 188.
Number of executed Spark inst: 6.
Cache hits (Mem, WB, FS, HDFS): 481/0/0/288.
Cache writes (WB, FS, HDFS): 214/0/108.
Cache times (ACQr/m, RLS, EXP): 1043.481/0.002/0.017/18.529 sec.
HOP DAGs recompiled (PRED, SB): 0/13.
HOP DAGs recompile time: 0.049 sec.
Functions recompiled: 1.
Functions recompile time: 0.157 sec.
Spark ctx create time (lazy): 0.006 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
ParFor loops optimized: 6.
ParFor optimize time: 0.151 sec.
ParFor initialize time: 0.000 sec.
{color:#d04437}ParFor result merge time: 1077.574 sec.{color}
ParFor total update in-place: 0/0/0
Total JIT compile time: 60.426 sec.
Total JVM GC count: 138.
{color:#d04437}Total JVM GC time: 220.124 sec.{color}
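For reference, here is a minimal, hypothetical DML sketch of the kind of parfor pattern that triggers the result-merge phase; it is not the actual mnist_lenet_distrib_sgd.dml script, and all names (parallel_batches, dW_agg, etc.) are illustrative. Each iteration writes into a disjoint row of a shared result matrix, and those per-iteration writes are what the ParFor result merge combines.
{code}
# Hedged sketch (illustrative only, not the actual mnist_lenet_distrib_sgd.dml):
# each parfor iteration writes its gradient into a disjoint row of a shared
# result matrix, and those rows are merged back during the result-merge phase.
# All names below (parallel_batches, dW_agg, ...) are hypothetical.
N = 1024                                  # examples per parallel group
D = 784                                   # flattened gradient size
parallel_batches = 4

dW_agg = matrix(0, rows=parallel_batches, cols=D)
parfor (j in 1:parallel_batches) {
  # stand-ins for the real mini-batch and gradient computation
  X_batch = rand(rows=N, cols=D)
  dW = colSums(X_batch) / N               # 1 x D row vector
  # left-indexed write into the shared result variable; these per-iteration
  # writes are what the ParFor result merge combines into dW_agg
  dW_agg[j, ] = dW
}
# aggregate the merged per-worker gradients into a single update
dW_final = colSums(dW_agg) / parallel_batches
{code}
One observation from the statistics above, not a confirmed diagnosis: the acquire-read cache time (1043.481 s) is of the same order as the result-merge time, which may indicate that most of the merge cost is spent reading the partial results back rather than in the merge itself.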
> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
> Key: SYSTEMML-1760
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
> Project: SystemML
> Issue Type: Improvement
> Components: Algorithms, Compiler, ParFor
> Reporter: Mike Dusenberry
> Assignee: Fei Hu
>
> Currently, we have a mathematical framework in place for training with
> distributed SGD in a [distributed MNIST LeNet example|https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to push this at scale to determine (1) the current behavior
> of the engine (i.e., does the optimizer actually run this in a distributed
> fashion?), and (2) ways to improve the robustness and performance for this
> scenario. The distributed SGD framework from this example has already been
> ported into Caffe2DML, and thus improvements made for this task will directly
> benefit our efforts towards distributed training of Caffe models (and Keras
> in the future).