Hi Matthias,

I am OK with removing this flag, but I would prefer that we keep the JIRA
open until we are sure that caching is not a bottleneck. I have noticed
that the gradients turn sparse as we execute more iterations. Also, the
cache release time depends on the memory budget. Here are the
statistics from running Lenet on MNIST using
https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN/examples
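
For reference, a run like the ones below can be launched roughly as follows
(a sketch only; the jar path, script name, and master setting are assumptions,
not taken from this thread):

```shell
# Hypothetical invocation: run the Lenet example script with -stats to
# produce a report like the ones below; vary --driver-memory (20G vs 5G)
# to compare cache behavior under different memory budgets.
$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --driver-memory 20G \
  SystemML.jar -f mnist_lenet.dml -stats
```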

With 20G driver memory, the statistics after running 10 epochs are as
follows:
Epoch: 10, Iter: 700, Train Loss: 0.20480149054528493, Train Accuracy:
0.984375, Val Loss: 0.026928755962383588, Val Accuracy: 0.9922
Epoch: 10, Iter: 800, Train Loss: 0.20165772217976913, Train Accuracy: 1.0,
Val Loss: 0.027878978005867083, Val Accuracy: 0.9922
17/02/14 16:06:58 INFO DMLScript: SystemML Statistics:
Total elapsed time:             12687.863 sec.
Total compilation time:         2.168 sec.
Total execution time:           12685.694 sec.
Number of compiled Spark inst:  147.
Number of executed Spark inst:  4.
Cache hits (Mem, WB, FS, HDFS): 1096424/0/0/2.
Cache writes (WB, FS, HDFS):    603950/15/8.
Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec.
HOP DAGs recompiled (PRED, SB): 0/154885.
HOP DAGs recompile time:        28.663 sec.
Functions recompiled:           1.
Functions recompile time:       0.024 sec.
Spark ctx create time (lazy):   1.009 sec.
Spark trans counts (par,bc,col):0/0/2.
Spark trans times (par,bc,col): 0.000/0.000/3.433 secs.
Total JIT compile time:         44.711 sec.
Total JVM GC count:             7459.
Total JVM GC time:              166.26 sec.
Heavy hitter instructions (name, time, count):
-- 1)   train   12138.979 sec   1
-- 2)   conv2d_bias_add         10876.708 sec   17362
-- 3)   conv2d_backward_filter  421.303 sec     17200
-- 4)   sel+    239.660 sec     25881
-- 5)   update  226.687 sec     68800
-- 6)   update_nesterov         223.775 sec     68800
-- 7)   maxpooling_backward     136.709 sec     17200
-- 8)   conv2d_backward_data    134.315 sec     8600
-- 9)   ba+*    118.897 sec     51762
-- 10)  relu_maxpooling         112.283 sec     17362
-- 11)  relu_backward   107.483 sec     34400
-- 12)  uack+   89.258 sec      34400
-- 13)  r'      74.304 sec      43000
-- 14)  +*      57.193 sec      34400
-- 15)  *       16.493 sec      95178
-- 16)  rand    16.038 sec      8613
-- 17)  /       8.352 sec       86492
-- 18)  rangeReIndex    6.628 sec       17208
-- 19)  +       3.054 sec       96528
-- 20)  uark+   2.219 sec       43241
-- 21)  sp_csvrblk      2.183 sec       2
-- 22)  rmvar   1.517 sec       1451571
-- 23)  write   1.250 sec       9
-- 24)  -       1.059 sec       86486
-- 25)  createvar       1.026 sec       587259
-- 26)  exp     0.663 sec       17281
-- 27)  *2      0.361 sec       2
-- 28)  uasqk+  0.277 sec       320
-- 29)  log     0.200 sec       160
-- 30)  uarmax  0.191 sec       17281

With 5G driver memory, the statistics after running 10 epochs are as
follows:
Epoch: 10, Iter: 700, Train Loss: 0.19313544015858036, Train Accuracy: 1.0,
Val Loss: 0.025943927403263182, Val Accuracy: 0.993
Epoch: 10, Iter: 800, Train Loss: 0.1883995965207449, Train Accuracy: 1.0,
Val Loss: 0.0260796819319468, Val Accuracy: 0.9916
17/02/14 20:16:40 INFO DMLScript: SystemML Statistics:
Total elapsed time:             13886.763 sec.
Total compilation time:         2.148 sec.
Total execution time:           13884.615 sec.
Number of compiled Spark inst:  147.
Number of executed Spark inst:  4.
Cache hits (Mem, WB, FS, HDFS): 1096422/0/2/2.
Cache writes (WB, FS, HDFS):    603868/2176/8.
Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec.
HOP DAGs recompiled (PRED, SB): 0/154885.
HOP DAGs recompile time:        28.290 sec.
Functions recompiled:           1.
Functions recompile time:       0.023 sec.
Spark ctx create time (lazy):   0.981 sec.
Spark trans counts (par,bc,col):0/0/2.
Spark trans times (par,bc,col): 0.000/0.000/3.501 secs.
Total JIT compile time:         45.131 sec.
Total JVM GC count:             7605.
Total JVM GC time:              157.716 sec.
Heavy hitter instructions (name, time, count):
-- 1)   train   13301.811 sec   1
-- 2)   conv2d_bias_add         11890.291 sec   17362
-- 3)   conv2d_backward_filter  416.645 sec     17200
-- 4)   ba+*    252.966 sec     51762
-- 5)   sel+    237.334 sec     25881
-- 6)   update  228.261 sec     68800
-- 7)   update_nesterov         225.383 sec     68800
-- 8)   maxpooling_backward     134.260 sec     17200
-- 9)   +*      133.959 sec     34400
-- 10)  conv2d_backward_data    128.046 sec     8600
-- 11)  relu_maxpooling         106.499 sec     17362
-- 12)  relu_backward   104.062 sec     34400
-- 13)  uack+   90.104 sec      34400
-- 14)  r'      70.932 sec      43000
-- 15)  *       16.203 sec      95178
-- 16)  rand    16.131 sec      8613
-- 17)  /       7.988 sec       86492
-- 18)  rangeReIndex    7.640 sec       17208
-- 19)  sp_csvrblk      2.220 sec       2
-- 20)  +       2.121 sec       96528
-- 21)  uark+   2.079 sec       43241
-- 22)  rmvar   1.580 sec       1451571
-- 23)  rshape  1.533 sec       17200
-- 24)  write   1.322 sec       9
-- 25)  createvar       0.976 sec       587259
-- 26)  -       0.961 sec       86486
-- 27)  exp     0.659 sec       17281
-- 28)  uasqk+  0.314 sec       320
-- 29)  *2      0.312 sec       2
-- 30)  log     0.200 sec       160
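
To make the cache-release difference concrete, here is a small sketch (not
part of the original runs) that recomputes the RLS overhead from the numbers
in the two statistics dumps above:

```python
# Compare cache release (RLS) time as a share of total execution time,
# using the figures reported in the two statistics dumps above.
runs = {
    "20G driver": {"total_exec_s": 12685.694, "rls_s": 61.831, "fs_writes": 15},
    "5G driver":  {"total_exec_s": 13884.615, "rls_s": 271.757, "fs_writes": 2176},
}

for name, r in runs.items():
    share = 100.0 * r["rls_s"] / r["total_exec_s"]
    print(f"{name}: RLS {r['rls_s']:.1f}s ({share:.2f}% of execution), "
          f"{r['fs_writes']} FS cache writes")
```

With the smaller 5G budget, RLS time grows from roughly 0.5% to about 2% of
execution time, and FS cache writes jump from 15 to 2176, which is why the
memory budget matters here.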

Thanks,

Niketan Pansare
IBM Almaden Research Center
E-mail: npansar At us.ibm.com
http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar



From:   Matthias Boehm <mboe...@googlemail.com>
To:     dev@systemml.incubator.apache.org
Date:   02/13/2017 04:29 PM
Subject:        Re: Removal of workaround flags



Well, I used exactly the mnist_lenet scenario discussed in the JIRA, but
what I've observed are eviction times <2.5% of total execution time, almost
no sparse intermediates, and the script execution time being dominated by
conv2d_bias_add. Again, the discrepancy might very well stem from changes
made since the JIRA was created.

In any case, I would rather address any existing performance issues than
globally disable evictions (which could easily lead to OOMs) or sparse
matrix formats. Hence, I'd like to remove these workaround flags in order
to prevent shortcuts that do not apply to all users.

Regards,
Matthias

On Mon, Feb 13, 2017 at 9:19 AM, <dusenberr...@gmail.com> wrote:

> Thanks for bringing up the topic.  Our deep learning scripts (i.e.
> algorithms with several intermediate transformations) have shown cache
> release times to be a major bottleneck, thus leading to the creation of
> SYSTEMML-1140.  Specifically, what did you use to attempt to reproduce 1140?
>
>
> -Mike
>
> --
>
> Mike Dusenberry
> GitHub: github.com/dusenberrymw
> LinkedIn: linkedin.com/in/mikedusenberry
>
> Sent from my iPhone.
>
>
> > On Feb 12, 2017, at 12:30 AM, Matthias Boehm <mboe...@googlemail.com>
> wrote:
> >
> > SYSTEMML-1140
>

