Yeah I want us to look heavily into this problem in the context of deep 
learning algorithms.  I think we should plan on having first-class support for 
DL in our 1.0 release, including efficient (distributed SGD) training (+GPUs) 
and efficient distributed scoring.  The nice thing, too, is that once we achieve 
this, we'll end up benefiting most of our existing algorithms as well.
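Just to sketch the core idea of synchronous data-parallel SGD with gradient
averaging (illustrative only, not how SystemML would implement it; the model,
data, and function names here are all invented for the example):

```python
# Minimal sketch of one synchronous data-parallel SGD step: each "worker"
# computes a gradient over its data shard for a 1-D least-squares model
# y ~ w*x, the driver averages the per-shard gradients and applies one update.

def shard_gradient(w, shard):
    """Gradient of mean squared error over one worker's shard."""
    g = 0.0
    for x, y in shard:
        g += 2.0 * (w * x - y) * x
    return g / len(shard)

def distributed_sgd_step(w, shards, lr):
    """One synchronous step: average per-shard gradients, then update."""
    grads = [shard_gradient(w, s) for s in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Toy data generated from y = 3*x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = distributed_sgd_step(w, shards, lr=0.02)
print(round(w, 3))  # prints 3.0
```

In a real distributed setting the per-shard gradients would be computed on
executors and aggregated by the driver (or via all-reduce), but the update
rule is the same.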

--

Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.


> On Feb 15, 2017, at 12:22 PM, Niketan Pansare <npan...@us.ibm.com> wrote:
> 
> Hi Matthias,
> 
> I am OK with removing this flag, but would prefer that we keep the JIRA open 
> until we are sure that caching is not a bottleneck. I have noticed that the 
> gradients turn sparse as we execute more iterations. Also, cache release 
> time depends on the memory budget. Here are the statistics from running LeNet 
> on MNIST using 
> https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN/examples
> 
> With 20G driver memory, the statistics after running 10 epochs are as follows:
> Epoch: 10, Iter: 700, Train Loss: 0.20480149054528493, Train Accuracy: 
> 0.984375, Val Loss: 0.026928755962383588, Val Accuracy: 0.9922
> Epoch: 10, Iter: 800, Train Loss: 0.20165772217976913, Train Accuracy: 1.0, 
> Val Loss: 0.027878978005867083, Val Accuracy: 0.9922
> 17/02/14 16:06:58 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 12687.863 sec.
> Total compilation time: 2.168 sec.
> Total execution time: 12685.694 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096424/0/0/2.
> Cache writes (WB, FS, HDFS): 603950/15/8.
> Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.663 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.024 sec.
> Spark ctx create time (lazy): 1.009 sec.
> Spark trans counts (par,bc,col):0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.433 secs.
> Total JIT compile time: 44.711 sec.
> Total JVM GC count: 7459.
> Total JVM GC time: 166.26 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 12138.979 sec 1
> -- 2) conv2d_bias_add 10876.708 sec 17362
> -- 3) conv2d_backward_filter 421.303 sec 17200
> -- 4) sel+ 239.660 sec 25881
> -- 5) update 226.687 sec 68800
> -- 6) update_nesterov 223.775 sec 68800
> -- 7) maxpooling_backward 136.709 sec 17200
> -- 8) conv2d_backward_data 134.315 sec 8600
> -- 9) ba+* 118.897 sec 51762
> -- 10) relu_maxpooling 112.283 sec 17362
> -- 11) relu_backward 107.483 sec 34400
> -- 12) uack+ 89.258 sec 34400
> -- 13) r' 74.304 sec 43000
> -- 14) +* 57.193 sec 34400
> -- 15) * 16.493 sec 95178
> -- 16) rand 16.038 sec 8613
> -- 17) / 8.352 sec 86492
> -- 18) rangeReIndex 6.628 sec 17208
> -- 19) + 3.054 sec 96528
> -- 20) uark+ 2.219 sec 43241
> -- 21) sp_csvrblk 2.183 sec 2
> -- 22) rmvar 1.517 sec 1451571
> -- 23) write 1.250 sec 9
> -- 24) - 1.059 sec 86486
> -- 25) createvar 1.026 sec 587259
> -- 26) exp 0.663 sec 17281
> -- 27) *2 0.361 sec 2
> -- 28) uasqk+ 0.277 sec 320
> -- 29) log 0.200 sec 160
> -- 30) uarmax 0.191 sec 17281
> 
> With 5G driver memory, the statistics after running 10 epochs are as follows:
> Epoch: 10, Iter: 700, Train Loss: 0.19313544015858036, Train Accuracy: 1.0, 
> Val Loss: 0.025943927403263182, Val Accuracy: 0.993
> Epoch: 10, Iter: 800, Train Loss: 0.1883995965207449, Train Accuracy: 1.0, 
> Val Loss: 0.0260796819319468, Val Accuracy: 0.9916
> 17/02/14 20:16:40 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 13886.763 sec.
> Total compilation time: 2.148 sec.
> Total execution time: 13884.615 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096422/0/2/2.
> Cache writes (WB, FS, HDFS): 603868/2176/8.
> Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.290 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.023 sec.
> Spark ctx create time (lazy): 0.981 sec.
> Spark trans counts (par,bc,col):0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.501 secs.
> Total JIT compile time: 45.131 sec.
> Total JVM GC count: 7605.
> Total JVM GC time: 157.716 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 13301.811 sec 1
> -- 2) conv2d_bias_add 11890.291 sec 17362
> -- 3) conv2d_backward_filter 416.645 sec 17200
> -- 4) ba+* 252.966 sec 51762
> -- 5) sel+ 237.334 sec 25881
> -- 6) update 228.261 sec 68800
> -- 7) update_nesterov 225.383 sec 68800
> -- 8) maxpooling_backward 134.260 sec 17200
> -- 9) +* 133.959 sec 34400
> -- 10) conv2d_backward_data 128.046 sec 8600
> -- 11) relu_maxpooling 106.499 sec 17362
> -- 12) relu_backward 104.062 sec 34400
> -- 13) uack+ 90.104 sec 34400
> -- 14) r' 70.932 sec 43000
> -- 15) * 16.203 sec 95178
> -- 16) rand 16.131 sec 8613
> -- 17) / 7.988 sec 86492
> -- 18) rangeReIndex 7.640 sec 17208
> -- 19) sp_csvrblk 2.220 sec 2
> -- 20) + 2.121 sec 96528
> -- 21) uark+ 2.079 sec 43241
> -- 22) rmvar 1.580 sec 1451571
> -- 23) rshape 1.533 sec 17200
> -- 24) write 1.322 sec 9
> -- 25) createvar 0.976 sec 587259
> -- 26) - 0.961 sec 86486
> -- 27) exp 0.659 sec 17281
> -- 28) uasqk+ 0.314 sec 320
> -- 29) *2 0.312 sec 2
> -- 30) log 0.200 sec 160
> 
> Thanks,
> 
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
> 
> From: Matthias Boehm <mboe...@googlemail.com>
> To: dev@systemml.incubator.apache.org
> Date: 02/13/2017 04:29 PM
> Subject: Re: Removal of workaround flags
> 
> 
> 
> 
> Well, I used exactly the mnist_lenet scenario discussed in the JIRA, but
> what I've observed are eviction times <2.5% of total execution time, almost
> no sparse intermediates, and the script execution time being dominated by
> conv2d_bias_add. Again, the discrepancy might very well stem from changes
> made since the JIRA was created.
> 
> In any case, I would rather address any existing performance issues than
> globally disable evictions (which could easily lead to OOMs) or sparse
> matrix formats. Hence, I'd like to remove these workaround flags in order
> to prevent shortcuts that do not apply to all users.
> 
> Regards,
> Matthias
> 
> On Mon, Feb 13, 2017 at 9:19 AM, <dusenberr...@gmail.com> wrote:
> 
> > Thanks for bringing up the topic.  Our deep learning scripts (i.e.
> > algorithms with several intermediate transformations) have shown cache
> > release times to be a major bottleneck, thus leading to the creation of
> > SYSTEMML-1140.  Specifically, what did you use to attempt to reproduce 1140?
> >
> >
> > -Mike
> >
> > --
> >
> > Mike Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > Sent from my iPhone.
> >
> >
> > > On Feb 12, 2017, at 12:30 AM, Matthias Boehm <mboe...@googlemail.com>
> > wrote:
> > >
> > > SYSTEMML-1140
> >
> 
> 
> 
