Yeah, I want us to look seriously into this problem in the context of deep learning algorithms. I think we should plan on first-class support for DL in our 1.0 release, including efficient training (distributed SGD, with GPU support) and efficient distributed scoring. The nice thing is that once we achieve this, most of our existing algorithms will benefit as well.
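To make the training goal concrete, here is a minimal sketch (plain Python, not SystemML/DML) of synchronous data-parallel mini-batch SGD: each worker computes a gradient on its own data shard, and the driver averages the gradients before applying one update. The toy linear model and all names here are illustrative assumptions, not project code.

```python
# Hypothetical sketch of synchronous data-parallel SGD (not SystemML code).

def grad_linear(w, shard):
    """Average gradient of squared error for the model y = w*x on one shard."""
    g = 0.0
    for x, y in shard:
        g += 2.0 * (w * x - y) * x
    return g / len(shard)

def distributed_sgd_step(w, shards, lr=0.05):
    """One synchronous step: average the per-shard gradients, then update."""
    g_avg = sum(grad_linear(w, s) for s in shards) / len(shards)
    return w - lr * g_avg

# Toy run: recover y = 3x from two data shards.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = distributed_sgd_step(w, shards)
print(round(w, 3))  # → 3.0
```

In a real distributed setting the per-shard gradient computation runs on the executors and only the (small) gradients are aggregated on the driver, which is what makes the data-parallel formulation attractive for Spark-backed engines.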
--
Mike Dusenberry
GitHub: github.com/dusenberrymw
LinkedIn: linkedin.com/in/mikedusenberry

Sent from my iPhone.

> On Feb 15, 2017, at 12:22 PM, Niketan Pansare <npan...@us.ibm.com> wrote:
>
> Hi Matthias,
>
> I am OK with removing this flag, but would prefer that we keep the JIRA open
> until we are sure that caching is not a bottleneck. I have noticed that the
> gradients turn sparse as we execute more iterations. Also, cache release
> time depends on the memory budget. Here are the statistics from running LeNet
> on MNIST using
> https://github.com/apache/incubator-systemml/tree/master/scripts/staging/SystemML-NN/examples
>
> With 20G driver memory, the statistics after running 10 epochs are as follows:
>
> Epoch: 10, Iter: 700, Train Loss: 0.20480149054528493, Train Accuracy: 0.984375, Val Loss: 0.026928755962383588, Val Accuracy: 0.9922
> Epoch: 10, Iter: 800, Train Loss: 0.20165772217976913, Train Accuracy: 1.0, Val Loss: 0.027878978005867083, Val Accuracy: 0.9922
> 17/02/14 16:06:58 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 12687.863 sec.
> Total compilation time: 2.168 sec.
> Total execution time: 12685.694 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096424/0/0/2.
> Cache writes (WB, FS, HDFS): 603950/15/8.
> Cache times (ACQr/m, RLS, EXP): 3.704/0.336/61.831/1.242 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.663 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.024 sec.
> Spark ctx create time (lazy): 1.009 sec.
> Spark trans counts (par,bc,col): 0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.433 secs.
> Total JIT compile time: 44.711 sec.
> Total JVM GC count: 7459.
> Total JVM GC time: 166.26 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 12138.979 sec 1
> -- 2) conv2d_bias_add 10876.708 sec 17362
> -- 3) conv2d_backward_filter 421.303 sec 17200
> -- 4) sel+ 239.660 sec 25881
> -- 5) update 226.687 sec 68800
> -- 6) update_nesterov 223.775 sec 68800
> -- 7) maxpooling_backward 136.709 sec 17200
> -- 8) conv2d_backward_data 134.315 sec 8600
> -- 9) ba+* 118.897 sec 51762
> -- 10) relu_maxpooling 112.283 sec 17362
> -- 11) relu_backward 107.483 sec 34400
> -- 12) uack+ 89.258 sec 34400
> -- 13) r' 74.304 sec 43000
> -- 14) +* 57.193 sec 34400
> -- 15) * 16.493 sec 95178
> -- 16) rand 16.038 sec 8613
> -- 17) / 8.352 sec 86492
> -- 18) rangeReIndex 6.628 sec 17208
> -- 19) + 3.054 sec 96528
> -- 20) uark+ 2.219 sec 43241
> -- 21) sp_csvrblk 2.183 sec 2
> -- 22) rmvar 1.517 sec 1451571
> -- 23) write 1.250 sec 9
> -- 24) - 1.059 sec 86486
> -- 25) createvar 1.026 sec 587259
> -- 26) exp 0.663 sec 17281
> -- 27) *2 0.361 sec 2
> -- 28) uasqk+ 0.277 sec 320
> -- 29) log 0.200 sec 160
> -- 30) uarmax 0.191 sec 17281
>
> With 5G driver memory, the statistics after running 10 epochs are as follows:
>
> Epoch: 10, Iter: 700, Train Loss: 0.19313544015858036, Train Accuracy: 1.0, Val Loss: 0.025943927403263182, Val Accuracy: 0.993
> Epoch: 10, Iter: 800, Train Loss: 0.1883995965207449, Train Accuracy: 1.0, Val Loss: 0.0260796819319468, Val Accuracy: 0.9916
> 17/02/14 20:16:40 INFO DMLScript: SystemML Statistics:
> Total elapsed time: 13886.763 sec.
> Total compilation time: 2.148 sec.
> Total execution time: 13884.615 sec.
> Number of compiled Spark inst: 147.
> Number of executed Spark inst: 4.
> Cache hits (Mem, WB, FS, HDFS): 1096422/0/2/2.
> Cache writes (WB, FS, HDFS): 603868/2176/8.
> Cache times (ACQr/m, RLS, EXP): 3.883/0.343/271.757/1.312 sec.
> HOP DAGs recompiled (PRED, SB): 0/154885.
> HOP DAGs recompile time: 28.290 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.023 sec.
> Spark ctx create time (lazy): 0.981 sec.
> Spark trans counts (par,bc,col): 0/0/2.
> Spark trans times (par,bc,col): 0.000/0.000/3.501 secs.
> Total JIT compile time: 45.131 sec.
> Total JVM GC count: 7605.
> Total JVM GC time: 157.716 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) train 13301.811 sec 1
> -- 2) conv2d_bias_add 11890.291 sec 17362
> -- 3) conv2d_backward_filter 416.645 sec 17200
> -- 4) ba+* 252.966 sec 51762
> -- 5) sel+ 237.334 sec 25881
> -- 6) update 228.261 sec 68800
> -- 7) update_nesterov 225.383 sec 68800
> -- 8) maxpooling_backward 134.260 sec 17200
> -- 9) +* 133.959 sec 34400
> -- 10) conv2d_backward_data 128.046 sec 8600
> -- 11) relu_maxpooling 106.499 sec 17362
> -- 12) relu_backward 104.062 sec 34400
> -- 13) uack+ 90.104 sec 34400
> -- 14) r' 70.932 sec 43000
> -- 15) * 16.203 sec 95178
> -- 16) rand 16.131 sec 8613
> -- 17) / 7.988 sec 86492
> -- 18) rangeReIndex 7.640 sec 17208
> -- 19) sp_csvrblk 2.220 sec 2
> -- 20) + 2.121 sec 96528
> -- 21) uark+ 2.079 sec 43241
> -- 22) rmvar 1.580 sec 1451571
> -- 23) rshape 1.533 sec 17200
> -- 24) write 1.322 sec 9
> -- 25) createvar 0.976 sec 587259
> -- 26) - 0.961 sec 86486
> -- 27) exp 0.659 sec 17281
> -- 28) uasqk+ 0.314 sec 320
> -- 29) *2 0.312 sec 2
> -- 30) log 0.200 sec 160
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Matthias Boehm <mboe...@googlemail.com>
> To: dev@systemml.incubator.apache.org
> Date: 02/13/2017 04:29 PM
> Subject: Re: Removal of workaround flags
>
> Well, I used exactly the mnist_lenet scenario discussed in the JIRA, but
> what I've observed are eviction times <2.5% of total execution time, almost
> no sparse intermediates, and the script execution time being dominated by
> conv2d_bias_add.
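As a quick sanity check on the statistics quoted in this thread, the cache-release (RLS) share and the conv2d_bias_add share of total execution time can be computed directly from the reported numbers (plain Python; the figures are copied verbatim from the two runs above):

```python
# Per-run shares computed from the SystemML statistics quoted in this thread.
runs = {
    "20G driver": {"rls": 61.831,  "conv2d_bias_add": 10876.708, "exec": 12685.694},
    "5G driver":  {"rls": 271.757, "conv2d_bias_add": 11890.291, "exec": 13884.615},
}
shares = {}
for name, r in runs.items():
    shares[name] = {
        "rls_pct":  100.0 * r["rls"] / r["exec"],
        "conv_pct": 100.0 * r["conv2d_bias_add"] / r["exec"],
    }
    print(f'{name}: RLS {shares[name]["rls_pct"]:.2f}%, '
          f'conv2d_bias_add {shares[name]["conv_pct"]:.2f}%')
```

Both RLS shares (about 0.49% and 1.96%) fall below the <2.5% eviction-time figure Matthias cites, though the 5G run's share is roughly four times the 20G run's, consistent with Niketan's point that cache release time depends on the memory budget; conv2d_bias_add accounts for roughly 86% of execution time in both runs.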
> Again, the discrepancy might very well stem from changes
> made since the JIRA was created.
>
> In any case, I would rather address any existing performance issues than
> globally disable evictions (which could easily lead to OOMs) or sparse
> matrix formats. Hence, I'd like to remove these workaround flags in order
> to prevent shortcuts that do not apply to all users.
>
> Regards,
> Matthias
>
> On Mon, Feb 13, 2017 at 9:19 AM, <dusenberr...@gmail.com> wrote:
>
> > Thanks for bringing up the topic. Our deep learning scripts (i.e.,
> > algorithms with several intermediate transformations) have shown cache
> > release times to be a major bottleneck, thus leading to the creation of
> > SYSTEMML-1140. Specifically, what did you use to attempt to reproduce 1140?
> >
> > -Mike
> >
> > --
> > Mike Dusenberry
> > GitHub: github.com/dusenberrymw
> > LinkedIn: linkedin.com/in/mikedusenberry
> >
> > Sent from my iPhone.
> >
> > > On Feb 12, 2017, at 12:30 AM, Matthias Boehm <mboe...@googlemail.com> wrote:
> > >
> > > SYSTEMML-1140