[ https://issues.apache.org/jira/browse/SYSTEMML-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995706#comment-15995706 ]
Mike Dusenberry commented on SYSTEMML-1566:
-------------------------------------------

Update: I've made some significant improvements from SYSTEMML-1554, SYSTEMML-1575, SYSTEMML-1561 that have cut the execution time for this script in half (1022s -> 493s). Notice how the number of executed Spark jobs has decreased considerably. SYSTEMML-1561 is still a work in progress, but I have a prototype solution.

{code}
17/05/03 14:17:55 INFO DMLScript: SystemML Statistics:
Total elapsed time: 493.003 sec.
Total compilation time: 1.852 sec.
Total execution time: 491.151 sec.
Number of compiled Spark inst: 143.
Number of executed Spark inst: 793.
Cache hits (Mem, WB, FS, HDFS): 156654/0/0/2.
Cache writes (WB, FS, HDFS): 79043/0/8.
Cache times (ACQr/m, RLS, EXP): 3.856/0.052/6.247/1.473 sec.
HOP DAGs recompiled (PRED, SB): 0/5978.
HOP DAGs recompile time: 3.750 sec.
Functions recompiled: 10.
Functions recompile time: 0.087 sec.
Spark ctx create time (lazy): 0.862 sec.
Spark trans counts (par,bc,col):789/789/2.
Spark trans times (par,bc,col): 1.049/0.357/3.771 secs.
Total JIT compile time: 145.35 sec.
Total JVM GC count: 433.
Total JVM GC time: 9.107 sec.
Heavy hitter instructions (name, time, count):
-- 1) train 450.677 sec 1
-- 2) conv2d_bias_add 219.343 sec 3298
-- 3) predict 143.905 sec 9
-- 4) conv2d_backward_filter 73.291 sec 1720
-- 5) ba+* 23.381 sec 5949
-- 6) sel+ 20.917 sec 3369
-- 7) +* 19.232 sec 10320
-- 8) conv2d_backward_data 16.645 sec 860
-- 9) sp_mapmm 15.905 sec 789
-- 10) relu_maxpooling 14.457 sec 3298
{code}

> Possible regression from 0.13 -> 0.14 for MNIST LeNet script
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-1566
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1566
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Mike Dusenberry
>             Fix For: SystemML 1.0
>
>         Attachments: explain.txt, stats.txt
>
>
> For the 0.14 release testing, I tried out the [MNIST LeNet example |
> https://github.com/apache/incubator-systemml/blob/master/scripts/nn/examples/mnist_lenet-train.dml]
> on both 0.13 and 0.14 and noticed a possible regression. Basically, on 0.14
> the script took longer to run and had 2513 Spark instructions executed, while
> on 0.13 only 864 Spark instructions were executed. This was run locally on a
> laptop using the 2 instructions at the top of the script (and copied below).
> I've also attached the stats and runtime explain logs.
> 1. Download data
> {code}
> nn/examples/get_mnist_data.sh
> {code}
> 2. Execute from the {{scripts}} directory.
> {code}
> spark-submit --master local[*] --driver-memory 10G --conf
> spark.driver.maxResultSize=0 --conf spark.rpc.message.maxSize=128
> SystemML.jar -f nn/examples/mnist_lenet-train.dml -stats -explain -nvargs
> train=nn/examples/data/mnist/mnist_train.csv
> test=nn/examples/data/mnist/mnist_test.csv C=1 Hin=28 Win=28 epochs=1
> out_dir=nn/examples/model/mnist_lenet
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)