[ https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao resolved SYSTEMML-2469.
---------------------------------
    Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Large distributed paramserv overheads
> -------------------------------------
>
>                 Key: SYSTEMML-2469
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
> Initial runs of the distributed paramserv implementation on a small cluster
> showed that it works correctly but exhibits large overheads. Below are the
> stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster of
> 1+6 nodes (24 cores per worker node).
> {code}
> Total elapsed time:             687.743 sec.
> Total compilation time:         3.815 sec.
> Total execution time:           683.928 sec.
> Number of compiled Spark inst:  330.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS):    29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time:        4.878 sec.
> Functions recompiled:           1.
> Functions recompile time:       0.097 sec.
> Spark ctx create time (lazy):   22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers:    144.
> Paramserv setup time:           68.259 secs.
> Paramserv grad compute time:    6952.163 secs.
> Paramserv model update time:    2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time:     0.204 secs.
> Paramserv RPC request time:     51611.210 secs.
> ParFor loops optimized:         1.
> ParFor optimize time:           0.462 sec.
> ParFor initialize time:         0.049 sec.
> ParFor result merge time:       0.028 sec.
> ParFor total update in-place:   0/188/188
> Total JIT compile time:         98.786 sec.
> Total JVM GC count:             68.
> Total JVM GC time:              25.858 sec.
> Heavy hitter instructions:
>  #  Instruction      Time(s)  Count
>  1  paramserv        665.479      1
>  2  +                182.410  18636
>  3  conv2d_bias_add  150.938    376
>  4  sqrt              69.768  11528
>  5  /                 54.836  11732
>  6  ba+*              45.901    376
>  7  *                 38.046  11727
>  8  -                 37.428  12096
>  9  ^2                35.533   6344
> 10  exp               21.022    188
> {code}
>
> There seem to be three distinct issues:
> * Too large a number of tasks when assembling the distributed input data
> (one task per row, i.e., >50,000 tasks), which makes the distributed data
> partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see cache writes).
> This is likely due to disabled cleanup (and missing explicit cleanup) of
> all RPC objects.
> * Large RPC overhead: this might be due to the evictions happening in the
> critical path while all 144 workers wait with their RPC requests. In
> addition, we should double-check that the number of RPC handler threads is
> correct, move the serialization and communication out of the critical
> (i.e., synchronized) path of model updates, and address unnecessary
> serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads,
> but it would be great if you could look into the other issues.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
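
The first issue is one of task granularity: scheduling one task per row yields >50,000 tiny tasks, so the partition count should instead scale with the cluster's parallelism. A minimal sketch of that sizing rule (class and method names are hypothetical, not SystemML's actual API):

```java
// Illustrative sketch: cap the number of partitions at a small multiple of
// the available cores instead of creating one task per input row.
class PartitionSketch {
    static int numPartitions(long numRows, int clusterCores, int tasksPerCore) {
        // target a few waves of tasks per core; never exceed the row count
        long target = (long) clusterCores * tasksPerCore;
        return (int) Math.min(numRows, target);
    }

    static long rowsPerPartition(long numRows, int numPartitions) {
        // ceiling division so every row is assigned to some partition
        return (numRows + numPartitions - 1) / numPartitions;
    }
}
```

For the reported setup (50,000 rows, 6 worker nodes with 24 cores each, two task waves per core), this yields 288 partitions of roughly 174 rows each instead of >50,000 single-row tasks.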
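
The third bullet suggests getting serialization out of the synchronized path of model updates. A minimal sketch of that pattern, using plain Java serialization and hypothetical names (not SystemML's actual parameter-server code): take a cheap snapshot of the model while holding the lock, then do the expensive serialization outside it, so the other workers' RPC requests are not blocked behind it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Illustrative sketch: keep only the model update and a cheap copy inside
// the critical section; serialize the snapshot outside the lock.
class ParamServerSketch {
    private final double[] model = {1.0, 2.0, 3.0};
    private final Object lock = new Object();
    private static final double LR = 0.01; // hypothetical learning rate

    byte[] updateAndSerialize(double[] grad) {
        double[] snapshot;
        synchronized (lock) {
            // critical section: apply the gradient and take a cheap copy
            for (int i = 0; i < model.length; i++)
                model[i] -= LR * grad[i];
            snapshot = model.clone();
        }
        // expensive serialization happens off the critical path
        return serialize(snapshot);
    }

    double[] getModel() {
        synchronized (lock) {
            return model.clone();
        }
    }

    private static byte[] serialize(double[] m) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(m);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bos.toByteArray();
    }
}
```

The anti-pattern this avoids is returning `serialize(model)` from inside the `synchronized` block, which would make all 144 workers queue behind each serialization.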