[ https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao resolved SYSTEMML-2469.
---------------------------------
Resolution: Fixed
Fix Version/s: SystemML 1.2
> Large distributed paramserv overheads
> -------------------------------------
>
> Key: SYSTEMML-2469
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
> Project: SystemML
> Issue Type: Bug
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Fix For: SystemML 1.2
>
>
> Initial runs with the distributed paramserv implementation on a small cluster
> revealed that it is working correctly while exhibiting large overheads. Below
> are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster
> of 1+6 nodes (24 cores per worker node).
> {code}
> Total elapsed time: 687.743 sec.
> Total compilation time: 3.815 sec.
> Total execution time: 683.928 sec.
> Number of compiled Spark inst: 330.
> Number of executed Spark inst: 0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS): 29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time: 4.878 sec.
> Functions recompiled: 1.
> Functions recompile time: 0.097 sec.
> Spark ctx create time (lazy): 22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers: 144.
> Paramserv setup time: 68.259 secs.
> Paramserv grad compute time: 6952.163 secs.
> Paramserv model update time: 2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time: 0.204 secs.
> Paramserv RPC request time: 51611.210 secs.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.462 sec.
> ParFor initialize time: 0.049 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/188/188
> Total JIT compile time: 98.786 sec.
> Total JVM GC count: 68.
> Total JVM GC time: 25.858 sec.
> Heavy hitter instructions:
> # Instruction Time(s) Count
> 1 paramserv 665.479 1
> 2 + 182.410 18636
> 3 conv2d_bias_add 150.938 376
> 4 sqrt 69.768 11528
> 5 / 54.836 11732
> 6 ba+* 45.901 376
> 7 * 38.046 11727
> 8 - 37.428 12096
> 9 ^2 35.533 6344
> 10 exp 21.022 188
> {code}
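To put the paramserv timers above in perspective, they can be divided by the number of workers. This is only a rough sketch: it assumes the reported RPC request and gradient compute times are summed over all 144 workers, which the statistics output does not state explicitly. The class and method names are made up for illustration. Under that assumption, each worker spends roughly 358 sec in RPC requests, a bit more than half of the 683.928 sec execution time, versus roughly 48 sec computing gradients.

```java
import java.util.Locale;

/** Illustrative helper (not part of SystemML): per-worker view of the timers above. */
final class ParamservStats {
    // Values copied from the statistics output above.
    static final int NUM_WORKERS = 144;
    static final double EXEC_TIME = 683.928;   // total execution time in sec
    static final double RPC_TIME = 51611.210;  // Paramserv RPC request time in sec
    static final double GRAD_TIME = 6952.163;  // Paramserv grad compute time in sec

    /** Assumption: the timer is a sum over all workers, so divide by the worker count. */
    static double perWorker(double aggregated) {
        return aggregated / NUM_WORKERS;
    }

    public static void main(String[] args) {
        double rpc = perWorker(RPC_TIME);
        double grad = perWorker(GRAD_TIME);
        System.out.printf(Locale.ROOT, "RPC per worker:  %.1f sec (%.0f%% of execution time)%n",
            rpc, 100 * rpc / EXEC_TIME);
        System.out.printf(Locale.ROOT, "Grad per worker: %.1f sec (%.0f%% of execution time)%n",
            grad, 100 * grad / EXEC_TIME);
    }
}
```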
> There seem to be three distinct issues:
> * Too large a number of tasks when assembling the distributed input data (on
> the order of the number of rows, i.e., >50,000 tasks), which makes the
> distributed data partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see cache writes). This
> is likely due to disabling cleanup (and missing explicit cleanup) of all RPC
> objects.
> * Large RPC overhead: This might be due to the evictions happening in the
> critical path while all 144 workers wait with their RPC requests. In
> addition, we should double check that the number of RPC handler threads is
> correct, investigate whether we can move serialization and communication
> out of the critical (i.e., synchronized) path of model updates, and address
> unnecessary serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads, but
> it would be great if you could look into the other issues.
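One way to read the third issue, moving serialization out of the synchronized path of model updates, is sketched below. This is not SystemML's paramserv code: the class `SketchParamServer`, its methods, and the dense `double[]` model are hypothetical, and copy-on-write is just one possible technique. The update replaces the model with a fresh array inside the lock, so a pull only needs the lock long enough to grab a reference and can serialize the immutable snapshot without blocking other workers.

```java
import java.nio.ByteBuffer;

/** Hypothetical parameter server over a dense model; illustration only. */
final class SketchParamServer {
    private final Object lock = new Object();
    private double[] model; // never mutated in place once published

    SketchParamServer(double[] initial) {
        this.model = initial.clone();
    }

    /** Push: only the in-memory update runs inside the synchronized block. */
    void push(double[] gradient, double learningRate) {
        synchronized (lock) {
            double[] next = model.clone(); // copy-on-write keeps old snapshots valid
            for (int i = 0; i < next.length; i++) {
                next[i] -= learningRate * gradient[i];
            }
            model = next;
        }
    }

    /** Pull: grab a snapshot reference under the lock, serialize outside of it. */
    byte[] pull() {
        double[] snapshot;
        synchronized (lock) {
            snapshot = model;
        }
        return serialize(snapshot); // potentially slow, but no longer blocks updates
    }

    /** Simple length-prefixed encoding: one int followed by the doubles. */
    static byte[] serialize(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Double.BYTES * values.length);
        buf.putInt(values.length);
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }
}
```

The trade-off is an extra array copy per update; whether that is cheaper than serializing under the lock depends on the model size and update frequency, so it would need to be measured.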