Matthias Boehm created SYSTEMML-2469:
----------------------------------------
Summary: Large distributed paramserv overheads
Key: SYSTEMML-2469
URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
Project: SystemML
Issue Type: Bug
Reporter: Matthias Boehm
Initial runs with the distributed paramserv implementation on a small cluster
revealed that it works correctly but exhibits large overheads. Below are the
stats for mnist lenet, 10 epochs, ASP, update frequency EPOCH, on a cluster
of 1+6 nodes (24 cores per worker node).
{code}
Total elapsed time: 687.743 sec.
Total compilation time: 3.815 sec.
Total execution time: 683.928 sec.
Number of compiled Spark inst: 330.
Number of executed Spark inst: 0.
Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
Cache writes (WB, FS, HDFS): 29856/5271/0.
Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1629.
HOP DAGs recompile time: 4.878 sec.
Functions recompiled: 1.
Functions recompile time: 0.097 sec.
Spark ctx create time (lazy): 22.222 sec.
Spark trans counts (par,bc,col):2/1/0.
Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
Paramserv total num workers: 144.
Paramserv setup time: 68.259 secs.
Paramserv grad compute time: 6952.163 secs.
Paramserv model update time: 2453.448/422.955 secs.
Paramserv model broadcast time: 24.982 secs.
Paramserv batch slice time: 0.204 secs.
Paramserv RPC request time: 51611.210 secs.
ParFor loops optimized: 1.
ParFor optimize time: 0.462 sec.
ParFor initialize time: 0.049 sec.
ParFor result merge time: 0.028 sec.
ParFor total update in-place: 0/188/188
Total JIT compile time: 98.786 sec.
Total JVM GC count: 68.
Total JVM GC time: 25.858 sec.
Heavy hitter instructions:
# Instruction Time(s) Count
1 paramserv 665.479 1
2 + 182.410 18636
3 conv2d_bias_add 150.938 376
4 sqrt 69.768 11528
5 / 54.836 11732
6 ba+* 45.901 376
7 * 38.046 11727
8 - 37.428 12096
9 ^2 35.533 6344
10 exp 21.022 188
{code}
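Note that the paramserv times above are presumably accumulated across all 144
workers. For example, the RPC request time of 51611.210 s amounts to
51611.210/144 ≈ 358 s per worker, i.e., each worker spends over half of the
687 s end-to-end runtime waiting on RPC, while the gradient compute time of
6952.163 s is only ≈ 48 s per worker.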
There seem to be three distinct issues:
* Too large a number of tasks when assembling the distributed input data (on
the order of the number of rows, i.e., >50,000 tasks), which makes the
distributed data partitioning very slow (multiple minutes); see the first
sketch after this list.
* Evictions from the buffer pool at the driver node (see the cache writes
above). This is likely due to disabled cleanup (and missing explicit cleanup)
of all RPC objects; see the second sketch after this list.
* Large RPC overhead: This might be due to the evictions happening in the
critical path while all 144 workers wait with their RPC requests. In addition,
we should double-check that the number of RPC handler threads is configured
correctly, investigate whether we can get the serialization and communication
out of the critical (i.e., synchronized) path of model updates (see the third
sketch after this list), and address unnecessary serialization/deserialization
overheads.
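For the first issue, a minimal sketch of one possible fix, assuming we build
the worker input via the standard Spark Java API: create one contiguous row
range per worker and parallelize with an explicit partition count, so the
number of tasks equals the number of workers rather than the number of rows.
The sliceRows helper is illustrative only, and the exact MatrixBlock slicing
API may differ across versions.
{code}
// Hedged sketch: one contiguous row range per worker => numWorkers tasks
// instead of >50,000. The sliceRows helper is illustrative only.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.sysml.runtime.matrix.data.MatrixBlock;

import scala.Tuple2;

public class PartitionSketch {
  // illustrative helper: extract rows [lo, hi) of X as a new block
  // (inclusive 0-based bounds; exact slice API to be confirmed)
  static MatrixBlock sliceRows(MatrixBlock X, int lo, int hi) {
    return X.sliceOperations(lo, hi - 1, 0, X.getNumColumns() - 1, new MatrixBlock());
  }

  public static JavaPairRDD<Integer, MatrixBlock> partitionByWorker(
      JavaSparkContext sc, MatrixBlock X, int numWorkers) {
    int numRows = X.getNumRows();
    List<Tuple2<Integer, MatrixBlock>> slices = new ArrayList<>();
    for (int w = 0; w < numWorkers; w++) {
      int lo = (int) ((long) w * numRows / numWorkers);
      int hi = (int) ((long) (w + 1) * numRows / numWorkers);
      slices.add(new Tuple2<>(w, sliceRows(X, lo, hi)));
    }
    // explicit partition count: one partition (and hence one task) per worker
    return sc.parallelizePairs(slices, numWorkers);
  }
}
{code}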
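For the buffer-pool evictions, the fix is probably to free deserialized RPC
intermediates eagerly instead of leaving them to the driver's buffer pool. A
minimal sketch of the pattern follows; deserialize, updateModel, and cleanup
are hypothetical stand-ins for whatever rmvar-style cleanup we settle on.
{code}
// Hedged sketch (hypothetical helper names): free deserialized RPC
// intermediates eagerly instead of relying on buffer-pool eviction.
import org.apache.sysml.runtime.instructions.cp.ListObject;

abstract class GradientHandlerSketch {
  abstract ListObject deserialize(byte[] payload);  // hypothetical
  abstract void updateModel(ListObject gradients);  // aggregate into the global model
  abstract void cleanup(ListObject gradients);      // explicit rmvar-style cleanup (exact API TBD)

  void handleGradientPush(byte[] payload) {
    ListObject gradients = deserialize(payload);
    try {
      updateModel(gradients);
    }
    finally {
      // release the intermediate right away so it never hits the buffer pool
      cleanup(gradients);
    }
  }
}
{code}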
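For the RPC overhead, one direction worth trying is to keep serialization and
deserialization strictly outside the lock that guards model updates, so the
144 workers never serialize while holding it. A minimal sketch of the pattern,
with all helper methods hypothetical:
{code}
// Hedged sketch (hypothetical helper names): expensive (de)serialization
// happens outside the lock; only the cheap model snapshot/update is locked.
import org.apache.sysml.runtime.instructions.cp.ListObject;

abstract class PSServerSketch {
  private final Object lock = new Object();
  private ListObject model;

  abstract ListObject copyReference(ListObject m);             // cheap snapshot
  abstract byte[] serialize(ListObject m);                     // hypothetical
  abstract ListObject deserialize(byte[] b);                   // hypothetical
  abstract ListObject applyUpdate(ListObject m, ListObject g); // cheap in-memory update

  byte[] pull() {
    ListObject snapshot;
    synchronized (lock) {
      snapshot = copyReference(model); // only the cheap snapshot is locked
    }
    return serialize(snapshot);        // expensive work outside the lock
  }

  void push(byte[] payload) {
    ListObject gradients = deserialize(payload); // expensive work outside the lock
    synchronized (lock) {
      model = applyUpdate(model, gradients);     // only the cheap update is locked
    }
  }
}
{code}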
[~Guobao] I'll help with reducing the serialization/deserialization overheads,
but it would be great if you could look into the other issues.