Matthias Boehm created SYSTEMML-2398:
----------------------------------------
Summary: Paramserv ASP aggregation overhead with update per epoch
Key: SYSTEMML-2398
URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
Project: SystemML
Issue Type: Bug
Reporter: Matthias Boehm
Here are the statistics for mnist60K, 2 epochs, 80 workers in ASP mode:
{code}
SystemML Statistics:
Total elapsed time: 449.548 sec.
Total compilation time: 1.995 sec.
Total execution time: 447.553 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 970241/0/0/2.
Cache writes (WB, FS, HDFS): 55191/0/0.
Cache times (ACQr/m, RLS, EXP): 1.048/0.120/1.087/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/13582.
HOP DAGs recompile time: 24.473 sec.
Functions recompiled: 1.
Functions recompile time: 0.013 sec.
Paramserv func number of workers: 79.
Paramserv func total gradients compute time: 1714.962 secs.
Paramserv func total aggregation time: 428.499 secs.
Paramserv func model broadcasting time: 2.080 secs.
Paramserv func total batch slicing time: 0.0190000000 secs.
Total JIT compile time: 37.461 sec.
Total JVM GC count: 66.
Total JVM GC time: 7.098 sec.
Heavy hitter instructions:
# Instruction Time(s) Count
1 conv2d_bias_add 719.111 13768
2 paramserv 437.051 1
3 relu_backward 210.414 20370
4 ba+* 180.001 40928
5 conv2d_backward_filter 175.104 13580
6 +* 156.714 81480
7 conv2d_backward_data 140.779 6790
8 * 123.502 95173
9 -* 104.058 54320
10 - 94.502 74985
{code}
As we can see, aggregation is a major bottleneck: it accounts for roughly 428s of the 437s spent in the paramserv instruction, even though the per-worker gradient computation runs in parallel. This is unexpected given the coarse-grained update per epoch. [~Guobao], could you please have a look and profile where this overhead is coming from?
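For reference, a minimal DML sketch of a paramserv invocation that roughly matches this configuration (ASP, update per epoch, 80 workers, 2 epochs). The model matrices, update/aggregation function paths, batch size, and hyperparameters are placeholders for illustration only; exact parameter names should be checked against the paramserv builtin documentation.
{code}
# Hypothetical sketch, not taken from this run: model and hyperparameters are placeholders
modelList = list(W1, b1, W2, b2)
hyperparams = list(lr=0.01)

# upd/agg point to user-defined gradient and aggregation functions (paths are placeholders)
resultModel = paramserv(model=modelList, features=X, labels=Y,
  upd="./mnist_lenet_paramserv.dml::gradients",
  agg="./mnist_lenet_paramserv.dml::aggregation",
  mode="LOCAL", utype="ASP", freq="EPOCH",
  epochs=2, batchsize=64, k=80, hyperparams=hyperparams)
{code}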