Matthias Boehm created SYSTEMML-2398:
----------------------------------------

             Summary: Paramserv ASP aggregation overhead with update per epoch
                 Key: SYSTEMML-2398
                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
             Project: SystemML
          Issue Type: Bug
            Reporter: Matthias Boehm


Here are the statistics for mnist60K with 2 epochs and 80 workers in ASP mode:
{code}
SystemML Statistics:
Total elapsed time:             449.548 sec.
Total compilation time:         1.995 sec.
Total execution time:           447.553 sec.
Number of compiled MR Jobs:     0.
Number of executed MR Jobs:     0.
Cache hits (Mem, WB, FS, HDFS): 970241/0/0/2.
Cache writes (WB, FS, HDFS):    55191/0/0.
Cache times (ACQr/m, RLS, EXP): 1.048/0.120/1.087/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/13582.
HOP DAGs recompile time:        24.473 sec.
Functions recompiled:           1.
Functions recompile time:       0.013 sec.
Paramserv func number of workers:       79.
Paramserv func total gradients compute time:    1714.962 secs.
Paramserv func total aggregation time:  428.499 secs.
Paramserv func model broadcasting time: 2.080 secs.
Paramserv func total batch slicing time:        0.0190000000 secs.
Total JIT compile time:         37.461 sec.
Total JVM GC count:             66.
Total JVM GC time:              7.098 sec.
Heavy hitter instructions:
  #  Instruction             Time(s)  Count
  1  conv2d_bias_add         719.111  13768
  2  paramserv               437.051      1
  3  relu_backward           210.414  20370
  4  ba+*                    180.001  40928
  5  conv2d_backward_filter  175.104  13580
  6  +*                      156.714  81480
  7  conv2d_backward_data    140.779   6790
  8  *                       123.502  95173
  9  -*                      104.058  54320
 10  -                        94.502  74985
{code}

As the statistics show, aggregation is a major bottleneck (428.499 secs. of 
total aggregation time). This is unexpected given the coarse-grained, per-epoch 
update frequency. [~Guobao] could you please have a look and profile where this 
is coming from?
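
To put the aggregation time in perspective, here is a rough back-of-the-envelope sketch in Python using only numbers copied from the statistics above. It assumes that with per-epoch updates each worker pushes its gradients once per epoch, and it does not know whether the reported aggregation time is wall-clock or summed across threads, so the per-push figure is only an order-of-magnitude estimate. The function names are made up for illustration and are not part of SystemML.

```python
# Back-of-the-envelope check; all constants are copied from the statistics above.
NUM_WORKERS = 79               # "Paramserv func number of workers" (the run was started with 80)
NUM_EPOCHS = 2                 # from the run description
AGG_TIME_SEC = 428.499         # "Paramserv func total aggregation time"
PARAMSERV_TIME_SEC = 437.051   # heavy hitter #2, the paramserv instruction


def expected_pushes(workers, epochs):
    """Assumption: with per-epoch updates, each worker pushes once per epoch."""
    return workers * epochs


def avg_time_per_push(total_agg_sec, pushes):
    """Average aggregation time attributed to a single push."""
    return total_agg_sec / pushes


pushes = expected_pushes(NUM_WORKERS, NUM_EPOCHS)
per_push = avg_time_per_push(AGG_TIME_SEC, pushes)
agg_share = AGG_TIME_SEC / PARAMSERV_TIME_SEC

print(f"expected pushes: {pushes}")
print(f"avg aggregation time per push: {per_push:.3f} s")
print(f"aggregation / paramserv instruction time: {agg_share:.1%}")
```

Under these assumptions there are only on the order of a hundred or so aggregations in the whole run, so each one would cost seconds rather than milliseconds, which is why the overhead looks suspicious.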



