[ 
https://issues.apache.org/jira/browse/SYSTEMML-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao closed SYSTEMML-2398.
-------------------------------

It is resolved by no longer invoking the synchronized _updateModel_ method from 
multiple worker threads, which led to intense serialization between the ps and 
the workers.
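
As a rough illustration of the idea (not the actual patch), here is a minimal
Java sketch of one common way to keep worker threads out of a synchronized
update method: workers only enqueue their gradients, and a single aggregator
thread applies them to the model. The class name SingleAggregatorSketch, the
push and run methods, and the double[] model are all hypothetical and are not
taken from the SystemML code base; only the name updateModel comes from the
comment above.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: not SystemML code.
class SingleAggregatorSketch {
    // Stand-in for the model held by the parameter server (ps).
    private final double[] model;
    // Gradients pushed by workers wait here until the aggregator takes them.
    private final BlockingQueue<double[]> gradients = new LinkedBlockingQueue<>();

    SingleAggregatorSketch(int size) {
        model = new double[size];
    }

    // Called by worker threads: enqueue and return, no lock on the model.
    void push(double[] gradient) {
        gradients.add(gradient);
    }

    // Called only by the single aggregator thread, so it is not synchronized.
    private void updateModel(double[] gradient) {
        for (int i = 0; i < model.length; i++) {
            model[i] += gradient[i];
        }
    }

    // Runs numWorkers threads that each push pushesPerWorker all-ones gradients.
    static double[] run(int numWorkers, int pushesPerWorker, int size) {
        SingleAggregatorSketch ps = new SingleAggregatorSketch(size);
        int expected = numWorkers * pushesPerWorker;

        Thread aggregator = new Thread(() -> {
            try {
                for (int n = 0; n < expected; n++) {
                    ps.updateModel(ps.gradients.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        aggregator.start();

        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            workers[w] = new Thread(() -> {
                for (int k = 0; k < pushesPerWorker; k++) {
                    double[] g = new double[size];
                    Arrays.fill(g, 1.0);
                    ps.push(g);
                }
            });
            workers[w].start();
        }

        try {
            for (Thread t : workers) {
                t.join();
            }
            // join() makes the aggregator's writes to the model visible here.
            aggregator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return ps.model;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(8, 1000, 4)));
    }
}
```

In this sketch the workers never contend for a lock on the model; they only
contend briefly on the queue. Every model entry ends up equal to the total
number of pushed gradients.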

> Paramserv ASP aggregation overhead on update per epoch
> ------------------------------------------------------
>
>                 Key: SYSTEMML-2398
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2398
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>
> Here are the statistics for mnist60K, 2 epochs, 80 workers in ASP:
> {code}
> SystemML Statistics:
> Total elapsed time:           449.548 sec.
> Total compilation time:               1.995 sec.
> Total execution time:         447.553 sec.
> Number of compiled MR Jobs:   0.
> Number of executed MR Jobs:   0.
> Cache hits (Mem, WB, FS, HDFS):       970241/0/0/2.
> Cache writes (WB, FS, HDFS):  55191/0/0.
> Cache times (ACQr/m, RLS, EXP):       1.048/0.120/1.087/0.000 sec.
> HOP DAGs recompiled (PRED, SB):       0/13582.
> HOP DAGs recompile time:      24.473 sec.
> Functions recompiled:         1.
> Functions recompile time:     0.013 sec.
> Paramserv func number of workers:     79.
> Paramserv func total gradients compute time:  1714.962 secs.
> Paramserv func total aggregation time:        428.499 secs.
> Paramserv func model broadcasting time:       2.080 secs.
> Paramserv func total batch slicing time:      0.0190000000 secs.
> Total JIT compile time:               37.461 sec.
> Total JVM GC count:           66.
> Total JVM GC time:            7.098 sec.
> Heavy hitter instructions:
>   #  Instruction             Time(s)  Count
>   1  conv2d_bias_add         719.111  13768
>   2  paramserv               437.051      1
>   3  relu_backward           210.414  20370
>   4  ba+*                    180.001  40928
>   5  conv2d_backward_filter  175.104  13580
>   6  +*                      156.714  81480
>   7  conv2d_backward_data    140.779   6790
>   8  *                       123.502  95173
>   9  -*                      104.058  54320
>  10  -                        94.502  74985
> {code}
> As we can see, the aggregation is a major bottleneck. This is unexpected given 
> the coarse-grained update per epoch. [~Guobao] could you please have a look 
> and profile where this is coming from?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
