[ 
https://issues.apache.org/jira/browse/SYSTEMML-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao resolved SYSTEMML-2469.
---------------------------------
       Resolution: Fixed
    Fix Version/s: SystemML 1.2

> Large distributed paramserv overheads
> -------------------------------------
>
>                 Key: SYSTEMML-2469
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>             Fix For: SystemML 1.2
>
>
> Initial runs with the distributed paramserv implementation on a small cluster 
> revealed that it is working correctly while exhibiting large overheads. Below 
> are the stats for mnist lenet, 10 epochs, ASP, update per EPOCH, on a cluster 
> of 1+6 nodes (24 cores per worker node). 
> {code}
> Total elapsed time:            687.743 sec.
> Total compilation time:         3.815 sec.
> Total execution time:           683.928 sec.
> Number of compiled Spark inst:  330.
> Number of executed Spark inst:  0.
> Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
> Cache writes (WB, FS, HDFS):    29856/5271/0.
> Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/1629.
> HOP DAGs recompile time:        4.878 sec.
> Functions recompiled:           1.
> Functions recompile time:       0.097 sec.
> Spark ctx create time (lazy):   22.222 sec.
> Spark trans counts (par,bc,col):2/1/0.
> Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
> Paramserv total num workers:    144.
> Paramserv setup time:           68.259 secs.
> Paramserv grad compute time:    6952.163 secs.
> Paramserv model update time:    2453.448/422.955 secs.
> Paramserv model broadcast time: 24.982 secs.
> Paramserv batch slice time:     0.204 secs.
> Paramserv RPC request time:     51611.210 secs.
> ParFor loops optimized:         1.
> ParFor optimize time:           0.462 sec.
> ParFor initialize time:         0.049 sec.
> ParFor result merge time:       0.028 sec.
> ParFor total update in-place:   0/188/188
> Total JIT compile time:         98.786 sec.
> Total JVM GC count:             68.
> Total JVM GC time:              25.858 sec.
> Heavy hitter instructions:
>   #  Instruction      Time(s)  Count
>   1  paramserv        665.479      1
>   2  +                182.410  18636
>   3  conv2d_bias_add  150.938    376
>   4  sqrt              69.768  11528
>   5  /                 54.836  11732
>   6  ba+*              45.901    376
>   7  *                 38.046  11727
>   8  -                 37.428  12096
>   9  ^2                35.533   6344
>  10  exp               21.022    188
> {code}
> There seem to be three distinct issues:
> * Too large a number of tasks for assembling the distributed input data (one 
> task per row, i.e., >50,000 tasks), which makes the distributed data 
> partitioning very slow (multiple minutes).
> * Evictions from the buffer pool at the driver node (see cache writes). This 
> is likely due to disabling cleanup (and missing explicit cleanup) of all RPC 
> objects.
> * Large RPC overhead: This might be due to the evictions happening in the 
> critical path and all 144 workers waiting with their RPC requests. In 
> addition, however, we should double-check that the number of RPC handler 
> threads is correct, investigate whether we can move the serialization and 
> communication out of the critical (i.e., synchronized) path of model 
> updates, and address unnecessary serialization/deserialization overheads.
> [~Guobao] I'll help reduce the serialization/deserialization overheads, but 
> it would be great if you could have a look at the other issues.
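
The third issue above suggests moving serialization out of the synchronized path of model updates. A minimal sketch of that idea (hypothetical class and method names, not SystemML's actual RPC code): the lock guards only the in-memory model update and a cheap snapshot copy, while the expensive byte-level serialization of the RPC response runs outside the critical section, so concurrent workers are not blocked on it.

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of keeping serialization out of the synchronized model-update path.
// All names here are illustrative, not taken from the SystemML code base.
public class ParamServerSketch {
    private final ReentrantLock modelLock = new ReentrantLock();
    private double[] model = new double[] {0.0, 0.0};

    // Apply a gradient under the lock, but return a *copy* of the model so
    // that serialization for the RPC response can run without holding it.
    public double[] updateAndSnapshot(double[] gradient) {
        modelLock.lock();
        try {
            for (int i = 0; i < model.length; i++)
                model[i] -= 0.1 * gradient[i];          // critical section: update only
            return Arrays.copyOf(model, model.length);  // cheap copy under the lock
        } finally {
            modelLock.unlock();
        }
    }

    // Stand-in for the real (Java/Spark) serialization; runs outside the
    // synchronized path, so other workers' RPC requests are not delayed.
    public static byte[] serialize(double[] snapshot) {
        byte[] out = new byte[snapshot.length * Double.BYTES];
        java.nio.ByteBuffer.wrap(out).asDoubleBuffer().put(snapshot);
        return out;
    }

    public static void main(String[] args) {
        ParamServerSketch ps = new ParamServerSketch();
        double[] snap = ps.updateAndSnapshot(new double[] {1.0, -1.0});
        byte[] payload = serialize(snap);   // serialization happens outside the lock
        System.out.println(snap.length + " params, " + payload.length + " bytes");
    }
}
```

The same pattern applies to deserializing incoming gradients: decode the RPC payload before acquiring the lock, so only the arithmetic of the model update is serialized across workers.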



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
