Matthias Boehm created SYSTEMML-2469:
----------------------------------------
Summary: Large distributed paramserv overheads
Key: SYSTEMML-2469
URL: https://issues.apache.org/jira/browse/SYSTEMML-2469
Project: SystemML
Issue Type: Bug
Reporter: Matthias Boehm
Initial runs with the distributed paramserv implementation on a small cluster
revealed that it works correctly but exhibits large overheads. Below are the
stats for mnist lenet, 10 epochs, ASP, update frequency EPOCH, on a cluster
of 1+6 nodes (24 cores per worker node).
{code}
Total elapsed time: 687.743 sec.
Total compilation time: 3.815 sec.
Total execution time: 683.928 sec.
Number of compiled Spark inst: 330.
Number of executed Spark inst: 0.
Cache hits (Mem, WB, FS, HDFS): 176210/0/0/2.
Cache writes (WB, FS, HDFS): 29856/5271/0.
Cache times (ACQr/m, RLS, EXP): 1.178/0.087/198.892/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/1629.
HOP DAGs recompile time: 4.878 sec.
Functions recompiled: 1.
Functions recompile time: 0.097 sec.
Spark ctx create time (lazy): 22.222 sec.
Spark trans counts (par,bc,col):2/1/0.
Spark trans times (par,bc,col): 0.390/0.242/0.000 secs.
Paramserv total num workers: 144.
Paramserv setup time: 68.259 secs.
Paramserv grad compute time: 6952.163 secs.
Paramserv model update time: 2453.448/422.955 secs.
Paramserv model broadcast time: 24.982 secs.
Paramserv batch slice time: 0.204 secs.
Paramserv RPC request time: 51611.210 secs.
ParFor loops optimized: 1.
ParFor optimize time: 0.462 sec.
ParFor initialize time: 0.049 sec.
ParFor result merge time: 0.028 sec.
ParFor total update in-place: 0/188/188
Total JIT compile time: 98.786 sec.
Total JVM GC count: 68.
Total JVM GC time: 25.858 sec.
Heavy hitter instructions:
# Instruction Time(s) Count
1 paramserv 665.479 1
2 + 182.410 18636
3 conv2d_bias_add 150.938 376
4 sqrt 69.768 11528
5 / 54.836 11732
6 ba+* 45.901 376
7 * 38.046 11727
8 - 37.428 12096
9 ^2 35.533 6344
10 exp 21.022 188
{code}
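Note that the paramserv times above are presumably accumulated across all 144
workers. For example, the RPC request time of 51611.210 s amounts to
51611.210/144 ≈ 358 s per worker, i.e., each worker spends over half of the
687 s end-to-end runtime waiting on RPC, while the gradient compute time of
6952.163 s is only ≈ 48 s per worker.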
There seem to be three distinct issues:
* Too large a number of tasks when assembling the distributed input data (on
the order of the number of rows, i.e., >50,000 tasks), which makes the
distributed data partitioning very slow (multiple minutes); see the first
sketch after this list.
* Evictions from the buffer pool at the driver node (see the cache writes
above). This is likely due to disabled cleanup (and missing explicit cleanup)
of all RPC objects; see the second sketch after this list.
* Large RPC overhead: This might be due to the evictions happening in the
critical path while all 144 workers wait with their RPC requests. In addition,
we should double-check that the number of RPC handler threads is configured
correctly, investigate whether we can get the serialization and communication
out of the critical (i.e., synchronized) path of model updates (see the third
sketch after this list), and address unnecessary serialization/deserialization
overheads.
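For the first issue, a minimal sketch of one possible fix, assuming we build
the worker input via the standard Spark Java API: create one contiguous row
range per worker and parallelize with an explicit partition count, so the
number of tasks equals the number of workers rather than the number of rows.
The sliceRows helper is illustrative only, and the exact MatrixBlock slicing
API may differ across versions.
{code}
// Hedged sketch: one contiguous row range per worker => numWorkers tasks
// instead of >50,000. The sliceRows helper is illustrative only.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.sysml.runtime.matrix.data.MatrixBlock;

import scala.Tuple2;

public class PartitionSketch {
  // illustrative helper: extract rows [lo, hi) of X as a new block
  // (inclusive 0-based bounds; exact slice API to be confirmed)
  static MatrixBlock sliceRows(MatrixBlock X, int lo, int hi) {
    return X.sliceOperations(lo, hi - 1, 0, X.getNumColumns() - 1, new MatrixBlock());
  }

  public static JavaPairRDD<Integer, MatrixBlock> partitionByWorker(
      JavaSparkContext sc, MatrixBlock X, int numWorkers) {
    int numRows = X.getNumRows();
    List<Tuple2<Integer, MatrixBlock>> slices = new ArrayList<>();
    for (int w = 0; w < numWorkers; w++) {
      int lo = (int) ((long) w * numRows / numWorkers);
      int hi = (int) ((long) (w + 1) * numRows / numWorkers);
      slices.add(new Tuple2<>(w, sliceRows(X, lo, hi)));
    }
    // explicit partition count: one partition (and hence one task) per worker
    return sc.parallelizePairs(slices, numWorkers);
  }
}
{code}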
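For the buffer-pool evictions, the fix is probably to free deserialized RPC
intermediates eagerly instead of leaving them to the driver's buffer pool. A
minimal sketch of the pattern follows; deserialize, updateModel, and cleanup
are hypothetical stand-ins for whatever rmvar-style cleanup we settle on.
{code}
// Hedged sketch (hypothetical helper names): free deserialized RPC
// intermediates eagerly instead of relying on buffer-pool eviction.
import org.apache.sysml.runtime.instructions.cp.ListObject;

abstract class GradientHandlerSketch {
  abstract ListObject deserialize(byte[] payload);  // hypothetical
  abstract void updateModel(ListObject gradients);  // aggregate into the global model
  abstract void cleanup(ListObject gradients);      // explicit rmvar-style cleanup (exact API TBD)

  void handleGradientPush(byte[] payload) {
    ListObject gradients = deserialize(payload);
    try {
      updateModel(gradients);
    }
    finally {
      // release the intermediate right away so it never hits the buffer pool
      cleanup(gradients);
    }
  }
}
{code}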
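For the RPC overhead, one direction worth trying is to keep serialization and
deserialization strictly outside the lock that guards model updates, so the
144 workers never serialize while holding it. A minimal sketch of the pattern,
with all helper methods hypothetical:
{code}
// Hedged sketch (hypothetical helper names): expensive (de)serialization
// happens outside the lock; only the cheap model snapshot/update is locked.
import org.apache.sysml.runtime.instructions.cp.ListObject;

abstract class PSServerSketch {
  private final Object lock = new Object();
  private ListObject model;

  abstract ListObject copyReference(ListObject m);             // cheap snapshot
  abstract byte[] serialize(ListObject m);                     // hypothetical
  abstract ListObject deserialize(byte[] b);                   // hypothetical
  abstract ListObject applyUpdate(ListObject m, ListObject g); // cheap in-memory update

  byte[] pull() {
    ListObject snapshot;
    synchronized (lock) {
      snapshot = copyReference(model); // only the cheap snapshot is locked
    }
    return serialize(snapshot);        // expensive work outside the lock
  }

  void push(byte[] payload) {
    ListObject gradients = deserialize(payload); // expensive work outside the lock
    synchronized (lock) {
      model = applyUpdate(model, gradients);     // only the cheap update is locked
    }
  }
}
{code}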
[~Guobao] I'll help with reducing the serialization/deserialization overheads,
but it would be great if you could look into the other issues.