[ https://issues.apache.org/jira/browse/SYSTEMML-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566118#comment-16566118 ]

Matthias Boehm commented on SYSTEMML-2478:
------------------------------------------

Well, first of all, we're not executing MR but SPARK instructions here. Second, 
yes, there seems to be an issue, but I was not able to reproduce it yet because 
(even after fixing the order of model entries to allow indexed access) there 
are still some incorrect lookups that ultimately result in dimension mismatches 
on aggregation with ADAM. So let's use the sequential aggregation for now; I 
will have to come back to this later.
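For reference, a minimal sketch of the sequential aggregation variant mentioned 
above, assuming the loop body of the update function quoted below stays the same 
and only the parfor is replaced by a plain for loop:

{code:java}
aggregation = function(list[unknown] model,
                       list[unknown] gradients,
                       list[unknown] hyperparams)
   return (list[unknown] modelResult) {
     lr = as.double(as.scalar(hyperparams["lr"]))
     mu = as.double(as.scalar(hyperparams["mu"]))
     modelResult = model
     # Sequential SGD w/ Nesterov momentum over the 8 parameter/velocity pairs
     for(i in 1:8) {
       P = as.matrix(model[i])
       dP = as.matrix(gradients[i])
       vP = as.matrix(model[8+i])
       [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
       modelResult[i] = P
       modelResult[8+i] = vP
     }
   }
{code}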

> Overhead when using parfor in update func
> -----------------------------------------
>
>                 Key: SYSTEMML-2478
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2478
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: LI Guobao
>            Priority: Major
>
> When using parfor inside the update function, some MR tasks are launched to write 
> the task output, and the paramserv run takes more time to finish than without 
> parfor in the update function. The scenario is to launch the ASP Epoch DC 
> Spark paramserv test.
> Here is the stack:
> {code:java}
> Total elapsed time:           101.804 sec.
> Total compilation time:               3.690 sec.
> Total execution time:         98.114 sec.
> Number of compiled Spark inst:        302.
> Number of executed Spark inst:        540.
> Cache hits (Mem, WB, FS, HDFS):       57839/0/0/240.
> Cache writes (WB, FS, HDFS):  14567/58/61.
> Cache times (ACQr/m, RLS, EXP):       42.346/0.064/4.761/20.280 sec.
> HOP DAGs recompiled (PRED, SB):       0/144.
> HOP DAGs recompile time:      0.507 sec.
> Functions recompiled:         16.
> Functions recompile time:     0.064 sec.
> Spark ctx create time (lazy): 1.376 sec.
> Spark trans counts (par,bc,col):270/1/240.
> Spark trans times (par,bc,col):       0.573/0.197/42.255 secs.
> Paramserv total num workers:  3.
> Paramserv setup time:         1.559 secs.
> Paramserv grad compute time:  105.701 secs.
> Paramserv model update time:  56.801/47.193 secs.
> Paramserv model broadcast time:       23.872 secs.
> Paramserv batch slice time:   0.000 secs.
> Paramserv RPC request time:   105.159 secs.
> ParFor loops optimized:               1.
> ParFor optimize time:         0.040 sec.
> ParFor initialize time:               0.434 sec.
> ParFor result merge time:     0.005 sec.
> ParFor total update in-place: 0/7/7
> Total JIT compile time:               68.384 sec.
> Total JVM GC count:           1120.
> Total JVM GC time:            22.338 sec.
> Heavy hitter instructions:
>   #  Instruction             Time(s)  Count
>   1  paramserv                97.221      1
>   2  conv2d_bias_add          60.581    614
>   3  *                        54.990  12447
>   4  sp_-                     20.625    240
>   5  -                        17.979   7287
>   6  +                        14.191  12824
>   7  r'                        5.636   1200
>   8  conv2d_backward_filter    5.123    600
>   9  max                       4.985    907
>  10  ba+*                      4.591   1814
> {code}
> Here is the polished update func:
> {code:java}
> aggregation = function(list[unknown] model,
>                        list[unknown] gradients,
>                        list[unknown] hyperparams)
>    return (list[unknown] modelResult) {
>      lr = as.double(as.scalar(hyperparams["lr"]))
>      mu = as.double(as.scalar(hyperparams["mu"]))
>      modelResult = model
>      # Optimize with SGD w/ Nesterov momentum
>      parfor(i in 1:8, check=0) {
>        P = as.matrix(model[i])
>        dP = as.matrix(gradients[i])
>        vP = as.matrix(model[8+i])
>        [P, vP] = sgd_nesterov::update(P, dP, lr, mu, vP)
>        modelResult[i] = P
>        modelResult[8+i] = vP
>      }
>    }
> {code}
> [~mboehm7], in fact, I have no idea where the cause comes from. It seems that 
> it tries to write the parfor task output to HDFS. Is this the normal 
> behavior?


