[jira] [Comment Edited] (SYSTEMML-2349) Local worker error handling

Matthias Boehm (JIRA) Wed, 30 May 2018 14:14:33 -0700


    [ 
https://issues.apache.org/jira/browse/SYSTEMML-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495685#comment-16495685
 ]


Matthias Boehm edited comment on SYSTEMML-2349 at 5/30/18 9:13 PM:
-------------------------------------------------------------------

Well, since both workers and aggregator execute user-defined functions, we 
need to make the threads robust enough to handle errors and ensure 
termination. One approach is a callback mechanism: (1) keep the threads as 
members in the invoking instruction, (2) pass the instruction as an argument 
into the workers and aggregator, (3) wrap the invocation of user-defined 
functions in try-catch, and (4) call a proper shutdown method of the 
instruction (with access to the thread members) from the respective catch 
clauses.
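
The four steps above could be sketched roughly as follows. This is only a 
minimal illustration under assumptions: InstructionSketch, WorkerSketch, 
launch, shutdown, hasFailed and awaitTermination are hypothetical names, not 
existing SystemML classes or methods, and a plain Runnable stands in for the 
user-defined function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the invoking instruction (not SystemML's class).
class InstructionSketch {
    // (1) the threads are kept as members of the instruction
    private final List<Thread> threads = new ArrayList<>();
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();

    // (2) the instruction passes itself into every worker; call once per instance
    void launch(List<Runnable> userFunctions) {
        for (Runnable udf : userFunctions)
            threads.add(new Thread(new WorkerSketch(this, udf)));
        for (Thread t : threads)
            t.start();
    }

    // (4) shutdown callback: record the first error, interrupt all other threads
    void shutdown(Throwable cause) {
        firstError.compareAndSet(null, cause);
        for (Thread t : threads)
            if (t != Thread.currentThread())
                t.interrupt();
    }

    boolean hasFailed() {
        return firstError.get() != null;
    }

    // Waits for all threads and returns the first recorded error, or null.
    Throwable awaitTermination() {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return firstError.get();
    }
}

// Hypothetical worker; an aggregator thread would follow the same pattern.
class WorkerSketch implements Runnable {
    private final InstructionSketch owner;
    private final Runnable udf;

    WorkerSketch(InstructionSketch owner, Runnable udf) {
        this.owner = owner;
        this.udf = udf;
    }

    @Override
    public void run() {
        // Skip the work if another thread failed before this one got going.
        if (owner.hasFailed())
            return;
        try {
            udf.run();             // (3) user-defined function wrapped in try-catch
        } catch (Throwable e) {
            owner.shutdown(e);     // (4) callback into the instruction
        }
    }
}
```

Note that interruption only terminates threads that block in interruptible 
calls or check their interrupt flag, so a real worker loop would need to poll 
for shutdown between iterations.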

Also, once you have reworked the queuing logic, we might be able to avoid 
running the aggregator as a thread. I could imagine a design where we simply 
call execute on the parameter server, and this parameter server internally 
spawns its workers and returns as soon as all workers are done or an error 
has occurred.
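
One possible shape for such a blocking execute call is sketched below. It is 
an assumption-laden illustration, not the actual parameter server: 
ParamServerSketch and its execute signature are hypothetical, workers are 
modeled as plain Callable tasks, and the standard java.util.concurrent 
ExecutorCompletionService is used to notice the first failure.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical parameter server; names are illustrative, not SystemML's API.
class ParamServerSketch {
    // Spawns the workers internally and blocks until all of them are done.
    // If any worker fails, the remaining ones are interrupted and the
    // failure is rethrown as the cause of a RuntimeException.
    void execute(List<Callable<Void>> workers) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, workers.size()));
        ExecutorCompletionService<Void> done =
            new ExecutorCompletionService<>(pool);
        try {
            for (Callable<Void> w : workers)
                done.submit(w);
            // take() yields workers in completion order, so a failing
            // worker is seen without waiting for slower ones.
            for (int i = 0; i < workers.size(); i++)
                done.take().get();
        } catch (ExecutionException e) {
            throw new RuntimeException("worker failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting", e);
        } finally {
            pool.shutdownNow();   // interrupts any workers still running
        }
    }
}
```

With this shape the caller needs no separate aggregator thread: it either 
gets control back after a clean run or sees a single exception carrying the 
first worker error.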



was (Author: mboehm7):
Well, since both workers and aggregator execute user-defined functions we need 
to make the threads robust enough the handle errors and ensure termination. One 
approach to do that is to use a callback as follows: (1) keep the threads as 
members in the invoking instruction, (2) pass the instruction as an argument 
into the workers and aggregator, (3) wrap the invocation of user-defined 
functions into try-catch, and (4) call a proper shutdown method of the 
instruction (with access to the thread members) from the respective catch 
clauses.

Also, once you reworked the queuing logic, we might be able to avoid running 
the aggregator as a thread. I could imagine a design where we simply call 
execute on the parameter server, and this parameter server internally spawns 
its workers and simply returns whenever all workers are done or an error 
occurred.


> Local worker error handling
> ---------------------------
>
>                 Key: SYSTEMML-2349
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2349
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: LI Guobao
>            Assignee: LI Guobao
>            Priority: Major
>
> While playing around with the locking scheme of the parameter server, I 
> encountered unrelated errors that led to the parameter server hanging. We 
> need to make sure all worker errors are correctly propagated so that we can 
> guarantee termination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
