Let me add more on this.

On Wed, Apr 2, 2014 at 11:24 PM, Nandika Jayawardana <[email protected]> wrote:
> Hi All,
>
> BPEL processes in ODE are executed by first writing the process-initiating
> message to the database and then running a job (a separate thread) against
> it, which tries to execute the process definition against the message and
> the current process state. The ODE job scheduler loads jobs from the job
> table for execution. Usually, all jobs scheduled from a given BPS node are
> loaded and executed by that node itself. However, in a long-running
> process, where there can be waiting activities or timer jobs, some jobs
> are not loaded immediately. These jobs are scheduled against a node ID
> identifying the running ODE instance. In our current cluster
> implementation, if a node goes down, these jobs remain there until the
> node is restarted. To solve this problem, we thought of using the
> Hazelcast cluster, which is already available and in use.
>
> The ODE scheduler has a heartbeat method, which each node in the cluster
> implementation should call periodically to tell the scheduler that the
> node is alive. A separate task runs periodically to monitor the
> availability/staleness of the nodes by maintaining a list of known nodes
> and a list of nodes that have not updated their heartbeat.
>
> Job redistribution has two problems.
>
> 1. If all the nodes in the cluster try to redistribute the jobs of a stale
> node, there will be deadlocks, since these are DB operations. Hence we
> need to elect one node in the cluster to do the job redistribution. In
> Hazelcast, we can select the leader by obtaining the oldest member of the
> cluster and treating it as the leader.

Recovering a stale node's jobs is one of the scheduled jobs in ODE. The ODE developers made it start at a random time, with the intention that these jobs do not overlap in a clustered environment. But this does not guarantee it, and if two such jobs overlap, it causes deadlocks in the ODE_JOB table.
There is another scheduled job called updateJob, which assigns node IDs to near-future jobs. The same problem can happen if two updateJobs overlap. So the suggested solution is that every node checks whether it is the leader and, if it is, runs these two jobs for the whole cluster.

> 2. We need a way to implement the heartbeat method.
>
> Initially we thought of using the membership listeners available from
> Carbon clustering. However, it seems that when nodes are added/removed,
> not all nodes get notified. Hence we thought of using the periodic task
> running in the scheduler to do the member availability check and update
> the heartbeat using the Hazelcast cluster.
>
> @Hasitha - Please add anything I missed.
>
> Regards
> Nandika
>
> --
> Nandika Jayawardana
> Senior Technical Lead
> WSO2 Inc ; http://wso2.com
> lean.enterprise.middleware

Thanks,
Hasitha.

--
Hasitha Aravinda,
Software Engineer,
WSO2 Inc.
Email: [email protected]
Mobile: +94 71 8 210 200
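
To make the leader-only idea concrete, here is a minimal sketch in plain Java. It is not ODE or Hazelcast code: the class LeaderOnlyJobs and its methods are hypothetical names, and the member list is a stand-in for whatever the cluster reports. It assumes the list is ordered by join time, oldest first, which is how I understand Hazelcast's Cluster.getMembers() to order its result; that should be checked against the Hazelcast version in use.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of ODE or Hazelcast.
final class LeaderOnlyJobs {

    // membersOldestFirst: node IDs ordered by join time, oldest first (assumption).
    // The oldest member is treated as the leader.
    static boolean isLeader(List<String> membersOldestFirst, String localNodeId) {
        return !membersOldestFirst.isEmpty()
                && membersOldestFirst.get(0).equals(localNodeId);
    }

    // Runs the cluster-wide job only when the local node is the leader.
    // Returns true if the job was run on this node.
    static boolean runIfLeader(List<String> membersOldestFirst,
                               String localNodeId,
                               Runnable clusterWideJob) {
        if (!isLeader(membersOldestFirst, localNodeId)) {
            return false; // another node is responsible for this job
        }
        clusterWideJob.run();
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for the cluster view; node-a joined first.
        List<String> members = Arrays.asList("node-a", "node-b", "node-c");
        for (String node : members) {
            boolean ran = runIfLeader(members, node,
                    () -> System.out.println("stale-node recovery and updateJob run on " + node));
            System.out.println(node + " ran cluster-wide jobs: " + ran);
        }
    }
}
```

Each node would make this check at the start of every scheduled run rather than once at startup, so that when the oldest member leaves, the next-oldest member picks up the two jobs on its next run.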
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
