[ 
https://issues.apache.org/jira/browse/OFBIZ-13383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079510#comment-18079510
 ] 

Ankit Joshi commented on OFBIZ-13383:
-------------------------------------

Thanks [~jacopoc], [~nj] for your thoughts on this.

Definitely, it makes sense to keep the implementation as straightforward and 
simple as possible so that it covers the standard cases. Taking some time to 
analyze the first draft of the implementation could give us better direction 
on how to handle any complicated real-time scenarios, if observed.

For your question [~nj], I explored it a bit more: ownership duplication or 
override could be possible, though it would be a rare case. Once the initial 
draft is ready, this aspect could be explored further, for example by 
implementing a claim-then-assign step for a job to handle such rare cases.
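
One possible shape for such a claim step is an atomic compare-and-set on the job's owner, so that only the first node to claim an unowned job wins. The sketch below is in-memory only and rests on assumptions: JobClaim and tryClaim are hypothetical names, not existing OFBiz code, and a database-backed version would more likely be a conditional update on runByInstanceId.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; names are illustrative and not part of OFBiz.
final class JobClaim {
    // Current owner of one job; null means the job is unowned.
    private final AtomicReference<String> owner = new AtomicReference<>();

    // Succeeds only if the job is unowned at the moment of the call,
    // so two nodes racing for the same job cannot both win.
    boolean tryClaim(String instanceId) {
        return owner.compareAndSet(null, instanceId);
    }

    String owner() {
        return owner.get();
    }

    public static void main(String[] args) {
        JobClaim claim = new JobClaim();
        System.out.println("nodeB claimed: " + claim.tryClaim("nodeB"));
        System.out.println("nodeC claimed: " + claim.tryClaim("nodeC"));
        System.out.println("owner: " + claim.owner());
    }
}
```

With this shape, a node would assign the job to itself only after tryClaim returns true, which avoids a second node overriding an ownership that was just taken.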

[~nmalin] 

> Handle and resume stale jobs in case of application node(s) running with 
> Auto-scaling enabled
> ---------------------------------------------------------------------------------------------
>
>                 Key: OFBIZ-13383
>                 URL: https://issues.apache.org/jira/browse/OFBIZ-13383
>             Project: OFBiz
>          Issue Type: Task
>            Reporter: Ankit Joshi
>            Assignee: Ankit Joshi
>            Priority: Major
>         Attachments: Stale-Jobs-Management (1).png
>
>
> {*}Scenario{*}: 
> In an OFBiz environment running with multiple auto-scaling nodes, individual 
> nodes may be restarted, replaced, or terminated due to ASG relaunch, health 
> failure or unexpected node termination. When such a node goes down, the 
> JobSandbox.runByInstanceId still holds the old node's Instance ID, causing 
> the jobs assigned to that node to remain stuck. This prevents other active 
> nodes from re-picking the jobs because the runByInstanceId still points to 
> the inactive node.
> {*}Issue{*}:
>  * Jobs remain assigned to the old, inactive node.
>  * Jobs get stuck because no active node can pick them up while they are 
> still considered owned by the old node (runByInstanceId=oldNode).
> *Implementation Plan to address this issue*
> {*}Configuration settings{*}. (at the Service-Engine thread pool level)
>  * lease-refresh-millis -> 300000 (in milliseconds; renew every 5 minutes)
>  * lease-validation-millis -> 480000 (in milliseconds; validate every 8 
> minutes)
>  * lease-expiry-millis -> 600000 (in milliseconds; leaseUpdatedStamp should 
> not be more than 10 minutes older than now())
>  # A new custom field *leaseUpdatedStamp* will be introduced at the 
> individual job level (JobSandbox.leaseUpdatedStamp) to track the last time 
> the owning node was known to be alive and actively processing its jobs.
>  # {*}Heartbeat / Lease Renewal{*}: At a configured interval, every node will 
> update the leaseUpdatedStamp to *now()* for all the jobs it owns.
>  # {*}Lease Expiry Validation{*}: All active nodes will periodically check 
> all the running jobs to verify that their leaseUpdatedStamp is being updated 
> within the configured threshold.
>  # {*}Recover Stale Jobs{*}: During validation, if a job's leaseUpdatedStamp 
> has not been updated within the threshold, the node owning that job is 
> considered stale/terminated and all the jobs owned by that node will be 
> released.
>  # {*}Logic to release stale jobs{*}:
> Jobs in Queued or Running status that have a runByInstanceId assigned and a 
> leaseUpdatedStamp older than the expiry threshold will be moved to Pending 
> status with runByInstanceId set to null. Other active nodes will then treat 
> such job(s) as available and pick them up for execution.
> *Sample Workflow with Example*
> 10:00 AM – Node A picks Job X.
> 10:05 AM – Node A renews its lease for Job X.
> 10:10 AM – Node A renews the lease again.
> 10:11 AM – Node A crashes.
> Job Poller Recovery Process:
> 10:08 AM: Lease validation occurs. Node A is still considered active (the 
> last renewal at 10:05 AM is only 3 minutes old, below the 10-minute expiry 
> threshold).
> 10:16 AM: Lease validation runs again. Node A is still considered active 
> (the last renewal at 10:10 AM is only 6 minutes old).
> 10:24 AM: Lease validation occurs. Node A is now considered dead because its 
> last renewal (10:10 AM) is 14 minutes old, past the 10-minute expiry 
> threshold.
> The Job Poller will free the job assigned to Node A, and it will now be 
> available for another active node to pick.
> Workflow Diagram:
> Refer to Stale-Jobs-Management (1).png
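
The lease-expiry check and the release rule from the plan above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: LeaseSketch, Job, isStale and releaseIfStale are hypothetical names, the job is an in-memory object rather than a JobSandbox record, and LocalTime is used only to keep the sample times readable.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical sketch; class and method names are illustrative, not OFBiz APIs.
final class LeaseSketch {
    enum Status { PENDING, QUEUED, RUNNING }

    // Value taken from the plan: lease-expiry-millis = 600000 (10 minutes).
    static final Duration LEASE_EXPIRY = Duration.ofMillis(600_000);

    // In-memory stand-in for the three job fields the plan relies on.
    static final class Job {
        Status status;
        String runByInstanceId;
        LocalTime leaseUpdatedStamp;

        Job(Status status, String runByInstanceId, LocalTime leaseUpdatedStamp) {
            this.status = status;
            this.runByInstanceId = runByInstanceId;
            this.leaseUpdatedStamp = leaseUpdatedStamp;
        }
    }

    // A lease is stale once the last renewal is older than the expiry threshold.
    static boolean isStale(LocalTime leaseUpdatedStamp, LocalTime now) {
        return Duration.between(leaseUpdatedStamp, now).compareTo(LEASE_EXPIRY) > 0;
    }

    // Release rule: an owned Queued/Running job with a stale lease goes back
    // to Pending with no owner. Returns true if the job was released.
    static boolean releaseIfStale(Job job, LocalTime now) {
        boolean owned = job.runByInstanceId != null;
        boolean active = job.status == Status.QUEUED || job.status == Status.RUNNING;
        if (owned && active && isStale(job.leaseUpdatedStamp, now)) {
            job.status = Status.PENDING;
            job.runByInstanceId = null;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Node A's last renewal before the crash in the sample workflow.
        Job jobX = new Job(Status.RUNNING, "nodeA", LocalTime.of(10, 10));
        for (LocalTime check : new LocalTime[] {LocalTime.of(10, 16), LocalTime.of(10, 24)}) {
            System.out.println(check + " released=" + releaseIfStale(jobX, check));
        }
    }
}
```

Replaying the sample workflow, the 10:16 validation sees a 6-minute-old lease and leaves Job X alone, while the 10:24 validation sees a 14-minute-old lease and releases it.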



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
