[ 
https://issues.apache.org/jira/browse/OFBIZ-13383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081175#comment-18081175
 ] 

Ankit Joshi commented on OFBIZ-13383:
-------------------------------------

Thanks [~jacopoc] for your detailed insights on this requirement.

After analyzing this and reviewing the design notes, I agree that the race 
conditions and data-overwrite conditions are effectively addressed through 
atomic updates.
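
To make the atomic-update idea concrete, here is a minimal in-memory sketch. It is not OFBiz code: the class AtomicLeaseRelease, the JobState holder, the "PENDING"/"RUNNING" status strings and both method names are hypothetical, and an AtomicReference stands in for a JobSandbox row. The compareAndSet call plays the role of a conditional UPDATE whose WHERE clause repeats the values that were read, so a release and a concurrent heartbeat cannot overwrite each other.

```java
import java.util.concurrent.atomic.AtomicReference;

public class AtomicLeaseRelease {

    /** Immutable snapshot of the lease-related fields of one job (illustrative only). */
    static final class JobState {
        final String runByInstanceId;   // null means "not owned by any node"
        final long leaseUpdatedMillis;  // time of the last heartbeat, in epoch millis
        final String status;            // placeholder status name, not a real OFBiz status id

        JobState(String runByInstanceId, long leaseUpdatedMillis, String status) {
            this.runByInstanceId = runByInstanceId;
            this.leaseUpdatedMillis = leaseUpdatedMillis;
            this.status = status;
        }
    }

    /** Releases the job only if its lease is stale and nobody changed it since it was read. */
    static boolean releaseIfStale(AtomicReference<JobState> job, long nowMillis, long leaseExpiryMillis) {
        JobState seen = job.get();
        if (seen.runByInstanceId == null) {
            return false; // already released
        }
        if (nowMillis - seen.leaseUpdatedMillis <= leaseExpiryMillis) {
            return false; // lease is still valid
        }
        JobState released = new JobState(null, seen.leaseUpdatedMillis, "PENDING");
        // Succeeds only if the current state is still the exact snapshot read above.
        return job.compareAndSet(seen, released);
    }

    /** Heartbeat: the owning node refreshes its lease, unless ownership was lost. */
    static boolean renewLease(AtomicReference<JobState> job, String instanceId, long nowMillis) {
        JobState seen = job.get();
        if (!instanceId.equals(seen.runByInstanceId)) {
            return false; // this node no longer owns the job
        }
        return job.compareAndSet(seen, new JobState(instanceId, nowMillis, seen.status));
    }
}
```

In this sketch, if two nodes validate the same stale job at the same time, only one releaseIfStale call returns true, and a node whose job was released cannot renew the lease afterwards.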

> Handle and resume stale jobs in case of application node(s) running with 
> Auto-scaling enabled
> ---------------------------------------------------------------------------------------------
>
>                 Key: OFBIZ-13383
>                 URL: https://issues.apache.org/jira/browse/OFBIZ-13383
>             Project: OFBiz
>          Issue Type: Task
>            Reporter: Ankit Joshi
>            Assignee: Ankit Joshi
>            Priority: Major
>         Attachments: Stale-Jobs-Management (1).png
>
>
> {*}Scenario{*}: 
> In an OFBiz environment running with multiple auto-scaling nodes, individual 
> nodes may be restarted, replaced, or terminated due to ASG relaunch, health 
> failure or unexpected node termination. When such a node goes down, the 
> JobSandbox.runByInstanceId still holds the old node's Instance ID, causing 
> the jobs assigned to that node to remain stuck. This prevents other active 
> nodes from re-picking the jobs because the runByInstanceId still points to 
> the inactive node.
> {*}Issue{*}:
>  * Jobs remain assigned to the old, inactive node.
>  * Jobs get stuck because no active node can pick them up: every node treats 
> them as owned by the old node (runByInstanceId=oldNode).
> *Implementation Plan to address this issue*
> {*}Configuration settings{*} (at the Service-Engine thread-pool level)
>  * lease-refresh-millis -> 300000 (milliseconds; leases are renewed every 5 minutes)
>  * lease-validation-millis -> 480000 (milliseconds; leases are validated every 8 minutes)
>  * lease-expiry-millis -> 600000 (milliseconds; leaseUpdatedStamp must not be 
> more than 10 minutes older than now())
>  # A new custom field *leaseUpdatedStamp* will be introduced at the 
> individual job level (JobSandbox.leaseUpdatedStamp) to track the last time 
> the owning node was known to be alive and actively processing its jobs.
>  # {*}Heartbeat / Lease Renewal{*}: At a configured interval, every node will 
> update leaseUpdatedStamp to *now()* for all the jobs it owns.
>  # {*}Lease Expiry Validation{*}: All active nodes will periodically validate 
> all running jobs to check that their leaseUpdatedStamp is being updated 
> within the configured threshold.
>  # {*}Recover Stale Jobs{*}: During validation, if a job's leaseUpdatedStamp 
> has not been updated within the threshold, the node owning that job is 
> considered stale/terminated and all the jobs owned by that node will be 
> released.
>  # {*}Logic to release stale jobs{*}:
> Jobs in Queued or Running status that have a runByInstanceId assigned and 
> whose leaseUpdatedStamp is older than the expiry threshold will be moved to 
> Pending status with runByInstanceId set to null. Other active nodes will then 
> consider such job(s) available and pick them up for execution.
> *Sample Workflow with Example*
> 10:00 AM – Node A picks Job X.
> 10:05 AM – Node A renews its lease for Job X.
> 10:10 AM – Node A renews the lease again.
> 10:11 AM – Node A crashes.
> Job Poller Recovery Process:
> 10:08 AM: Lease validation occurs. Node A is still considered active (because 
> the expiry threshold hasn't been reached yet).
> 10:16 AM: Lease validation runs again. Node A is still considered active 
> (because the expiry threshold hasn't been reached yet).
> 10:24 AM: Lease validation occurs. Node A is now considered dead because its 
> last lease renewal (10:10 AM) is older than the expiry threshold (10 minutes).
> The Job Poller will free the job assigned to Node A, and it will now be 
> available for another active node to pick.
> Workflow Diagram:
> Refer to Stale-Jobs-Management (1).png
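
The lease arithmetic in the sample workflow quoted above can be checked with a small sketch. The class LeaseTimeline and the method isLeaseExpired are hypothetical names used only for illustration; the 600000 ms threshold and the clock times come from the issue description, and java.time is used just to keep the example self-contained.

```java
import java.time.Duration;
import java.time.LocalTime;

public class LeaseTimeline {

    // lease-expiry-millis from the plan: 600000 ms = 10 minutes
    static final Duration LEASE_EXPIRY = Duration.ofMillis(600_000);

    /** True when the last renewal is older than the expiry threshold at check time. */
    static boolean isLeaseExpired(LocalTime lastRenewal, LocalTime checkTime) {
        return Duration.between(lastRenewal, checkTime).compareTo(LEASE_EXPIRY) > 0;
    }

    public static void main(String[] args) {
        // Validation at 10:08 sees the 10:05 renewal (3 minutes old): prints false.
        System.out.println(isLeaseExpired(LocalTime.of(10, 5), LocalTime.of(10, 8)));
        // Validation at 10:16 sees the 10:10 renewal (6 minutes old): prints false.
        System.out.println(isLeaseExpired(LocalTime.of(10, 10), LocalTime.of(10, 16)));
        // Validation at 10:24 sees the 10:10 renewal (14 minutes old): prints true.
        System.out.println(isLeaseExpired(LocalTime.of(10, 10), LocalTime.of(10, 24)));
    }
}
```

With a 10-minute expiry and an 8-minute validation interval, a dead node's jobs would be noticed somewhere between 10 and 18 minutes after its last renewal, which matches the 14-minute gap in the example.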



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
