[ 
https://issues.apache.org/jira/browse/OFBIZ-13383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079542#comment-18079542
 ] 

Jacopo Cappellato commented on OFBIZ-13383:
-------------------------------------------

[~ankit.joshi], [~nj], in the initial design I shared with you, which is the 
basis of this ticket, race conditions are prevented by using atomic 
update/select operations. [~ankit.joshi], if you have found other cases that 
could lead to data overrides, please share them here so that they can be 
addressed in the design.

For your reference, I am sharing my initial design notes here, although some 
minor details differ from the final version of the design.

======================================

The idea is to implement a *lease* and *heartbeat* pattern leveraging the 
following JobSandbox fields:

runByInstanceId: who currently owns the lease
lastUpdatedStamp: when the lease was last renewed

A job is considered available if (pseudocode):
{code:sql}
runByInstanceId IS NULL
OR lastUpdatedStamp < NOW() - INTERVAL 'X minutes'
{code}
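As a minimal sketch of how this availability test could be combined with an atomic claim, the Java below models JobSandbox rows in memory. LeaseTable, Lease, tryClaim and ownerOf are hypothetical names used only for illustration, not OFBiz code. Against a real database the same effect would normally come from a single UPDATE whose WHERE clause repeats the condition above, followed by a check of the affected row count.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for JobSandbox rows; illustrative only, not OFBiz API. */
final class LeaseTable {
    /** Hypothetical immutable row holding the two lease fields. */
    static final class Lease {
        final String runByInstanceId;   // null means nobody owns the job
        final Instant lastUpdatedStamp; // when the lease was last renewed

        Lease(String runByInstanceId, Instant lastUpdatedStamp) {
            this.runByInstanceId = runByInstanceId;
            this.lastUpdatedStamp = lastUpdatedStamp;
        }
    }

    private final Map<String, Lease> jobs = new ConcurrentHashMap<>();
    private final Duration timeout; // the "X minutes" of the condition

    LeaseTable(Duration timeout) {
        this.timeout = timeout;
    }

    void addJob(String jobId, Instant now) {
        jobs.put(jobId, new Lease(null, now));
    }

    /** Mirrors: runByInstanceId IS NULL OR lastUpdatedStamp < NOW() - X. */
    boolean isAvailable(Lease lease, Instant now) {
        return lease.runByInstanceId == null
                || lease.lastUpdatedStamp.isBefore(now.minus(timeout));
    }

    /** Tests and claims in one atomic step, so two instances cannot both win. */
    boolean tryClaim(String jobId, String instanceId, Instant now) {
        boolean[] claimed = {false};
        jobs.computeIfPresent(jobId, (id, lease) -> {
            if (!isAvailable(lease, now)) {
                return lease; // still owned by a live instance: leave untouched
            }
            claimed[0] = true;
            return new Lease(instanceId, now);
        });
        return claimed[0];
    }

    String ownerOf(String jobId) {
        Lease lease = jobs.get(jobId);
        return lease == null ? null : lease.runByInstanceId;
    }
}
```

The important property is that the test and the write happen together: a second instance that evaluates the condition after the first claim sees a fresh lease and backs off.
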
Each instance has two responsibilities:

1. Heartbeat (for jobs it owns)
Every N seconds (e.g. 60–180s), the instance updates all jobs it is currently 
processing:
{code:sql}
UPDATE JobSandbox
SET lastUpdatedStamp = NOW()
WHERE runByInstanceId = <instance_id>;
{code}
This keeps the lease alive.
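
A rough in-memory equivalent of the heartbeat statement, written as Java to show which rows it touches. HeartbeatSketch and Row are hypothetical names, and the map stands in for the JobSandbox table; this is a sketch of the semantics, not the actual OFBiz implementation.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative model of the heartbeat UPDATE; not OFBiz API. */
final class HeartbeatSketch {
    /** Hypothetical mutable row with the two lease fields. */
    static final class Row {
        String runByInstanceId;   // null means unowned
        Instant lastUpdatedStamp;

        Row(String runByInstanceId, Instant lastUpdatedStamp) {
            this.runByInstanceId = runByInstanceId;
            this.lastUpdatedStamp = lastUpdatedStamp;
        }
    }

    /**
     * Renews the lease on every job owned by instanceId and returns the
     * number of rows touched, like the row count of the UPDATE.
     */
    static int heartbeat(Map<String, Row> jobSandbox, String instanceId, Instant now) {
        int updated = 0;
        for (Row row : jobSandbox.values()) {
            if (instanceId.equals(row.runByInstanceId)) {
                row.lastUpdatedStamp = now;
                updated++;
            }
        }
        return updated;
    }
}
```

Rows owned by other instances, and unowned rows, keep their old timestamp, so a heartbeat from one node never extends another node's lease.
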

2. Cleanup (reclaim stale jobs from all instances)
Every M minutes (e.g. 10–20 minutes), the instance also runs:
{code:sql}
UPDATE JobSandbox
SET runByInstanceId = NULL
WHERE runByInstanceId IS NOT NULL
  AND lastUpdatedStamp < NOW() - INTERVAL '<timeout> minutes';
{code}
This frees jobs from dead instances. All instances can run this safely because 
it is an idempotent operation and no coordination between instances is needed.
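
To make the idempotency argument concrete, here is a small Java sketch of the cleanup applied to an in-memory list of rows. CleanupSketch, Row and releaseStale are hypothetical names chosen for this example and the list stands in for the JobSandbox table.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Illustrative model of the cleanup UPDATE; not OFBiz API. */
final class CleanupSketch {
    /** Hypothetical mutable row with the two lease fields. */
    static final class Row {
        String runByInstanceId;   // null means unowned
        Instant lastUpdatedStamp;

        Row(String runByInstanceId, Instant lastUpdatedStamp) {
            this.runByInstanceId = runByInstanceId;
            this.lastUpdatedStamp = lastUpdatedStamp;
        }
    }

    /**
     * Clears the owner of every job whose lease is older than the timeout
     * and returns how many rows were released.
     */
    static int releaseStale(List<Row> jobSandbox, Instant now, Duration timeout) {
        Instant cutoff = now.minus(timeout);
        int released = 0;
        for (Row row : jobSandbox) {
            if (row.runByInstanceId != null && row.lastUpdatedStamp.isBefore(cutoff)) {
                row.runByInstanceId = null;
                released++;
            }
        }
        return released;
    }
}
```

Running releaseStale a second time with the same arguments releases nothing, because the rows it matched the first time no longer have an owner. That is why several instances can run the cleanup without coordinating.
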

For simplicity, these two tasks, Heartbeat and Cleanup, can be integrated into 
the main loop of the OFBiz job poller/manager.
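
One possible way to hook both tasks into a polling loop is sketched below in Java. LeaseMaintenance and tick are hypothetical names, and the two Runnable arguments stand in for the heartbeat and cleanup statements; how the actual OFBiz poller is structured is not assumed here.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative scheduler for the two lease tasks; not OFBiz API. */
final class LeaseMaintenance {
    private final Duration heartbeatEvery; // the "N seconds" interval
    private final Duration cleanupEvery;   // the "M minutes" interval
    private final Runnable heartbeat;
    private final Runnable cleanup;
    private Instant lastHeartbeat; // null until the first run
    private Instant lastCleanup;   // null until the first run

    LeaseMaintenance(Duration heartbeatEvery, Duration cleanupEvery,
                     Runnable heartbeat, Runnable cleanup) {
        this.heartbeatEvery = heartbeatEvery;
        this.cleanupEvery = cleanupEvery;
        this.heartbeat = heartbeat;
        this.cleanup = cleanup;
    }

    /** Called once per poller iteration; runs each task only when it is due. */
    void tick(Instant now) {
        if (lastHeartbeat == null || !now.isBefore(lastHeartbeat.plus(heartbeatEvery))) {
            heartbeat.run();
            lastHeartbeat = now;
        }
        if (lastCleanup == null || !now.isBefore(lastCleanup.plus(cleanupEvery))) {
            cleanup.run();
            lastCleanup = now;
        }
    }
}
```

The poller would call tick(Instant.now()) on each pass. Passing the clock value in, rather than reading it inside tick, keeps the timing logic easy to test.
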

> Handle and resume stale jobs in case of application node(s) running with 
> Auto-scaling enabled
> ---------------------------------------------------------------------------------------------
>
>                 Key: OFBIZ-13383
>                 URL: https://issues.apache.org/jira/browse/OFBIZ-13383
>             Project: OFBiz
>          Issue Type: Task
>            Reporter: Ankit Joshi
>            Assignee: Ankit Joshi
>            Priority: Major
>         Attachments: Stale-Jobs-Management (1).png
>
>
> {*}Scenario{*}: 
> In an OFBiz environment running with multiple auto-scaling nodes, individual 
> nodes may be restarted, replaced, or terminated due to ASG relaunch, health 
> failure or unexpected node termination. When such a node goes down, the 
> JobSandbox.runByInstanceId still holds the old node's Instance ID, causing 
> the jobs assigned to that node to remain stuck. This prevents other active 
> nodes from re-picking the jobs because the runByInstanceId still points to 
> the inactive node.
> {*}Issue{*}:
>  * Jobs remain assigned to the old, inactive node.
>  * Jobs stay stuck because no active node can pick them up: each one is still 
> considered owned by the old node (runByInstanceId=oldNode).
> *Implementation Plan to address this issue*
> {*}Configuration settings{*}. (at the Service-Engine thread pool level)
>  * lease-refresh-millis -> 300000 (milliseconds; renew the lease every 5 minutes)
>  * lease-validation-millis -> 480000 (milliseconds; validate leases every 8 minutes)
>  * lease-expiry-millis -> 600000 (milliseconds; a lease is expired once 
> leaseUpdatedStamp is more than 10 minutes older than now())
>  # A new custom field *leaseUpdatedStamp* will be introduced at the 
> individual job level (JobSandbox.leaseUpdatedStamp) to track the last time 
> the owning node was known to be alive and actively processing its jobs.
>  # {*}Heartbeat / Lease Renewal{*}: At a configured interval, every node will 
> update the leaseUpdatedStamp to *now()* for all the jobs it owns.
>  # {*}Lease Expiry Validation{*}: All active nodes will periodically validate 
> all the running jobs to check that their leaseUpdatedStamp has been updated 
> within the configured threshold.
>  # {*}Recover Stale Jobs{*}: During validation, if a job's leaseUpdatedStamp 
> has not been updated within the threshold, the node owning that job is 
> considered stale/terminated and all the jobs owned by that node will be 
> released.
>  # {*}Logic to release stale jobs{*}:
> Jobs in Queued or Running status that have a runByInstanceId assigned and a 
> leaseUpdatedStamp older than the threshold will be moved to Pending status 
> with runByInstanceId set to null. Other active nodes will then consider such 
> job(s) available and pick them up for execution.
> *Sample Workflow with Example*
> 10:00 AM – Node A picks Job X.
> 10:05 AM – Node A renews its lease for Job X.
> 10:10 AM – Node A renews the lease again.
> 10:11 AM – Node A crashes.
> Job Poller Recovery Process:
> 10:08 AM: Lease validation occurs. Node A is still considered active (because 
> the expiry threshold hasn't been reached yet).
> 10:16 AM: Lease validation runs again. Node A is still considered active 
> (because the expiry threshold hasn't been reached yet).
> 10:24 AM: Lease validation occurs. Node A is now considered dead because more 
> than the expiry threshold (10 minutes) has passed since its last renewal at 
> 10:10 AM.
> The Job Poller will free the job assigned to Node A, and it will now be 
> available for another active node to pick.
> Workflow Diagram:
> Refer to Stale-Jobs-Management (1).png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
