[
https://issues.apache.org/jira/browse/OFBIZ-13383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078880#comment-18078880
]
Ankit Joshi commented on OFBIZ-13383:
-------------------------------------
Hello [~nmalin],
Thanks for your review and for suggesting an alternative proposal for this critical
task. The idea of using an orchestrator sounds good to me, specifically for
the case where a large number of jobs are running on an individual server and
validating them one by one might result in excessive DB hits.
With this alternative approach, I am thinking of the following:
# Each instance will update its heartbeat (JobPoller.lastRunningDate) on
each poll loop (every 10 seconds).
# The orchestrator logic will also execute in each loop and scan *all
active nodes* to determine:
## which nodes are alive
## which node should take the orchestrator role
## which stale nodes need their jobs reset
# Point #2 could add DB load and Job Poller overhead, specifically
when a large number of nodes are running.
# This is a shift towards a centralized orchestrator design. Additional logic will be
needed to check for the orchestrator and elect one if none exists.
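The scan described in point #2 can be sketched in memory as below. This is only an illustration under stated assumptions: the class `HeartbeatScan`, the staleness threshold parameter, and the election rule (smallest alive instance ID wins) are hypothetical and are not part of OFBiz or of the proposal itself.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical sketch: none of these names exist in the OFBiz code base.
final class HeartbeatScan {
    /** Instance IDs that reported within the threshold, in sorted order. */
    final List<String> aliveNodes = new ArrayList<>();
    /** Instance IDs whose last heartbeat is older than the threshold. */
    final List<String> staleNodes = new ArrayList<>();

    /** Assumed election rule: the smallest alive instance ID takes the orchestrator role. */
    Optional<String> orchestrator() {
        return aliveNodes.isEmpty() ? Optional.empty() : Optional.of(aliveNodes.get(0));
    }

    /** Classifies each node as alive or stale from its last heartbeat timestamp. */
    static HeartbeatScan scan(Map<String, Instant> lastRunningDates, Instant now, Duration threshold) {
        HeartbeatScan result = new HeartbeatScan();
        // Sorted iteration keeps the outcome deterministic, so every node
        // that runs the scan on the same data elects the same orchestrator.
        for (Map.Entry<String, Instant> entry : new TreeMap<>(lastRunningDates).entrySet()) {
            Duration age = Duration.between(entry.getValue(), now);
            if (age.compareTo(threshold) > 0) {
                result.staleNodes.add(entry.getKey());
            } else {
                result.aliveNodes.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In a real deployment the map would be loaded from the database on each loop, which is where the DB load mentioned in point #3 comes from.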
Possible considerations:
# The concept of a minimum threshold configuration could still be relevant here:
it would validate whether an instance has updated its state (lastRunningDateTime)
within the given timeframe and reset its jobs if the condition fails.
# Use DB-level atomicity when updating the orchestrator and job statuses to avoid
concurrent updates.
# A custom interval (other than 10 seconds) could be configured for executing the
orchestrator logic (such as 1, 2, or 5 minutes) to reduce the frequency of DB
checks and updates.
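One possible reading of consideration #2 is a compare-and-set claim on a single orchestrator record. The sketch below models that record in memory with an `AtomicReference`; in a database this would presumably be a conditional UPDATE that matches the previously read values. The class `OrchestratorLease` and all of its members are hypothetical names used for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: an in-memory stand-in for a single database row.
final class OrchestratorLease {
    /** Immutable snapshot of the row: who holds the role and until when. */
    private static final class Holder {
        final String instanceId;
        final Instant expiresAt;

        Holder(String instanceId, Instant expiresAt) {
            this.instanceId = instanceId;
            this.expiresAt = expiresAt;
        }
    }

    private final AtomicReference<Holder> row = new AtomicReference<>(null);
    private final Duration leaseDuration;

    OrchestratorLease(Duration leaseDuration) {
        this.leaseDuration = leaseDuration;
    }

    /**
     * Claims or renews the orchestrator role. Succeeds only if the row is
     * empty, expired, or already held by this instance, and only if no other
     * node changed the row between the read and the write.
     */
    boolean tryAcquire(String instanceId, Instant now) {
        Holder current = row.get();
        boolean claimable = current == null
                || !current.expiresAt.isAfter(now)
                || current.instanceId.equals(instanceId);
        if (!claimable) {
            return false;
        }
        Holder next = new Holder(instanceId, now.plus(leaseDuration));
        // compareAndSet fails if another caller replaced the row in the meantime.
        return row.compareAndSet(current, next);
    }
}
```

The lease duration plays the same role as the configurable interval in consideration #3: a longer lease means fewer writes, at the cost of slower takeover when the orchestrator node disappears.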
Looking forward to thoughts on this.
> Handle and resume stale jobs in case of application node(s) running with
> Auto-scaling enabled
> ---------------------------------------------------------------------------------------------
>
> Key: OFBIZ-13383
> URL: https://issues.apache.org/jira/browse/OFBIZ-13383
> Project: OFBiz
> Issue Type: Task
> Reporter: Ankit Joshi
> Assignee: Ankit Joshi
> Priority: Major
> Attachments: Stale-Jobs-Management (1).png
>
>
> {*}Scenario{*}:
> In an OFBiz environment running with multiple auto-scaling nodes, individual
> nodes may be restarted, replaced, or terminated due to ASG relaunch, health
> failure or unexpected node termination. When such a node goes down, the
> JobSandbox.runByInstanceId still holds the old node's Instance ID, causing
> the jobs assigned to that node to remain stuck. This prevents other active
> nodes from re-picking the jobs because the runByInstanceId still points to
> the inactive node.
> {*}Issue{*}:
> * Jobs remain assigned to the old, inactive node.
> * Jobs get stuck because no active node can pick them up; they are still considered
> owned by the old node (runByInstanceId=oldNode).
> *Implementation Plan to address this issue*
> {*}Configuration settings{*}. (at the Service-Engine thread pool level)
> * lease-refresh-millis -> 300000 (milliseconds; refresh every 5 minutes)
> * lease-validation-millis -> 480000 (milliseconds; validate every 8 minutes)
> * lease-expiry-millis -> 600000 (milliseconds; leaseUpdatedStamp must not be more
> than 10 minutes older than now())
> # A new custom field *leaseUpdatedStamp* will be introduced at the
> individual job level (JobSandbox.leaseUpdatedStamp) to track the last
> time the owning node was known to be alive and actively processing its jobs.
> # {*}Heartbeat / Lease Renewal{*}: At a configured interval, every node will
> update the leaseUpdatedStamp to *now()* for all the jobs it owns.
> # {*}Lease Expiry Validation{*}: All active nodes will periodically validate
> all running jobs to ensure that their leaseUpdatedStamp is being updated
> within the configured threshold.
> # {*}Recover Stale Jobs{*}: During validation, if any job's
> leaseUpdatedStamp has not been updated within the threshold, the node owning that
> job is considered stale/terminated and all the jobs owned by that node
> will be released.
> # {*}Logic to release stale jobs{*}:
> Jobs in Queued or Running status that have a runByInstanceId assigned and whose
> leaseUpdatedStamp is older than the threshold will be moved to Pending status with
> runByInstanceId set to null. Other active nodes will then consider such
> job(s) available and pick them up for execution.
> *Sample Workflow with Example*
> 10:00 AM – Node A picks Job X.
> 10:05 AM – Node A renews its lease for Job X.
> 10:10 AM – Node A renews the lease again.
> 10:11 AM – Node A crashes.
> Job Poller Recovery Process:
> 10:08 AM: Lease validation occurs. Node A is still considered active (because
> the expiry threshold hasn't been reached yet).
> 10:16 AM: Lease validation runs again. Node A is still considered active
> (because the expiry threshold hasn't been reached yet).
> 10:24 AM: Lease validation occurs. Node A is now considered dead because the expiry
> threshold (10 minutes) has passed.
> The Job Poller will free the job assigned to Node A, and it will now be
> available for another active node to pick.
> Workflow Diagram:
> Refer to Stale-Jobs-Management (1).png
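The release rule and the sample timeline in the quoted description can be replayed with a small in-memory sketch. The field names `runByInstanceId` and `leaseUpdatedStamp` follow the description; the classes `LeasedJob` and `StaleJobRecovery` are hypothetical and do not exist in OFBiz. As a simplification, the check is applied per job rather than releasing every job of a node once one of its leases has expired.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical in-memory model of a JobSandbox row, for illustration only.
final class LeasedJob {
    enum Status { PENDING, QUEUED, RUNNING }

    Status status;
    String runByInstanceId;
    Instant leaseUpdatedStamp;

    LeasedJob(Status status, String runByInstanceId, Instant leaseUpdatedStamp) {
        this.status = status;
        this.runByInstanceId = runByInstanceId;
        this.leaseUpdatedStamp = leaseUpdatedStamp;
    }
}

final class StaleJobRecovery {
    /**
     * Moves every owned Queued/Running job whose lease is older than
     * leaseExpiry back to Pending and clears its owner.
     * Returns the number of jobs released.
     */
    static int releaseStaleJobs(List<LeasedJob> jobs, Instant now, Duration leaseExpiry) {
        int released = 0;
        for (LeasedJob job : jobs) {
            boolean owned = job.runByInstanceId != null
                    && (job.status == LeasedJob.Status.QUEUED
                        || job.status == LeasedJob.Status.RUNNING);
            // Assumption: a job without a lease timestamp is left untouched.
            if (!owned || job.leaseUpdatedStamp == null) {
                continue;
            }
            Duration leaseAge = Duration.between(job.leaseUpdatedStamp, now);
            if (leaseAge.compareTo(leaseExpiry) > 0) {
                job.status = LeasedJob.Status.PENDING;
                job.runByInstanceId = null;
                released++;
            }
        }
        return released;
    }
}
```

With the last renewal at 10:10 and a 10-minute expiry, a validation run at 10:16 sees a 6-minute-old lease and releases nothing, while the run at 10:24 sees a 14-minute-old lease and returns the job to Pending, matching the sample workflow.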
--
This message was sent by Atlassian Jira
(v8.20.10#820010)