Thanks, Gil, for sharing your experience with this issue; it confirms how relevant the problem is and how frequently it is encountered.
I've created OFBIZ-13383 <https://issues.apache.org/jira/browse/OFBIZ-13383>
to share the implementation plan and related details there.

Thanks & Regards,
Ankit Joshi

On Fri, Apr 10, 2026 at 7:18 PM gil.portenseigne <[email protected]> wrote:

> Hello Ankit,
>
> We also met this issue and currently solve it using the Kubernetes
> pre-stop feature: before a pod stops, a shell script and some SQL clean
> up everything that is still running.
>
> It is effective for the most part, but it sometimes happens that an
> instance picks up a new job right at the end of the pre-stop script.
>
> We added a way for our instances to ask which pod ids are currently
> running, in order to clean up those remaining jobs. Now everything
> works, but it is not so clean.
>
> I'm pleased to read your ideas for solving this issue, and I think they
> go in the right direction.
>
> Nice one, thanks!
>
> [...]
>
> > As a *next* step, I think the out-of-the-box Job Poller should
> > *itself* be able to validate and handle such stale jobs and re-assign
> > them to other active nodes for further processing. For this, I propose
> > that a *Lease + Heartbeat based job ownership* approach could be
> > helpful here. This validation method includes 3 steps:
> >
> > *#1 Assigning the node as the Job owner*
> > -- Assign the individual node identifier (instance-id) as the owner of
> > all jobs it is running (*runByInstanceId*), along with a new custom
> > field (*JobSandbox.leaseUpdatedStamp*) that helps the Job Poller track
> > the last time the lease was updated by the node, confirming the node
> > was still active at that time.
> >
> > *#2 Heartbeat / Lease Renewal*
> > -- At a configured interval, the Job Poller running on each node
> > updates the lease timestamp for the open/in-progress jobs that the
> > node currently owns.
> >
> > *#3 Lease Expiry Validation*
> > -- The Job Poller running on each active node also periodically
> > validates whether all the jobs it owns are actively updating their
> > heartbeat within the specified threshold. Any job that fails to update
> > its heartbeat within that threshold is considered owned by a stale
> > node and becomes eligible for recovery. The Job Poller releases such
> > stale jobs, making them available for other active nodes to pick up.
> >
> > *Proposed time frequencies/intervals:*
> > -- *Lease update interval*: every 5 minutes
> > -- *Lease expiry threshold*: 10 minutes
> > -- *Lease expiry validation*: every 8 minutes
> >
> > *Points to consider*
> > -- Each node should have a unique node identifier (runByInstanceId)
> > that helps to track/validate the aliveness of each individual node.
> > -- The time intervals suggested above could also be made configurable
> > via data.
> >
> > Looking forward to your valuable thoughts on it. I'll create a Jira
> > ticket for this and will update the details there according to the
> > inputs.
> >
> > Thanks & Regards,
> > Ankit Joshi
>
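For anyone following along, the three steps quoted above can be sketched as a small in-memory Java simulation. Only the field names runByInstanceId and JobSandbox.leaseUpdatedStamp come from the proposal; the class, the renewLeases/recoverStaleJobs helpers, and the hard-coded 10-minute threshold are illustrative assumptions, not the actual OFBiz implementation (which would operate on the JobSandbox entity via the entity engine):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative in-memory sketch of lease + heartbeat job ownership.
 *  Field names runByInstanceId and leaseUpdatedStamp follow the
 *  proposal; everything else here is a stand-in. */
public class LeaseSketch {

    static class Job {
        String runByInstanceId;    // owning node id, null once released
        Instant leaseUpdatedStamp; // last heartbeat written by the owner

        Job(String owner, Instant lease) {
            runByInstanceId = owner;
            leaseUpdatedStamp = lease;
        }
    }

    // Proposed lease expiry threshold (assumed fixed here; configurable in practice).
    static final Duration EXPIRY_THRESHOLD = Duration.ofMinutes(10);

    /** Step #2: the owning node renews the lease on every job it holds. */
    static void renewLeases(Map<String, Job> jobs, String instanceId, Instant now) {
        for (Job j : jobs.values()) {
            if (instanceId.equals(j.runByInstanceId)) {
                j.leaseUpdatedStamp = now;
            }
        }
    }

    /** Step #3: release jobs whose lease was not renewed within the
     *  threshold, making them available for other active nodes.
     *  Returns the number of stale jobs recovered. */
    static int recoverStaleJobs(Map<String, Job> jobs, Instant now) {
        int recovered = 0;
        for (Job j : jobs.values()) {
            if (j.runByInstanceId != null
                    && j.leaseUpdatedStamp.isBefore(now.minus(EXPIRY_THRESHOLD))) {
                j.runByInstanceId = null; // release ownership for re-pickup
                recovered++;
            }
        }
        return recovered;
    }
}
```

The point of the sketch is the ordering guarantee: with a 5-minute renewal interval and a 10-minute expiry threshold, a healthy node always renews well inside the threshold, so only jobs held by a dead node cross it and get released.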
