Thanks, Gil, for sharing your experience with this issue; it confirms how relevant the problem is and how frequently it is encountered.
I've created OFBIZ-13383 <https://issues.apache.org/jira/browse/OFBIZ-13383>
to share the implementation plan and related details there.

Thanks & Regards,
Ankit Joshi

On Fri, Apr 10, 2026 at 7:18 PM gil.portenseigne <[email protected]> wrote:

> Hello Ankit,
>
> We also met this issue and currently solve it using the Kubernetes
> pre-stop feature: before a pod stops, a shell script and some SQL clean
> up everything that is still running.
>
> It is effective for the most part, but it sometimes happens that an
> instance picks up a new job right at the end of the pre-stop script.
>
> We added a way for our instances to ask which pod ids are currently
> running, in order to clean up those remaining jobs. Now everything
> works, but it is not so clean.
>
> I'm pleased to read your ideas for solving this issue, and I think they
> go in the right direction.
>
> Nice one, thanks!
>
> [...]
>
> > As a *next* step, I think the out-of-the-box Job Poller should
> > *itself* be able to validate and handle such stale jobs and re-assign
> > them to other active nodes for further processing. For this, I propose
> > that a *Lease + Heartbeat based job ownership* approach could be
> > helpful here. This validation method includes 3 steps:
> >
> > *#1 Assigning the node as the Job owner*
> > -- Assign the individual node identifier (instance-id) as the owner of
> > all jobs it is running (*runByInstanceId*), along with a new custom
> > field (*JobSandbox.leaseUpdatedStamp*) that helps the Job Poller track
> > the last time the lease was updated by the node, confirming the node
> > was still active at that time.
> >
> > *#2 Heartbeat / Lease Renewal*
> > -- At a configured interval, the Job Poller running on each node
> > updates the lease timestamp for the open/in-progress jobs that the
> > node currently owns.
> >
> > *#3 Lease Expiry Validation*
> > -- The Job Poller running on each active node also periodically
> > validates whether all the jobs it owns are actively updating their
> > heartbeat within the specified threshold. Any job that fails to update
> > its heartbeat within that threshold is considered owned by a stale
> > node and becomes eligible for recovery. The Job Poller releases such
> > stale jobs, making them available for other active nodes to pick up.
> >
> > *Proposed time frequencies/intervals:*
> > -- *Lease update interval*: every 5 minutes
> > -- *Lease expiry threshold*: 10 minutes
> > -- *Lease expiry validation*: every 8 minutes
> >
> > *Points to consider*
> > -- Each node should have a unique node identifier (runByInstanceId)
> > that helps to track/validate the aliveness of each individual node.
> > -- The time intervals suggested above could also be made configurable
> > via data.
> >
> > Looking forward to your valuable thoughts on it. I'll create a Jira
> > ticket for this and will update the details there according to the
> > inputs.
> >
> > Thanks & Regards,
> > Ankit Joshi
>
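For anyone following along, the three steps quoted above can be sketched as a small in-memory Java simulation. Only the field names runByInstanceId and JobSandbox.leaseUpdatedStamp come from the proposal; the class, the renewLeases/recoverStaleJobs helpers, and the hard-coded 10-minute threshold are illustrative assumptions, not the actual OFBiz implementation (which would operate on the JobSandbox entity via the entity engine):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative in-memory sketch of lease + heartbeat job ownership.
 *  Field names runByInstanceId and leaseUpdatedStamp follow the
 *  proposal; everything else here is a stand-in. */
public class LeaseSketch {

    static class Job {
        String runByInstanceId;    // owning node id, null once released
        Instant leaseUpdatedStamp; // last heartbeat written by the owner

        Job(String owner, Instant lease) {
            runByInstanceId = owner;
            leaseUpdatedStamp = lease;
        }
    }

    // Proposed lease expiry threshold (assumed fixed here; configurable in practice).
    static final Duration EXPIRY_THRESHOLD = Duration.ofMinutes(10);

    /** Step #2: the owning node renews the lease on every job it holds. */
    static void renewLeases(Map<String, Job> jobs, String instanceId, Instant now) {
        for (Job j : jobs.values()) {
            if (instanceId.equals(j.runByInstanceId)) {
                j.leaseUpdatedStamp = now;
            }
        }
    }

    /** Step #3: release jobs whose lease was not renewed within the
     *  threshold, making them available for other active nodes.
     *  Returns the number of stale jobs recovered. */
    static int recoverStaleJobs(Map<String, Job> jobs, Instant now) {
        int recovered = 0;
        for (Job j : jobs.values()) {
            if (j.runByInstanceId != null
                    && j.leaseUpdatedStamp.isBefore(now.minus(EXPIRY_THRESHOLD))) {
                j.runByInstanceId = null; // release ownership for re-pickup
                recovered++;
            }
        }
        return recovered;
    }
}
```

The point of the sketch is the ordering guarantee: with a 5-minute renewal interval and a 10-minute expiry threshold, a healthy node always renews well inside the threshold, so only jobs held by a dead node cross it and get released.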
