ephraimbuddy edited a comment on pull request #17846: URL: https://github.com/apache/airflow/pull/17846#issuecomment-906810368
> This doesn't handle the case where a scheduler gets SIGKILL'd or hits a power-off event.
> I have fixed that now...what do you think?
> I'm not sure I understand why having the 'mark schedulers who haven't heartbeated recently enough as failed' action happen at the `adopt_or_reset_orphaned_tasks` interval is a problem? Can you expand on that?

Sorry for the late reply. Initially, I thought that was the problem, but it's not. However, I think keeping it separate is better, and I have fixed the SIGKILL/SIGTERM cases. No extra tests were added for now.

The real problem, as it turned out, was that the processor manager kills the LocalTaskJob when it detects a zombie. That happened because, in the adoption code, the old LocalTaskJob is not marked as failed after the task is reset, which makes it a zombie. I resolved that by failing the LocalTaskJob when the task is reset. It no longer happens, but extra tests are needed; I will add them and also update the commit message.

EDIT: Just found that I had disabled `_find_zombie` and thought all was well :)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
