ephraimbuddy commented on pull request #17846:
URL: https://github.com/apache/airflow/pull/17846#issuecomment-906810368


   > This doesn't handle the case where a scheduler gets SIGKILL'd or hits a
   > power-off event.
   
   I have fixed that now...what do you think?
   
   > I'm not sure I understand why having the 'mark schedulers who haven't
   > heartbeated recently enough as failed' action happen at the
   > `adopt_or_reset_orphaned_tasks` interval is a problem? Can you expand on that?
   
   Sorry for the late reply. Initially I thought that was the problem, but
   it is not. I still think keeping it separate is better, and I have fixed
   the SIGKILL/SIGTERM cases. No extra tests have been added for now.
   
   The real problem, as it turned out, was that the processor manager kills
   the LocalTaskJob when it detects a zombie. That happened because the
   adoption code did not mark the old LocalTaskJob as failed after resetting
   the task, which made it a zombie.
   I resolved that by failing the LocalTaskJob when the task is reset. It no
   longer happens, but extra tests are needed. I will add them and also
   update the commit message.
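
   To illustrate the idea, here is a toy sketch. It is not the code in this
   PR and not Airflow's real models: `Job`, `TaskInstance`, `reset_task` and
   `stale_running_jobs` are hypothetical names, and the zombie check is
   reduced to "job still marked running with a stale heartbeat".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


# Hypothetical stand-ins for illustration only, not Airflow's real classes.
@dataclass
class Job:
    id: int
    state: str  # "running" or "failed"
    latest_heartbeat: datetime


@dataclass
class TaskInstance:
    task_id: str
    state: Optional[str]  # "running", or None once the task has been reset
    job: Job


def reset_task(ti: TaskInstance, fail_job: bool) -> None:
    """Reset an orphaned task; optionally fail the job that was running it."""
    ti.state = None
    if fail_job:
        # The fix described above: do not leave the old job marked as running.
        ti.job.state = "failed"


def stale_running_jobs(
    jobs: List[Job], now: datetime, threshold: timedelta
) -> List[Job]:
    """Simplified zombie check: running jobs whose heartbeat is too old."""
    return [
        job
        for job in jobs
        if job.state == "running" and now - job.latest_heartbeat > threshold
    ]


now = datetime(2021, 8, 26, 12, 0, 0)
old_job = Job(id=1, state="running", latest_heartbeat=now - timedelta(minutes=10))
ti = TaskInstance(task_id="t1", state="running", job=old_job)

# Before the fix: the task is reset, but the old job still looks alive and stale.
reset_task(ti, fail_job=False)
zombies_before = stale_running_jobs([old_job], now, timedelta(minutes=5))

# After the fix: the old job is failed together with the reset.
reset_task(ti, fail_job=True)
zombies_after = stale_running_jobs([old_job], now, timedelta(minutes=5))
```

   In this sketch `zombies_before` still contains the old job, while
   `zombies_after` is empty, which is the behaviour the fix is aiming for.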
   
   



