kcphila commented on issue #17507:
URL: https://github.com/apache/airflow/issues/17507#issuecomment-1162252733

   Hi all,
   
   I am experiencing this on 2.3.2 with LocalExecutor (4 schedulers), Postgres, 
and Ubuntu 22.04. 
   
   This is, however, a clone of our staging environment, whose DAGs run fine 
on 2.1.4 and Ubuntu 16.04.  I'm also running on a much smaller and less 
powerful instance, which may be exacerbating race conditions.
   
   I did some investigation into the process state, and when this error leads 
to a failure, this is what I see in process executions:
   
   - The Scheduler task is the root of everything, as you'd expect (`airflow 
scheduler`)
   - `recorded_pid` is normally the taskinstance pid (`ti.pid`), or the parent 
of the taskinstance pid (`psutil.Process(ti.pid).ppid()`) when RUN_AS_USER is 
set.  When failing, this consistently shows up as the worker 
(`worker -- LocalExecutor`), which is a persistent, long-lived process. 
   - The child of the *recorded_pid* is the pid of the current process (as 
reported by `os.getpid()`), which is the airflow task supervisor. This (and 
everything below) is one of the short term task-specific processes.
   - The `current_pid` can be different things, but always appears to be the 
child of the task supervisor / current pid.  It is often a fleeting process, 
as I can barely catch a record of it when I'm trying to fetch a snapshot.  
Here are a couple that I have seen:
      - In some cases, I have seen this as the task runner's pid - `airflow 
tasks run [taskname]` 
      - I have also seen this as `airflow task su`; the tasks are 
RUN_AS_USER, so this is likely related.
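   
   To make the hierarchy above concrete, here is a minimal sketch that walks a 
process snapshot from the task runner up to the scheduler. The pids, the 
`SNAPSHOT` table, and the `ancestry` helper are all made up for illustration; 
this is not Airflow code, just a model of what I see in the process list.

```python
# Hypothetical snapshot: pid -> (parent pid, command line). All pids are made up.
SNAPSHOT = {
    50: (1, "airflow scheduler"),
    100: (50, "worker -- LocalExecutor"),         # shows up as recorded_pid on failure
    200: (100, "airflow task supervisor"),        # os.getpid() inside the heartbeat
    300: (200, "airflow tasks run [taskname]"),   # one of the observed current_pid values
}


def ancestry(pid, snapshot):
    """Return the command lines from `pid` up to the top of the snapshot, child first."""
    chain = []
    while pid in snapshot:
        ppid, cmd = snapshot[pid]
        chain.append(cmd)
        pid = ppid
    return chain


if __name__ == "__main__":
    for depth, cmd in enumerate(ancestry(300, SNAPSHOT)):
        print("  " * depth + cmd)
```

In this model `recorded_pid` (100) is two levels above `current_pid` (300), so 
the two can never be equal.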
   
   I came to wonder about this, since the error happens because (a) the final 
`recorded_pid` is not None and (b) `recorded_pid` != `current_pid`. It doesn't 
make much sense to ever be comparing against the Task Instance pid, since that's 
hanging around for a very long time and the heartbeat function appears to be 
identifying when the current task runner is zombified or missing.
   
   As I've investigated further, I've found that when RUN_AS_USER tasks fail 
this way, `ti.pid` is almost invariably `None`, which means `recorded_pid` 
comes in as `psutil.Process(None).ppid()`. Since psutil treats a pid of `None` 
as the current process, that will be the parent of the current process. I am 
currently under the impression that this was not intended, and that the error 
condition should only be tested when `ti.pid is not None`, instead of 
`recorded_pid is not None`.  
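   
   As a rough sketch of the difference between the two guards (a simplified 
model, not the actual Airflow source): `pid_mismatch`, `fake_parent_of`, and 
the pids below are hypothetical names and values I made up, and 
`fake_parent_of` only stands in for `psutil.Process(pid).ppid()`.

```python
from typing import Callable, Optional


def pid_mismatch(
    ti_pid: Optional[int],
    current_pid: int,
    run_as_user: bool,
    parent_of: Callable[[Optional[int]], int],
    guard_on_ti_pid: bool,
) -> bool:
    """Return True when this simplified check would report a PID mismatch."""
    recorded_pid = ti_pid
    if run_as_user:
        # With ti_pid None this resolves to the parent of the *current* process.
        recorded_pid = parent_of(ti_pid)
    # Existing guard looks at recorded_pid; the proposed guard looks at ti_pid.
    guard = ti_pid if guard_on_ti_pid else recorded_pid
    return guard is not None and recorded_pid != current_pid


# Made-up pids: worker 100 -> supervisor 200 (os.getpid()) -> runner 300 -> task 400.
PARENTS = {200: 100, 300: 200, 400: 300}
SUPERVISOR_PID = 200


def fake_parent_of(pid: Optional[int]) -> int:
    # Stand-in for psutil.Process(pid).ppid(); None means "the current process".
    return PARENTS[SUPERVISOR_PID if pid is None else pid]


if __name__ == "__main__":
    # ti.pid is None on a RUN_AS_USER task: the existing guard fires, the proposed one does not.
    print(pid_mismatch(None, 300, True, fake_parent_of, guard_on_ti_pid=False))
    print(pid_mismatch(None, 300, True, fake_parent_of, guard_on_ti_pid=True))
```

When `ti.pid` is actually set, both guards behave the same in this model, so 
the change only affects the `ti.pid is None` case.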
   
   I'm testing this change right now and it seems to work; if it holds up, 
I'll put in a PR.
   

