brigittatitisan opened a new issue, #66713:
URL: https://github.com/apache/airflow/issues/66713

   ### Under which category would you file this issue?
   
   Task SDK
   
   ### Apache Airflow version
   
   airflow 3.0.6
   
   ### What happened and how to reproduce it?
   
   During a long-running task (wall time on the order of ~12 hours in our 
case), the task process was terminated by the supervisor after the execution 
API stopped accepting heartbeats:
   - PUT /execution/task-instances/<ti_uuid>/heartbeat returned 404 with Task 
Instance not found / reason: not_found.
   - Supervisor logged that the server indicated the task should not keep 
running, then Task killed! with process SIGKILL (exit_code=-9), 
final_state=SERVER_TERMINATED.
   - Supervisor also logged duration ~ 43214 s (~12 h) for that workload 
session.
   
   In the same log window we also saw PATCH .../task-instances/<ti_uuid>/run 
return 409 Conflict with a message along the lines of cannot start task 
instance — invalid state, previous_state=running (exact try_number in our 
export was 2; we treat that as illustrative, not the core claim—we care about 
404 + forced termination after a long running execute, regardless of attempt 
index).
   
   DAG/operator architecture: The workload is a PythonOperator subclass whose 
execute() submits work to Snowflake and then blocks inside the same task on an 
imperative wait() (implemented on a BaseSensorOperator subclass: time.sleep + 
polling). There is no separate sensor task, so one TI can stay running for many 
hours.
   
   What I need help with:
   
   1. Under what conditions does the Task Execution API return 404 on heartbeat 
for a TI that was recently updating heartbeats successfully?
   2. Whether long running tasks on Celery + Airflow 3 are expected to hit 
this, and how it relates to timeouts, TI lifecycle, or DAG run / span updates 
(we also see scheduler INFO such as Found span_status 'should_end', while 
updating state for dag_run ... on the same run).
   3. Recommended patterns for multi-hour waits (deferrable operators, separate 
sensor task, etc.) under the Task SDK.
   
   Representative log excerpts:
   `PUT /execution/task-instances/<ti_uuid>/heartbeat HTTP/1.1" 404 Not Found
   Task Instance not found        ti_id=<uuid>
   Server indicated the task shouldn't be running anymore ... 
reason='not_found' message='Task Instance not found'
   Task killed!
   Task finished [supervisor] duration=43214.24... exit_code=-9 
final_state=SERVER_TERMINATED`
   
   ### What you think should happen instead?
   
   _No response_
   
   ### Operating System
   
   _No response_
   
   ### Deployment
   
   None
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to