shorrocka opened a new issue, #51448:
URL: https://github.com/apache/airflow/issues/51448

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.5
   
   ### What happened?
   
   Periodically, about every other day, our Airflow scheduler crashes after a DAG run with the following error:
   ```
   Process ForkProcess-35:
   Traceback (most recent call last):
     File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
       self.run()
     File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
       self._target(*self._args, **self._kwargs)
     File "/data/apache-airflow/lib64/python3.12/site-packages/airflow/dag_processing/manager.py", line 247, in _run_processor_manager
       processor_manager.start()
     File "/data/apache-airflow/lib64/python3.12/site-packages/airflow/dag_processing/manager.py", line 489, in start
       return self._run_parsing_loop()
              ^^^^^^^^^^^^^^^^^^^^^^^^
     File "/data/apache-airflow/lib64/python3.12/site-packages/airflow/dag_processing/manager.py", line 616, in _run_parsing_loop
       self._collect_results_from_processor(processor)
     File "/data/apache-airflow/lib64/python3.12/site-packages/airflow/dag_processing/manager.py", line 1143, in _collect_results_from_processor
       if processor.result is not None:
          ^^^^^^^^^^^^^^^^
     File "/data/apache-airflow/lib64/python3.12/site-packages/airflow/dag_processing/processor.py", line 379, in result
       raise AirflowException("Tried to get the result before it's done!")
   airflow.exceptions.AirflowException: Tried to get the result before it's done!
   ```
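
   For what it's worth, the shape of the failure looks like a forked child dying before it can hand a result back to its parent. Below is a hypothetical, minimal sketch of that pattern using plain `multiprocessing` (this is not Airflow code, just an illustration of the failure mode):

   ```python
   import ctypes
   import multiprocessing

   def child(conn):
       # Simulate a DAG-processor child crashing before it reports back.
       ctypes.string_at(0)            # NULL dereference -> SIGSEGV
       conn.send("never reached")

   if __name__ == "__main__":
       ctx = multiprocessing.get_context("fork")   # the traceback shows ForkProcess workers
       parent_conn, child_conn = ctx.Pipe()
       proc = ctx.Process(target=child, args=(child_conn,))
       proc.start()
       proc.join()
       print("exitcode:", proc.exitcode)          # -11 == terminated by SIGSEGV
       print("got result:", parent_conn.poll())   # False: the child never sent one
   ```

   Running that prints an exit code of -11 and shows the parent never receives anything, which matches the `exit code -11` re-launch loop and the missing processor result in the logs below.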
   
   This happens directly after seemingly normal scheduler events, right as a DAG finishes; here are the immediately preceding log events:
   
   ```
   [2025-06-05T10:01:04.202-0400] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=restore_study_schema, task_id=create_schema, run_id=manual__2025-06-05T14:00:39.494709+00:00, map_index=-1, run_start_date=2025-06-05 14:01:03.157949+00:00, run_end_date=2025-06-05 14:01:03.614029+00:00, run_duration=0.45608, state=up_for_retry, executor=LocalExecutor(parallelism=32), executor_state=success, try_number=3, max_tries=5, job_id=924122, pool=default_pool, queue=default, priority_weight=4, operator=MySQLExecuteQueryOperator, queued_dttm=2025-06-05 14:01:02.534776+00:00, queued_by_job_id=922565, pid=558903
   [2025-06-05T10:01:04.202-0400] {scheduler_job_runner.py:813} INFO - TaskInstance Finished: dag_id=restore_study_schema, task_id=create_schema, run_id=manual__2025-06-05T14:00:39.495018+00:00, map_index=-1, run_start_date=2025-06-05 14:01:03.157949+00:00, run_end_date=2025-06-05 14:01:03.613053+00:00, run_duration=0.455104, state=up_for_retry, executor=LocalExecutor(parallelism=32), executor_state=success, try_number=3, max_tries=5, job_id=924124, pool=default_pool, queue=default, priority_weight=4, operator=MySQLExecuteQueryOperator, queued_dttm=2025-06-05 14:01:02.534776+00:00, queued_by_job_id=922565, pid=558902
   ```
   
   Then we get a whole bunch of the following:
   
   ```
   [2025-06-05T10:01:16.606-0400] {scheduler_job_runner.py:922} ERROR - Executor LocalExecutor(parallelism=32) reported that the task instance <TaskInstance: restore_study_schema.create_schema manual__2025-06-05T14:00:39.493053+00:00 [queued]> finished with state failed, but the task instance's state attribute is queued. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#task-state-changed-externally
   [2025-06-05T10:01:16.621-0400] {taskinstance.py:3315} ERROR - Executor LocalExecutor(parallelism=32) reported that the task instance <TaskInstance: restore_study_schema.create_schema manual__2025-06-05T14:00:39.493053+00:00 [queued]> finished with state failed, but the task instance's state attribute is queued. Learn more: https://airflow.apache.org/docs/apache-airflow/stable/troubleshooting.html#task-state-changed-externally
   ```
   
   This is then followed by the DagFileProcessorManager exiting and being re-launched in a loop:
   
   ```
   [2025-06-05T10:02:06.106-0400] {manager.py:280} WARNING - DagFileProcessorManager (PID=560642) exited with exit code -11 - re-launching
   [2025-06-05T10:02:06.110-0400] {manager.py:174} INFO - Launched DagFileProcessorManager with pid: 560675
   [2025-06-05T10:02:06.118-0400] {settings.py:63} INFO - Configured default timezone UTC
   [2025-06-05T10:02:09.262-0400] {manager.py:280} WARNING - DagFileProcessorManager (PID=560675) exited with exit code -11 - re-launching
   [2025-06-05T10:02:09.266-0400] {manager.py:174} INFO - Launched DagFileProcessorManager with pid: 560715
   ```
   
   Before we finally get a `Segmentation fault (core dumped)` from the scheduler process itself. (The exit code -11 above means the DagFileProcessorManager child was killed by SIGSEGV.)

   I've tried to inspect the core dump, but it's always truncated and doesn't give me any genuinely useful information.
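
   If it helps, one thing I can try on our side is enabling Python's built-in `faulthandler` in the scheduler environment so that a Python-level traceback is written to stderr when the SIGSEGV arrives, rather than relying on the truncated core. A sketch, assuming the `airflow_local_settings.py` hook is picked up from `$AIRFLOW_HOME/config` (any module imported early in the scheduler process would do):

   ```python
   # $AIRFLOW_HOME/config/airflow_local_settings.py
   # (assumed location -- any module imported at scheduler start-up works)
   import sys
   import faulthandler

   # On SIGSEGV/SIGFPE/SIGABRT/SIGBUS, dump the Python traceback of every
   # thread to stderr. A handler installed before the DagFileProcessorManager
   # is forked should be inherited by the child processes too.
   faulthandler.enable(file=sys.stderr, all_threads=True)
   ```

   Setting `PYTHONFAULTHANDLER=1` in the scheduler's environment should achieve the same thing without touching any files.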
   
   
   ### What you think should happen instead?
   
   _No response_
   
   ### How to reproduce
   
   We install Airflow from PyPI using Python 3.12 in a virtual environment. The scheduler and webserver have both been run in a tmux session and via nohup; the crash occurs either way.
   
   ### Operating System
   
   Red Hat Enterprise Linux 9.5 (Plow)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==9.8.0
   apache-airflow-providers-apache-spark==5.0.0
   apache-airflow-providers-celery==3.10.0
   apache-airflow-providers-common-compat==1.7.0
   apache-airflow-providers-common-io==1.5.0
   apache-airflow-providers-common-sql==1.27.1
   apache-airflow-providers-docker==4.2.0
   apache-airflow-providers-elasticsearch==6.2.0
   apache-airflow-providers-fab==1.5.3
   apache-airflow-providers-ftp==3.12.2
   apache-airflow-providers-http==5.2.0
   apache-airflow-providers-imap==3.8.2
   apache-airflow-providers-postgres==6.1.0
   apache-airflow-providers-sftp==5.1.0
   apache-airflow-providers-smtp==2.0.0
   apache-airflow-providers-snowflake==6.1.0
   apache-airflow-providers-sqlite==4.0.0
   apache-airflow-providers-ssh==4.0.0
   apache-airflow-providers-trino==6.0.1
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   Python 3.12.5
   
   
   ### Anything else?
   
   This occurs almost every other day, even though we are running only a few DAGs.
   I have tried a whole range of configuration changes: increasing the timeout for zombie jobs, increasing the timeout for DAG parsing, and increasing the number of Postgres connections.
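
   For concreteness, these are roughly the knobs I have been adjusting; reading them back through the configuration API shows the effective values the scheduler actually sees. The option names below are my best understanding for 2.10.x and may not all be exact:

   ```python
   # Print the effective values of the settings I have been tuning.
   # Option names are my best understanding for Airflow 2.10.x.
   from airflow.configuration import conf

   settings = [
       ("core", "dag_file_processor_timeout"),           # DAG-file parsing timeout
       ("core", "dagbag_import_timeout"),                # per-file import timeout
       ("scheduler", "scheduler_zombie_task_threshold"), # zombie job timeout
       ("database", "sql_alchemy_pool_size"),            # Postgres connection pool
       ("database", "sql_alchemy_max_overflow"),
   ]

   for section, key in settings:
       print(f"{section}.{key} = {conf.get(section, key, fallback='<not set>')}")
   ```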
   
   I am really not sure what the underlying cause is, and I couldn't find another report of this issue.
   
   Any help or guidance would be hugely appreciated. 
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

