[jira] [Updated] (AIRFLOW-5690) Task heartbeat is being reported as a warning
[ https://issues.apache.org/jira/browse/AIRFLOW-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Vishwakarma updated AIRFLOW-5690:
----------------------------------------
    Description:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

    was:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

> Task heartbeat is being reported as a warning
> ---------------------------------------------
>
>                 Key: AIRFLOW-5690
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 2.0.0, 1.10.6
>            Reporter: Aditya Vishwakarma
>            Priority: Minor
>
> This bug is on the master branch and v1-10-stable.
> All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.
> {{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s
> I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.
> [https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (AIRFLOW-5690) Task heartbeat is being reported as a warning
Aditya Vishwakarma created AIRFLOW-5690:
-------------------------------------------

             Summary: Task heartbeat is being reported as a warning
                 Key: AIRFLOW-5690
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
             Project: Apache Airflow
          Issue Type: Bug
          Components: worker
    Affects Versions: 2.0.0, 1.10.6
            Reporter: Aditya Vishwakarma


This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.

https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
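[Editor's note] To make the request concrete: the change being asked for amounts to demoting one log call from warning to debug. The sketch below is a hedged paraphrase of the throttling step around local_task_job.py:118-124, not the actual Airflow source; the standalone function and its name are illustrative assumptions.

import logging
import time

log = logging.getLogger(__name__)


def sleep_until_next_heartbeat(time_since_last_heartbeat, heartrate):
    """Paraphrase of the throttling step that fires every JOB_HEARTBEAT_SEC seconds."""
    if time_since_last_heartbeat < heartrate:
        sleep_for = heartrate - time_since_last_heartbeat
        # The issue proposes emitting this at DEBUG (it is currently log.warning),
        # since the message is pure housekeeping and carries no actionable signal.
        log.debug(
            "Time since last heartbeat(%.2f s) < heartrate(%.2f s), sleeping for %s s",
            time_since_last_heartbeat, heartrate, sleep_for,
        )
        time.sleep(sleep_for)

With that change the message still reaches anyone running at DEBUG verbosity, while collectors filtering at WARNING and above stay quiet.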
[jira] [Updated] (AIRFLOW-5690) Task heartbeat is being reported as a warning
[ https://issues.apache.org/jira/browse/AIRFLOW-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Vishwakarma updated AIRFLOW-5690:
----------------------------------------
    Description:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

    was:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every `{{JOB_HEARTBEAT_SEC}}` seconds. It is overloading my logging system.

`{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s`

I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.

https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118

> Task heartbeat is being reported as a warning
> ---------------------------------------------
>
>                 Key: AIRFLOW-5690
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: worker
>    Affects Versions: 2.0.0, 1.10.6
>            Reporter: Aditya Vishwakarma
>            Priority: Minor
>
> This bug is on the master branch and v1-10-stable.
> All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} seconds. It is overloading my logging system.
> {{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s
> I think this is a case of mistaken log level. It should probably be a debug, or even a trace log level.
> [https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.
[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1695#comment-1695 ]

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

[~dimberman] Can you tell me more about the scale testing? I am trying to run very large DAGs and also trying to run a lot of them. In essence, I am running into scaling issues in prod. This is one of a few issues I have faced. For example, the next one on my list is the DagFileProcessor, which can take 300-500 seconds to process a large DAG. I would love to be able to experiment with a performance testing framework.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory. It does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.
[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952109#comment-16952109 ]

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

PR for master: https://github.com/apache/airflow/pull/6340

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory. It does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.
[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952101#comment-16952101 ]

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

[~dimberman] That is a good plan. I am a bit confused about which branch to send the PR against. I forked v1-10-test to implement the fix, as this issue was critical for us. Does this one work? https://github.com/apache/airflow/pull/6338

I will also send a PR for the master branch.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory. It does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.
[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aditya Vishwakarma updated AIRFLOW-5660:
----------------------------------------
    Summary: Scheduler becomes unresponsive when processing large DAGs on kubernetes.  (was: Scheduler becomes responsive when processing large DAGs on kubernetes.)

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory. It does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (AIRFLOW-5660) Scheduler becomes responsive when processing large DAGs on kubernetes.
Aditya Vishwakarma created AIRFLOW-5660:
-------------------------------------------

             Summary: Scheduler becomes responsive when processing large DAGs on kubernetes.
                 Key: AIRFLOW-5660
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
             Project: Apache Airflow
          Issue Type: Bug
          Components: executor-kubernetes
    Affects Versions: 1.10.5
            Reporter: Aditya Vishwakarma
            Assignee: Daniel Imberman


For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop can take 5-10 minutes or more.

It seems that the `_labels_to_key` function in kubernetes_executor loads all tasks with a given execution date into memory. It does this for every task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it will load a million task instances from the db on every tick of the scheduler.

[https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]

A quick fix is to search for the task in the db directly before falling back to a full scan. I can submit a PR for it.

A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, dag_id, task_id, execution_date) somewhere, probably in the metadatabase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
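[Editor's note] To make the "quick fix" concrete, here is a hedged sketch of the lookup order it describes: try a direct, indexed TaskInstance query for the ids encoded in the pod labels, and only fall back to the existing full scan over every task with that execution_date when the direct lookup misses. The function and parameter names below are illustrative assumptions, not the actual kubernetes_executor code.

from typing import Optional, Tuple

from sqlalchemy.orm import Session

from airflow.models import TaskInstance


def resolve_labels_to_key(
    session: Session,
    safe_dag_id: str,
    safe_task_id: str,
    execution_date,
) -> Optional[Tuple[str, str, object]]:
    """Map pod labels back to a (dag_id, task_id, execution_date) key."""
    # Fast path: when the label values were never mangled for pod-name safety,
    # they equal the real ids, so a single indexed query answers the question.
    ti = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == safe_dag_id,
            TaskInstance.task_id == safe_task_id,
            TaskInstance.execution_date == execution_date,
        )
        .one_or_none()
    )
    if ti is not None:
        return ti.dag_id, ti.task_id, ti.execution_date

    # Slow path (the current behaviour): scan every TaskInstance with this
    # execution_date and compare the label-safe form of its ids. Omitted here.
    return None

The point of ordering it this way is that the common case, ids that were never shortened or sanitized, costs one indexed lookup instead of loading every task instance for the execution date on each scheduler tick.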