[jira] [Updated] (AIRFLOW-5690) Task heartbeat is being reported as a warning

2019-10-18 Thread Aditya Vishwakarma (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Vishwakarma updated AIRFLOW-5690:

Description: 
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or 
even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

  was:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or 
even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]


> Task heartbeat is being reported as a warning
> ---------------------------------------------
>
> Key: AIRFLOW-5690
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 2.0.0, 1.10.6
>Reporter: Aditya Vishwakarma
>Priority: Minor
>
> This bug is on the master branch and v1-10-stable.
> All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
> seconds. It is overloading my logging system.
> {{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
> heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s
> I think this is a case of mistaken log level. It should probably be a debug, 
> or even a trace log level.
> [https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (AIRFLOW-5690) Task heartbeat is being reported as a warning

2019-10-18 Thread Aditya Vishwakarma (Jira)
Aditya Vishwakarma created AIRFLOW-5690:
----------------------------------------

 Summary: Task heartbeat is being reported as a warning
 Key: AIRFLOW-5690
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
 Project: Apache Airflow
  Issue Type: Bug
  Components: worker
Affects Versions: 2.0.0, 1.10.6
Reporter: Aditya Vishwakarma


This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or 
even a trace log level.

https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118
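
As a rough, untested sketch of the kind of change I have in mind (illustrative 
only, not the actual local_task_job.py code; the helper name and logger name 
are made up):

{code:python}
import logging
import time
from datetime import datetime, timezone

log = logging.getLogger("airflow.jobs.local_task_job")


def wait_for_next_heartbeat(latest_heartbeat: datetime, heartrate: float) -> None:
    """Hypothetical helper mirroring the wait loop around local_task_job.py#L118.

    The proposed fix is only about the log level: this message fires every
    JOB_HEARTBEAT_SEC and describes normal behaviour, so it is logged at DEBUG
    instead of WARNING.
    """
    since_last = (datetime.now(timezone.utc) - latest_heartbeat).total_seconds()
    if since_last < heartrate:
        sleep_for = heartrate - since_last
        # DEBUG (or lower) instead of WARNING: nothing is wrong here.
        log.debug(
            "Time since last heartbeat(%.2f s) < heartrate(%.2f s), sleeping for %f s",
            since_last,
            heartrate,
            sleep_for,
        )
        time.sleep(sleep_for)
{code}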



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (AIRFLOW-5690) Task heartbeat is being reported as a warning

2019-10-18 Thread Aditya Vishwakarma (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Vishwakarma updated AIRFLOW-5690:

Description: 
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or 
even a trace log level.

[https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]

  was:
This bug is on the master branch and v1-10-stable.

All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
seconds. It is overloading my logging system.

{{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s

I think this is a case of mistaken log level. It should probably be a debug, or 
even a trace log level.

https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118


> Task heartbeat is being reported as a warning
> ---------------------------------------------
>
> Key: AIRFLOW-5690
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5690
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 2.0.0, 1.10.6
>Reporter: Aditya Vishwakarma
>Priority: Minor
>
> This bug is on the master branch and v1-10-stable.
> All my workers have logs like this occurring every {{JOB_HEARTBEAT_SEC}} 
> seconds. It is overloading my logging system.
> {{local_task_job.py:124}} WARNING [1 - MainThread] - Time since last 
> heartbeat(0.02 s) < heartrate(10.0 s), sleeping for 9.983787 s
> I think this is a case of mistaken log level. It should probably be a debug, 
> or even a trace log level.
> [https://github.com/adityav/airflow/blob/master/airflow/jobs/local_task_job.py#L118]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1695#comment-1695
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

[~dimberman] Can you tell me more about the scale testing? I am trying to run 
very large DAGs and also trying to run a lot of them. In essence, I am running 
into scaling issues in prod, and this is one of the few issues I have faced.

For example, the next one on my list is the DagFileProcessor, which can take 
300-500 seconds to process a large DAG. I would love to be able to experiment 
with some performance testing framework.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and it does this for every task 
> in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back 
> to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952109#comment-16952109
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

PR for master: https://github.com/apache/airflow/pull/6340

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and it does this for every task 
> in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back 
> to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952101#comment-16952101
 ] 

Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------

[~dimberman] That is a good plan. 

I am a bit confused about which branch to send the PR to. I forked v1-10-test 
to implement the fix, as this issue was critical for us. Does this one work? 
https://github.com/apache/airflow/pull/6338

I will also send a PR for the master branch.

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and it does this for every task 
> in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back 
> to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (AIRFLOW-5660) Scheduler becomes unresponsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya Vishwakarma updated AIRFLOW-5660:

Summary: Scheduler becomes unresponsive when processing large DAGs on 
kubernetes.  (was: Scheduler becomes responsive when processing large DAGs on 
kubernetes.)

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes
>Affects Versions: 1.10.5
>Reporter: Aditya Vishwakarma
>Assignee: Daniel Imberman
>Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
> can take more than 5-10 minutes.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all 
> tasks with a given execution date into memory, and it does this for every task 
> in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
> will load a million task instances from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back 
> to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (AIRFLOW-5660) Scheduler becomes responsive when processing large DAGs on kubernetes.

2019-10-15 Thread Aditya Vishwakarma (Jira)
Aditya Vishwakarma created AIRFLOW-5660:
----------------------------------------

 Summary: Scheduler becomes responsive when processing large DAGs 
on kubernetes.
 Key: AIRFLOW-5660
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
 Project: Apache Airflow
  Issue Type: Bug
  Components: executor-kubernetes
Affects Versions: 1.10.5
Reporter: Aditya Vishwakarma
Assignee: Daniel Imberman


For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop 
can take more than 5-10 minutes.

It seems that the `_labels_to_key` function in kubernetes_executor loads all 
tasks with a given execution date into memory, and it does this for every task 
in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it 
will load a million task instances from the db on every tick of the scheduler.

[https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]

A quick fix is to search for the task in the db directly before falling back to 
a full scan. I can submit a PR for it.

A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id, 
dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
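
As a rough, untested sketch of the quick fix I have in mind, assuming the 
1.10-era {{TaskInstance}} model and {{provide_session}} helper (the function 
below is illustrative, not the actual kubernetes_executor code, and the 
existing full-scan fallback is elided):

{code:python}
from airflow.models import TaskInstance
from airflow.utils.db import provide_session


@provide_session
def labels_to_key(safe_dag_id, safe_task_id, execution_date, session=None):
    """Hypothetical sketch of a fast path for _labels_to_key.

    Most pod labels are the untruncated dag_id/task_id, so try an indexed
    point query first instead of loading every task instance for the
    execution date.
    """
    ti = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == safe_dag_id,
            TaskInstance.task_id == safe_task_id,
            TaskInstance.execution_date == execution_date,
        )
        .one_or_none()
    )
    if ti is not None:
        return ti.dag_id, ti.task_id, ti.execution_date

    # Slow path (elided here): the labels were sanitized/truncated, so fall
    # back to the existing scan over all task instances for this execution
    # date, matching on the sanitized ids.
    return None
{code}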

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)