[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448980#comment-16448980
 ] 

John Arnold commented on AIRFLOW-2367:
--

Manual repro of the top query only takes 250ms...

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png, postgres.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448978#comment-16448978
 ] 

John Arnold commented on AIRFLOW-2367:
--

I believe the "top talker" query is from this query in models.py:

@provide_session
 def get_task_instances(self, state=None, session=None):
 """
 Returns the task instances for this dag run
 """
 TI = TaskInstance
 tis = session.query(TI).filter(
 TI.dag_id == self.dag_id,
 TI.execution_date == self.execution_date,
 )

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png, postgres.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448974#comment-16448974
 ] 

John Arnold commented on AIRFLOW-2367:
--

This insert is taking the second-most time (about 1/3 of the above):

INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) 
VALUES (?::timestamptz, ?, ?, ?, ?::timestamptz, ?, ?) RETURNING log.id

I verified there are no indexes, triggers or weird constraints that would make 
it slow, it just has a high volume.

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png, postgres.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448969#comment-16448969
 ] 

John Arnold commented on AIRFLOW-2367:
--

This query is taking the most time:
SELECT task_instance.try_number AS task_instance_try_number, 
task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS 
task_instance_dag_id, task_instance.execution_date AS 
task_instance_execution_date, task_instance.start_date AS 
task_instance_start_date, task_instance.end_date AS task_instance_end_date, 
task_instance.duration AS task_instance_duration, task_instance.state AS 
task_instance_state, task_instance.max_tries AS task_instance_max_tries, 
task_instance.hostname AS task_instance_hostname, task_instance.unixname AS 
task_instance_unixname, task_instance.job_id AS task_instance_job_id, 
task_instance.pool AS task_instance_pool, task_instance.queue AS 
task_instance_queue, task_instance.priority_weight AS 
task_instance_priority_weight, task_instance.operator AS 
task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, 
task_instance.pid AS task_instance_pid, task_instance.uuid AS 
task_instance_uuid 
FROM task_instance 
WHERE task_instance.dag_id = ? AND task_instance.execution_date = ?::timestamptz

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png, postgres.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread John Arnold (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448953#comment-16448953
 ] 

John Arnold commented on AIRFLOW-2367:
--

[~bolke]  Any suggestions on what metrics or configuration options?  We've been 
looking over the database (top 10 queries etc) and there are no surprises that 
I can see. The top query by far is for task_instance table and all the 
conditionals are for indexed columns.  I went through basically every query in 
models.py looking for any that are using unindexed columns, and didn't find any.

I've attached a screenshot of the top 10 queries.

 

We played with our connection pool sizes, thinking that perhaps we were 
hammering the db with connections, but that didn't seem to make any difference.

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png, postgres.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2367) High POSTGRES DB CPU utilization

2018-04-23 Thread Bolke de Bruin (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448812#comment-16448812
 ] 

Bolke de Bruin commented on AIRFLOW-2367:
-

You really need to provide more metrics and configuration options. The 
scheduler can be really busy when you have a lot of dags. worker memory also 
affects your performance. In other words let a DBA have a look at what you are 
doing and report back.

> High POSTGRES DB CPU utilization
> 
>
> Key: AIRFLOW-2367
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2367
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: Airflow 2.0, 1.9.0
>Reporter: John Arnold
>Priority: Major
> Attachments: cpu.png
>
>
> We are seeing steady state 70-90% CPU utilization.  It feels like a missing 
> index kind of problem, as our TPS rate is really low, I'm not seeing any long 
> running queries, connection counts are reasonable (low hundreds) and locks 
> also look reasonable (not many exclusive / write locks)
> We shut down the webserver and it doesn't go away, so it doesn't seem to be 
> in that part of the code. My guess is either the scheduler has an inefficient 
> query, or the (Celery) executor code path does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)