StewartThomson opened a new issue #17771:
URL: https://github.com/apache/airflow/issues/17771


   We're noticing high cpu usage stemming from one query. The db backend we are 
using is Mysql 8.0. The DAG has 407 tasks in it.
   
   The query is as follows:
   
   `SELECT 
   task_instance.try_number AS task_instance_try_number,
    task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS 
task_instance_dag_id, 
    task_instance.execution_date AS task_instance_execution_date, 
task_instance.start_date AS task_instance_start_date,
    task_instance.end_date AS task_instance_end_date, task_instance.duration AS 
task_instance_duration, 
    task_instance.state AS task_instance_state, task_instance.max_tries AS 
task_instance_max_tries, 
    task_instance.hostname AS task_instance_hostname, task_instance.unixname AS 
task_instance_unixname, 
    task_instance.job_id AS task_instance_job_id, task_instance.pool AS 
task_instance_pool,
    task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue 
AS task_instance_queue,
    task_instance.priority_weight AS task_instance_priority_weight,
    task_instance.operator AS task_instance_operator, task_instance.queued_dttm 
AS task_instance_queued_dttm, 
    task_instance.queued_by_job_id AS task_instance_queued_by_job_id, 
task_instance.pid AS task_instance_pid,
    task_instance.executor_config AS task_instance_executor_config,
    task_instance.external_executor_id AS task_instance_external_executor_id 
   FROM task_instance, 
        (
       SELECT task_instance.task_id AS task_id,
       max(task_instance.execution_date) AS max_ti 
        FROM task_instance USE INDEX (PRIMARY) 
        WHERE task_instance.dag_id = 'sql_dump_partition_ten_minutes' 
       AND (task_instance.state = 'success' OR task_instance.state = 'skipped') 
       AND task_instance.task_id IN ({tasks.....})
       GROUP BY task_instance.task_id
       ) AS sq 
        WHERE task_instance.dag_id = 'sql_dump_partition_ten_minutes'
         AND task_instance.task_id = sq.task_id 
        AND task_instance.execution_date = sq.max_ti;`
   
   The task ids have been removed. This takes about 2 seconds to run, with a 
dag that runs every 10 minutes. What is this query for? (I notice that removing 
the state check improves performance). 
   
   Curious of any possible solutions. Would removing old task instances to 
reduce the size of the table help? Maybe I could add the task_instance's state 
to the primary key.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to