Following up on my previous email, I have added below some use cases which I find very relevant when one needs to know the progress of a SQL query. The command can be used with any SQL query (SELECT, UPDATE, DELETE, INSERT).
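As a reading aid for the reports below: each percentage is simply the ratio of a counter collected during execution to the planner's estimated total, so it can exceed 100% when the estimate is low (see the DELETE use case). A minimal sketch of that arithmetic, using a hypothetical helper name that is not taken from the patch:

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): percentage of work done,
 * as printed in the reports, e.g. "rows 3514321/4166700 84%".
 * The denominator is a planner estimate, so the result may exceed 100%. */
int
progress_pct(uint64_t done, uint64_t total)
{
    if (total == 0)
        return 0;               /* avoid division by zero */
    return (int) (done * 100 / total);
}
```

For instance, in the DELETE use case below, progress_pct(1113104, 848485) yields 131, matching the reported "rows 1113104/848485 131%".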
The tables used have been created with:

CREATE TABLE T_1M (id integer, md5 text);
INSERT INTO T_1M SELECT generate_series(1,1000000) AS id, md5(random()::text) AS md5;
CREATE TABLE T_10M (id integer, md5 text);
INSERT INTO T_10M SELECT generate_series(1,10000000) AS id, md5(random()::text) AS md5;

All the different leaf node types are implemented.

1/ Parallel select with sort (no index)
=======================================

=> Terminal running the long SQL query:

test=# select * from t_10M order by md5;

=> Terminal monitoring SQL query progression:

test=# select pid,query from pg_stat_activity ;
  pid  |                  query
-------+------------------------------------------
  8062 |
  8065 |
 19605 | select pid,query from pg_stat_activity ;
 20830 | select * from t_10M order by md5;
 20832 | select * from t_10M order by md5;
 20833 | select * from t_10M order by md5;
  8060 |
  8059 |
  8061 |
(9 rows)

test=# PROGRESS 20830
test-# ;
                              PLAN PROGRESS
------------------------------------------------------------------------
 Gather Merge
   -> Sort=> dumping tuples to tapes / merging tapes
        rows r/w merge 2167923/2167908 rows r/w effective 0/3514320 0%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
----------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 4707198/4691167 rows r/w effective 16016/3514320 0%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
-----------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 4809857/4691167 rows r/w effective 118675/3514320 3%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
-----------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 4883715/4691167 rows r/w effective 192533/3514320 5%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
-----------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 4948381/4691167 rows r/w effective 257199/3514320 7%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
-----------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 5022137/4691167 rows r/w effective 330955/3514320 9%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS
------------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 5079083/4691167 rows r/w effective 387901/3514320 11%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
                                PLAN PROGRESS
------------------------------------------------------------------------------
 Gather Merge
   -> Sort=> final merge sort on tapes
        rows r/w merge 5144499/4691167 rows r/w effective 453317/3514320 12%
        Sort Key: md5
        -> Parallel Seq Scan on t_10m => rows 3514321/4166700 84%
(5 rows)

test=# PROGRESS 20830;
    PLAN PROGRESS
----------------------
 <out of transaction>
(1 row)

test=#

The SQL query was interrupted before completion.

2/ Insert into table
====================

=> Terminal running the long SQL query:

test=# INSERT INTO T_10M SELECT generate_series(10000001, 12000000) AS id, md5(random()::text) AS md5;

=> Terminal monitoring SQL query progression:

test=# PROGRESS 20830;
    PLAN PROGRESS
----------------------
 <out of transaction>
(1 row)

test=# PROGRESS 20830;
     PLAN PROGRESS
-----------------------
 Insert => rows 718161
   -> ProjectSet
      -> Result
(3 rows)

test=#
PROGRESS 20830;
     PLAN PROGRESS
------------------------
 Insert => rows 1370255
   -> ProjectSet
      -> Result
(3 rows)

test=# PROGRESS 20830;
     PLAN PROGRESS
------------------------
 Insert => rows 1916731
   -> ProjectSet
      -> Result
(3 rows)

test=# PROGRESS 20830;
 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

test=#

3/ Delete with like clause
==========================

=> Terminal running the long SQL query:

test=# DELETE FROM T_10M WHERE md5 like '%cb%';

=> Terminal monitoring SQL query progression:

test=# PROGRESS 20830;
 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

test=# PROGRESS 20830;
                             PLAN PROGRESS
----------------------------------------------------------------------
 Delete => rows 91906
   -> Seq Scan on t_10m => rows 91906/848485 10% blks 6668/100000 6%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS
-------------------------------------------------------------------------
 Delete => rows 151900
   -> Seq Scan on t_10m => rows 151900/848485 17% blks 11019/100000 11%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS
-------------------------------------------------------------------------
 Delete => rows 309533
   -> Seq Scan on t_10m => rows 309533/848485 36% blks 22471/100000 22%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS
-------------------------------------------------------------------------
 Delete => rows 705968
   -> Seq Scan on t_10m => rows 705968/848485 83% blks 51274/100000 51%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                              PLAN PROGRESS
--------------------------------------------------------------------------
 Delete => rows 913843
   -> Seq Scan on t_10m => rows 913843/848485 107% blks 66417/100000 66%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
                               PLAN PROGRESS
---------------------------------------------------------------------------
 Delete => rows 1113104
   -> Seq Scan on t_10m => rows 1113104/848485 131% blks 80881/100000 80%
        Filter: (md5 ~~ '%cb%'::text)
(3 rows)

test=# PROGRESS 20830;
 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

test=#

The monitoring report above shows:
- the Seq Scan node with the number of rows scanned and the number of blocks scanned/read;
- the Delete node with the number of rows deleted.

4/ Select with offset and limit clause
======================================

=> Terminal running the long SQL query:

test=# select * from t_10M order by md5 offset 80 limit 10;

=> Terminal monitoring SQL query progression:

test=# \watch PROGRESS 20830;

Wed 19 Apr 2017 04:36:16 PM CEST (every 1s)

 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

Wed 19 Apr 2017 04:36:17 PM CEST (every 1s)

 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

Wed 19 Apr 2017 04:36:18 PM CEST (every 1s)

                                PLAN PROGRESS
------------------------------------------------------------------------------
 Limit => offset 0% limit 0%
   -> Sort=> loading tuples in memory 90
        Sort Key: md5
        -> Seq Scan on t_10m => rows 174392/11586584 1% blks 1640/100000 1%
(4 rows)

...

                                  PLAN PROGRESS
----------------------------------------------------------------------------------
 Limit => offset 0% limit 0%
   -> Sort=> loading tuples in memory 90
        Sort Key: md5
        -> Seq Scan on t_10m => rows 1656828/11586584 14% blks 15600/100000 15%
(4 rows)

...

                                  PLAN PROGRESS
----------------------------------------------------------------------------------
 Limit => offset 0% limit 0%
   -> Sort=> loading tuples in memory 90
        Sort Key: md5
        -> Seq Scan on t_10m => rows 4954207/11586584 42% blks 46640/100000 46%
(4 rows)

Wed 19 Apr 2017 04:36:35 PM CEST (every 1s)

...

                                  PLAN PROGRESS
----------------------------------------------------------------------------------
 Limit => offset 0% limit 0%
   -> Sort=> loading tuples in memory 90
        Sort Key: md5
        -> Seq Scan on t_10m => rows 7837687/11586584 67% blks 73772/100000 73%
(4 rows)

Wed 19 Apr 2017 04:36:41 PM CEST (every 1s)

...
                                  PLAN PROGRESS
-----------------------------------------------------------------------------------
 Limit => offset 0% limit 0%
   -> Sort=> loading tuples in memory 90
        Sort Key: md5
        -> Seq Scan on t_10m => rows 10378786/11586584 89% blks 97690/100000 97%
(4 rows)

Wed 19 Apr 2017 04:36:49 PM CEST (every 1s)

 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

5/ Sample scan
==============

=> Terminal running the long SQL query:

test=# select * from t_10m tablesample system(50);

=> Terminal monitoring SQL query progression:

 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

Wed 19 Apr 2017 04:44:12 PM CEST (every 1s)

                              PLAN PROGRESS
-----------------------------------------------------------------------
 Sample Scan on t_10m => rows 783274/5793292 13% blks 14616/100000 14%
   Sampling: system ('50'::real)
(2 rows)

Wed 19 Apr 2017 04:44:13 PM CEST (every 1s)

                              PLAN PROGRESS
------------------------------------------------------------------------
 Sample Scan on t_10m => rows 2514675/5793292 43% blks 47076/100000 47%
   Sampling: system ('50'::real)
(2 rows)

Wed 19 Apr 2017 04:44:14 PM CEST (every 1s)

                              PLAN PROGRESS
------------------------------------------------------------------------
 Sample Scan on t_10m => rows 4031400/5793292 69% blks 75625/100000 75%
   Sampling: system ('50'::real)
(2 rows)

Wed 19 Apr 2017 04:44:15 PM CEST (every 1s)

 PLAN PROGRESS
----------------
 <idle backend>
(1 row)

2017-04-19 16:13 GMT+02:00 Remi Colinet <remi.coli...@gmail.com>:

> Maksim,
>
> 2017-04-18 20:31 GMT+02:00 Maksim Milyutin <m.milyu...@postgrespro.ru>:
>
>> On 18.04.2017 17:39, Remi Colinet wrote:
>>
>>> Hello Maksim,
>>>
>>> The core implementation I suggested for the new PROGRESS command uses
>>> different functions from the ones used by EXPLAIN for its core
>>> implementation. Some source code is shared with the EXPLAIN command,
>>> but this shared code is only related to quals, properties, children,
>>> subPlans and a few other nodes.
>>>
>>> All other code for PROGRESS is new code.
>>>
>>> I don't believe explain.c code can be fully shared with that of the
>>> new PROGRESS command. These 2 commands have different purposes.
>>> The core code used for the new PROGRESS command is very different from
>>> the core code used for EXPLAIN.
>>
>> Perhaps you will be forced to duplicate significant snippets of
>> functionality from explain.c into your progress.c.
>
> Currently, little code is duplicated between the EXPLAIN and PROGRESS
> commands. The duplicated code could be moved to the file
> src/backend/commands/report.c, which is used to gather code shared
> between the 2 commands. I will try to complete this code sharing as
> much as possible.
>
> The main point is that PROGRESS uses the same design pattern as EXPLAIN
> by parsing the query tree. The work horse of the PROGRESS command is
> ProgressNode(), which calls sub nodes recursively until we reach leaf
> nodes such as SeqScan, IndexScan, TupleStore, Sort, Material, ... .
> The EXPLAIN command uses a similar work horse with the function
> ExplainNode(), which eventually calls different leaf node functions.
>
> Some of the leaf nodes which are common to the 2 commands have been put
> in the file src/backend/commands/report.c. Maybe some further code
> sharing is also possible for the work horse, by using a template
> function which would call the EXPLAIN-specific or PROGRESS-specific
> leaf node functions.
>
>>> Regarding the queryDesc state of a SQL query upon receiving a request
>>> to report its execution progress, it does not bring any issue. The
>>> request is noted when the signal is received by the monitored backend.
>>> Then, the backend continues on its execution code path. When
>>> interrupts are checked in the executor code, the request will be
>>> dealt with.
>>
>> Yes, interrupts are checked in the CHECK_FOR_INTERRUPTS entries.
>>
>>> When the request is being dealt with, the monitored backend will stop
>>> its execution and report the progress of the SQL query. Whatever the
>>> status of the SQL query, the progress.c code checks the status and
>>> reports either that the SQL query does not have a valid status, or
>>> otherwise the current execution state of the SQL query.
>>>
>>> SQL query status checking is about:
>>> - idle transaction
>>> - out of transaction status
>>> - null planned statement
>>> - utility command
>>> - self monitoring
>>>
>>> Other tests can be added if needed to exclude some SQL query states.
>>> Such checking is done in void HandleProgressRequest(void).
>>> I do not see why a SQL query progression would not be possible in this
>>> context. Even when the queryDesc is NULL, we can just report an <idle
>>> transaction> output. This is currently the case with the suggested
>>> patch.
>>
>> It's an interesting question - how much the active query is in a usable
>> state at the stage of execution. Tom Lane noticed that
>> CHECK_FOR_INTERRUPTS doesn't give us a 100% guarantee about full
>> consistency [1].
>
> I wonder what you mean by usable state.
>
> Currently, the suggested code tests the queryDesc pointer and all the
> sub node pointers in order to detect NULL pointers. When the progress
> report is collected by the backend, this backend does the collection and
> consequently does not run the query. So the query tree is not being
> modified. At this moment, whatever the query state, we can manage to
> deal with its static state. It is only a tree, which could also be just
> a NULL pointer in the most extreme case. Such a case is dealt with in
> the current code.
>
>>> So far, I've found this new command very handy. It allows one to
>>> evaluate the time needed to complete a SQL query.
>>
>> Could you explain how you get the percent of execution for nodes of the
>> plan tree and overall for the query?
>
> The progress of execution of the query is computed as follows, at 2
> different places for each leaf node type (Scan, IndexScan, Sort,
> Material, TupleStore, ...):
>
> - one place in the executor code, or in the access methods code, or in
> the sort utilities code, used during the execution of the SQL query, in
> which the following values are counted, for instance: rows R/W, blocks
> R/W, tapes R/W used for sorts, tuple store R/W, ... . Some of these
> values are already computed in the current PostgreSQL official source
> code. Some other values had to be added and collected.
>
> - one place in the leaf function of each node type (ProgressScan(),
> ProgressSort(), ...) in which percentages are computed and then dumped
> in the report, together with the raw values collected during execution.
> The dump details can be selected with the VERBOSE option of the PROGRESS
> command (for instance: # PROGRESS VERBOSE $pid).
>
> For instance:
>
> 1/ read/write rows are collected when running the executor, in the file
> src/backend/executor/execProcnode.c
> ==============================================================================
>
> This is already implemented in the current official source tree. Nothing
> more needs to be implemented to collect values (the total number of rows
> and the number of rows already fetched are collected).
> The report is implemented in the leaf function ProgressScanRows().
>
> 2/ read/write blocks are collected in the file
> src/backend/access/heap/heapam.c
> ==========================================================
>
> This is already implemented in the current official source tree. Nothing
> more needs to be implemented to collect values during execution.
> The report is implemented in the leaf function ProgressScanBlks().
>
> 3/ the sort progressions are collected in the file
> src/backend/utils/sort/tuplesort.c
> ==========================================================
>
> [root@rco pg]# git diff --stat master.. src/backend/utils/sort/tuplesort.c
>  src/backend/utils/sort/tuplesort.c | 142 ++++++++++++++++++++++-------------
>  1 file changed, 127 insertions(+), 15 deletions(-)
> [root@rco pg]#
>
> New fields have been added to compute the different I/O operations
> (read/write/merge/...) per tape, for instance, during a sort on tapes.
> The report for the Sort node is implemented in the leaf function
> ProgressSort().
>
> 4/ the tuple store progressions are computed in the file
> src/backend/utils/sort/tuplestore.c
> =================================================================
>
> [root@rco pg]# git diff --stat master.. src/backend/utils/sort/tuplestore.c
>  src/backend/utils/sort/tuplestore.c | 73 +++++++++++++++++++-------------
>  1 file changed, 56 insertions(+), 17 deletions(-)
> [root@rco pg]#
>
> New fields have been added to collect the I/O needed for such a tuple
> store. The report for the TupleStore node is implemented in the leaf
> function ProgressTupleStore().
>
> Other node types have been implemented: TidScan, IndexScan, LimitScan,
> CustomScan, Hash, ModifyTable. Such nodes may require some new fields to
> collect values during the SQL query execution.
>
> Overall, the overhead caused by the new values collected during the SQL
> query execution is very low. Only a few values need to be collected.
>
>>> A further improvement would be to report the memory, disk and time
>>> resources used by the monitored backend. An overuse of memory, disk
>>> and time resources can prevent the SQL query from completing.
>>
>> This functionality is somehow implemented in explain.c. You can see my
>> patch to this file [2]. I only manipulate runtime statistics (data in
>> the structure 'Instrumentation') to achieve the printing of the state
>> of a running query.
>
> I will check your patch and try to add such a feature to the current
> patch.
> It provides a valuable hint to estimate whether a SQL query has a chance
> to complete and will not reach the resource limits.
>
>> 1. https://www.postgresql.org/message-id/24182.1472745492%40sss.pgh.pa.us
>> 2. https://github.com/postgrespro/pg_query_state/blob/master/runtime_explain.patch
>
> Thanks for your suggestions and comments.
>
> Regards
> Remi
>
>> --
>> Maksim Milyutin
>> Postgres Professional: http://www.postgrespro.com
>> Russian Postgres Company
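P.S. To make the ProgressNode() design discussed above more concrete, here is a rough sketch of the recursive walk with a per-leaf report line. The node structure, field names and functions are made up for illustration; they are not the patch's actual types or code:

```c
#include <stdio.h>
#include <stdint.h>

/* Made-up, simplified plan node; real PostgreSQL plan nodes are far richer. */
typedef enum { NODE_GATHER_MERGE, NODE_SORT, NODE_SEQ_SCAN } NodeTag;

typedef struct PlanNode
{
    NodeTag          tag;
    const char      *name;
    uint64_t         rows_done;   /* counter updated while the query runs */
    uint64_t         rows_total;  /* planner estimate of total rows */
    struct PlanNode *child;       /* single child, for simplicity */
} PlanNode;

/* Format one report line; leaf scans also get counters and a percentage. */
int
progress_line(char *buf, size_t len, const PlanNode *node, int depth)
{
    if (node->tag == NODE_SEQ_SCAN && node->rows_total > 0)
        return snprintf(buf, len, "%*s%s => rows %llu/%llu %d%%",
                        depth * 3, "", node->name,
                        (unsigned long long) node->rows_done,
                        (unsigned long long) node->rows_total,
                        (int) (node->rows_done * 100 / node->rows_total));
    return snprintf(buf, len, "%*s%s", depth * 3, "", node->name);
}

/* Work horse in the style of ProgressNode()/ExplainNode(): recurse from the
 * root until the leaf nodes are reached, emitting one line per node. */
void
progress_tree(const PlanNode *node, int depth, FILE *out)
{
    char line[256];

    if (node == NULL)
        return;                 /* a NULL tree is reported upstream instead */
    progress_line(line, sizeof(line), node, depth);
    fprintf(out, "%s\n", line);
    progress_tree(node->child, depth + 1, out);
}
```

The real command additionally handles multiple children, quals, sort/tape state and the various backend states (<idle backend>, <out of transaction>), but the overall shape is this depth-first descent.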