Re: [HACKERS] ANALYZE command progress checker

vinayak Mon, 03 Apr 2017 18:30:06 -0700


On 2017/03/30 17:39, Masahiko Sawada wrote:

On Wed, Mar 29, 2017 at 5:38 PM, vinayak
<[email protected]> wrote:

On 2017/03/25 4:30, Robert Haas wrote:

On Fri, Mar 24, 2017 at 3:41 AM, vinayak
<[email protected]> wrote:

I have updated the patch.

You can't change the definition of AcquireSampleRowsFunc without
updating the documentation in fdwhandler.sgml, but I think I don't
immediately understand why that's a thing we want to do at all if
neither file_fdw nor postgres_fdw are going to use the additional
argument.  It seems to be entirely pointless because the new value
being passed down is always zero and even if the callee were to update
it, do_analyze_rel() would just ignore the change on return.  Am I
missing something here?  Also, the whitespace-only changes to fdwapi.h
related to AcquireSampleRowsFunc going to get undone by pgindent, so
even if there's some reason to change that you should leave the lines
that don't need changing untouched.

I reviewed v7 patch.


When I ran ANALYZE command to the table having 5 millions rows with 3
child tables, the progress report I got shows the strange result. The
following result was output after sampled all target rows from child
tables (i.g, after finished acquire_inherited_sample_rows). I think
the progress information should report 100% complete at this point.

#= select * from pg_stat_progress_analyze ;
   pid  | datid | datname  | relid |              phase               |
num_target_sample_rows | num_rows_sampled
-------+-------+----------+-------+----------------------------------+------------------------+------------------
  81719 | 13215 | postgres | 16440 | collecting inherited sample rows |
                3000000 |          1800000
(1 row)

I guess the cause of this is that you always count the number of
sampled rows in acquire_sample_rows staring from 0 for each child
table. num_rows_sampled is reset whenever starting collecting sample
rows.
Also, even if table has child table, the total number of sampling row
is not changed. I think that main differences between analyzing on
normal table and on partitioned table is where we fetch the sample row
from. So I'm not sure if adding "computing inherited statistics" phase
is desirable. If you want to report progress of collecting sample rows
on child table I think that you might want to add the information
showing which child table is being analyzed.

--
pg_stat_progress_analyze.num_target_sample_rows is set while ignoring
the number of rows the target table has actually. So If the table has
rows less than target number of rows, the num_rows_sampled never reach
to num_target_sample_rows.

--
@@ -1180,6 +1213,9 @@ acquire_sample_rows(Relation onerel, int elevel,
                                 }

                                 samplerows += 1;
+
+                               /* Report number of rows sampled so far */
+
pgstat_progress_update_param(PROGRESS_ANALYZE_NUM_ROWS_SAMPLED,
numrows);
                         }
                 }

I think that it's wrong to use numrows variable to report the progress
of the number of rows sampled. As the following comment mentioned, we
first save row into rows[] and increment numrows until numrows <
targrows. And then we could replace stored sample row with new sampled
row. That is, after collecting "numrows" rows, analyze can still take
a long time to get and replace the sample row. To computing progress
of collecting sample row, IMO reporting the number of blocks we
scanned so far is better than number of sample rows. Thought?

     /*
      * The first targrows sample rows are simply copied into the
      * reservoir. Then we start replacing tuples in the sample
      * until we reach the end of the relation.  This algorithm is
      * from Jeff Vitter's paper (see full citation below). It
      * works by repeatedly computing the number of tuples to skip
      * before selecting a tuple, which replaces a randomly chosen
      * element of the reservoir (current set of tuples).  At all
      * times the reservoir is a true random sample of the tuples
      * we've passed over so far, so when we fall off the end of
      * the relation we're done.
      */

Looks good to me.

In the attached patch I have reported number of blocks scanned so far instead of number of sample rows.

In the 'collecting inherited sample rows' phase, child_relid is reported as a separate column. The child_relid is reported its value only in 'collecting inherited sample rows' phase. For single relation it return 0.

I am not sure whether this column would helpful or not.
Any suggestions are welcome.

+            /* Reset rows processed to zero for the next column */
+
pgstat_progress_update_param(PROGRESS_ANALYZE_NUM_ROWS_PROCESSED,
0);

This seems like a bad design.  If you see a value other than zero in
this field, you still won't know how much progress analyze has made,
because you don't know which column you're processing.  I also feel
like you're not paying enough attention to a key point here that I
made in my last review, which is that you need to look at where
ANALYZE actually spends most of the time.  If the process of computing
the per-row statistics consumes significant CPU time, then maybe
there's some point in trying to instrument it, but does it?  Unless
you know the answer to that question, you don't know whether there's
even a point to trying to instrument this.

Understood. The computing statistics phase takes long time so I am looking
at the code.

Yes, ANALYZE spends most of time on computing statistics phase. I
measured the duration time for each phases on my environment. I set up
the table by pgbench -i -s 100 (total 10 million rows), and filled the
filler column of pgbench_accounts table with random string. And I set
default_statistics_target to 1000, and ran analyze on that table. The
analyze scanned all blocks of target table and collected 3000000
sample rows. The durations of each phase are,

* Sampling : 7 sec
* Computing statistics : 28 sec
     * aid column : 1 sec
     * bid column : 1 sec
     * abalance column : 1 sec
     * filler column : 25 sec

I'm not sure if we can calculate the meaningful progress information
in computing statistics function such as compute_scalar_stats,
compute_trivial_stats. But I think that at least showing which
statistics of column is being computed would be good information for
user.

I agree with you that showing statistics of the column is being computed would be good information but

progress reporting API is only supported for int64 values.

So I think it would be good if we report such value that will be helpful for user in computing statistics phase.



Regards,
Vinayak Pokale
NTT Open Source Software Center

pg_stat_progress_analyze_v8.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ANALYZE command progress checker

Reply via email to