Re: [HACKERS] CLUSTER command progress monitor

Tatsuro Yamada Sun, 10 Sep 2017 19:37:38 -0700

On 2017/09/08 18:55, Robert Haas wrote:

On Wed, Aug 30, 2017 at 10:12 PM, Tatsuro Yamada
<[email protected]> wrote:

   1. scanning heap
   2. sort tuples


These two phases overlap, though. I believe progress reporting for
sorts is really hard.  In the simple case where the data fits in
work_mem, none of the work of the sort gets done until all the data is
read.  Once you switch to an external sort, you're writing batch
files, so a lot of the work is now being done during data loading.
But as the number of batch files grows, the final merge at the end
becomes an increasingly noticeable part of the cost, and eventually
you end up needing multiple merge passes.  I think we need some smart
way to report on sorts so that we can tell how much of the work has
really been done, but I don't know how to do it.


Thanks for the comment.

As you know, CLUSTER command uses SEQ SCAN or INDEX SCAN as a scan method by
cost estimation. In the case of SEQ SCAN, these two phases not overlap.
However, in INDEX SCAN, it overlaps. Therefore I created the phase of "scan
heap and write new heap" when INDEX SCAN was selected.

I agree that progress reporting for sort is difficult. So it only reports
the phase ("sorting tuples") in the current design of progress monitor of 
cluster.
It doesn't report counter of sort.

  heap_tuples_total   | bigint  |           |          |


The patch is getting the value reported as heap_tuples_total from
OldHeap->rd_rel->reltuples.  I think this is pointless: the user can
see that value anyway if they wish.  The point of the progress
counters is to expose things the user couldn't otherwise see.  It's
also not necessarily accurate: it's only an estimate in the best case,
and may be way off if the relation has recently be extended by a large
amount.  I think it's pretty important that we try hard to only report
values that are known to be accurate, because users hate (and mock)
inaccurate progress reports.


Do you mean to use the number of rows by using below calculation instead
OldHeap->rd_rel->reltuples?

 estimate rows = physical table size / average row length

I understand that OldHeap->rd_rel->reltuples is sometimes useless because
it is correct by auto analyze and it can't perform when under a threshold.

I'll add it in next patch and also share more detailed the current design of
progress monitor for cluster.

Regards,
Tatsuro Yamada




--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] CLUSTER command progress monitor

Reply via email to