On Fri, Jan 06, 2006 at 01:24:41AM -0500, Greg Stark wrote:
> > 5% based on block-based sampling is reasonable; that means a straight 5% of
> > the on-disk size of the table, so 5GB for a 100GB table. With random-row
> > sampling, that would require as much as 25% of the table, making it easier
> > to just scan the whole thing.
>
> Postgres's current sample sizes are clearly geared towards the histograms,
> where they are entirely realistic. All of the distinct estimates are clearly
> just ad hoc attempts based on the existing sampling.
>
> Is a mechanism that is only 5x faster than reading the whole table (assuming
> random_page_cost of 4) and is off by more than a factor of three 10% of the
> time really worth it?
Before we start debating the merits of proposals based on random reads, can
someone confirm that the sampling code actually does read randomly? I looked
at it yesterday; there is a comment stating that the blocks to be scanned are
passed to the analyze function in physical order, and AFAICT the function that
chooses blocks does so strictly by applying a probability test to each block
number as it increments a counter. It appears that the reading is actually
sequential, not random, which makes all the random_page_cost hand-waving null
and void.
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
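(The access pattern described above can be sketched as follows. This is a
hypothetical Python illustration of Knuth-style selection sampling, not
PostgreSQL's actual analyze code: a counter walks the block numbers in
physical order and each block is accepted with probability
needed/remaining, so the accepted blocks always come out in ascending
order and the resulting reads are sequential.)

```python
import random

def sample_blocks(total_blocks, sample_size, rng=random.random):
    """Yield sample_size distinct block numbers, each equally likely,
    by scanning block numbers 0..total_blocks-1 in physical order
    (selection sampling). Output is strictly ascending, so the
    corresponding disk reads are sequential, not random."""
    chosen = 0
    for blockno in range(total_blocks):
        remaining = total_blocks - blockno   # blocks not yet considered
        needed = sample_size - chosen        # blocks still to pick
        # Accept this block with probability needed/remaining.
        if remaining * rng() < needed:
            yield blockno
            chosen += 1
            if chosen == sample_size:
                return

blocks = list(sample_blocks(1000, 50))
# The sample is already in ascending physical order:
assert blocks == sorted(blocks)
```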