On Fri, Oct 10, 2014 at 10:53 AM, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 10.10.2014 14:10, Tomas Vondra wrote: > > Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a): > >> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <j...@agliodbs.com> wrote: > >>> Yes, it's only intractable if you're wedded to the idea of a tiny, > >>> fixed-size sample. If we're allowed to sample, say, 1% of the table, > we > >>> can get a MUCH more accurate n_distinct estimate using multiple > >>> algorithms, of which HLL is one. While n_distinct will still have some > >>> variance, it'll be over a much smaller range. > >> > >> I've gone looking for papers on this topic but from what I read this > >> isn't so. To get any noticeable improvement you need to read 10-50% of > >> the table and that's effectively the same as reading the entire table > >> -- and it still had pretty poor results. All the research I could find > >> went into how to analyze the whole table while using a reasonable > >> amount of scratch space and how to do it incrementally. > > > > I think it's really difficult to discuss the estimation without some > basic > > agreement on what are the goals. Naturally, we can't get a perfect > > estimator with small samples (especially when the sample size is fixed > and > > not scaling with the table). But maybe we can improve the estimates > > without scanning most of the table? > > > > FWIW I've been playing with the adaptive estimator described in [1] and > > the results looks really interesting, IMHO. So far I was testing it on > > synthetic datasets outside the database, but I plan to use it instead of > > our estimator, and do some more tests. > > Attached is an experimental patch implementing the adaptive estimator. > > It was fairly simple (although it's a bit messy). It only computes the > estimates for the "scalar" case (i.e. data types that we can sort). > Implementing this for the "minimal" case is possible, but requires a bit > more work. > > It only computes the estimate and prints a WARNING with both the current > and new estimate, but the old estimate is stored. > When I run this patch on the regression database, I get a case where the current method is exact but the adaptive one is off: WARNING: ndistinct estimate current=676.00 adaptive=906.00 select count(distinct stringu1) from onek; 676 It should be seeing every single row, so I don't know why the adaptive method is off. Seems like a bug. Cheers, Jeff