On 13 July 2015 at 17:00, Tom Lane <t...@sss.pgh.pa.us> wrote:

> I wrote:
> > TBH, I think the right thing to do at this point is to revert the entire
> > patch and send it back for ground-up rework.  I think the high-level
> > design is wrong in many ways and I have about zero confidence in most
> > of the code details as well.
>
> > I'll send a separate message about high-level issues,
>
> And here's that.  I do *not* claim that this is a complete list of design
> issues with the patch, it's just things I happened to notice in the amount
> of time I've spent so far (which is already way more than I wanted to
> spend on TABLESAMPLE right now).
>

Interesting that a time-based sample does indeed yield useful results. Good
to know.

That is exactly what this patch provides: a time-based sample, with reduced
confidence in the accuracy of the result. And other things too.
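
For reference, here is roughly how the time-based method is used (the
table and column names are illustrative, and I'm assuming the argument
to SYSTEM_TIME is a scan-time limit in milliseconds, as in the
tsm_system_time module):

  CREATE EXTENSION tsm_system_time;

  -- read as many blocks as fit in ~1000 ms, then stop and
  -- aggregate over whatever rows were reached
  SELECT avg(reading)
  FROM sensor_data TABLESAMPLE SYSTEM_TIME (1000);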


> I'm not sure that we need an extensible sampling-method capability at all,
> let alone that an API for that needs to be the primary focus of a patch.
> Certainly the offered examples of extension modules aren't inspiring:
> tsm_system_time is broken by design (more about that below) and nobody
> would miss tsm_system_rows if it weren't there.


Time-based sampling isn't broken by design. It works exactly according to
the design.

Time-based sampling has been specifically requested by data visualization
developers who are trying to work out how to display anything useful from a
database beyond a certain size. The feature implements a mechanism similar
to the one already deployed in BlinkDB, demonstrated at VLDB 2012. I regard
it as an advanced feature worthy of deployment within PostgreSQL, based
upon user request.

Row-based sampling offers the capability to retrieve a sample of a fixed
size, however big the data set. I shouldn't need to point out that this
technique is already in use within the ANALYZE command. Since it merely
exposes something PostgreSQL already relies on, I see no reason not to
offer it to users. As I'm sure you're aware, there is rigorous math backing
up the use of fixed-size sampling, with recent developments from my
research colleagues at Manchester University that further emphasise the
usefulness of this feature for us.
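
For the avoidance of doubt, a sketch of the row-limited method
(big_table is an illustrative name; syntax as in the tsm_system_rows
module):

  CREATE EXTENSION tsm_system_rows;

  -- a fixed-size sample of ~1000 rows, however large the table grows
  SELECT * FROM big_table TABLESAMPLE SYSTEM_ROWS (1000);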


> What is far more
> important is to get sampling capability into indexscans, and designing
> an API ahead of time seems like mostly a distraction from that.
>

This has come up as a blocker on TABLESAMPLE before. I've got no evidence
that anyone cares about that. I can't imagine a use for it myself; it
was not omitted from this patch because it's difficult, it was omitted
because it's just useless. If anyone ever cares, they can add it in a later
release.


> I'd think seriously about tossing the entire executor-stage part of the
> patch, creating a stateless (history-independent) sampling filter that
> just accepts or rejects TIDs, and sticking calls to that into all the
> table scan node types.  Once you had something like that working well
> it might be worth considering whether to try to expose an API to
> generalize it.  But even then it's not clear that we really need any
> more options than true-Bernoulli and block-level sampling.
>

See above.


> The IBM paper I linked to in the other thread mentions that their
> optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
> was requested.


That sounds broken to me. This patch gives the user exactly what the user
asks for: BERNOULLI or SYSTEM.
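
To illustrate (big_table is a made-up name):

  SELECT * FROM big_table TABLESAMPLE SYSTEM (10);     -- ~10% of blocks
  SELECT * FROM big_table TABLESAMPLE BERNOULLI (10);  -- ~10% of rows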

If you particularly like the idea of mixing the two sampling methods, you
can do so *because* the sampling-method API is extensible. So Mr. DB2 can
have it his way, and he can call it SYSNOULLI if he likes; others can get
it the way they like also. No argument required.

It was very, very obvious that whatever sampling code anybody dreamt up
would be objected to. Clearly, we need a way to allow people to implement
whatever technique they wish. Stratified sampling, modified sampling, new
techniques. Whatever.


> Probably that happens when it decides to do a simple
> indexscan, because then there's no advantage to trying to cluster the
> sampled rows.  But in the case of a bitmap scan, you could very easily do
> either true Bernoulli or block-level sampling simply by adjusting the
> rules about which bits you keep or clear in the bitmap (given that you
> apply the filter between collection of the index bitmap and accessing the
> heap, which seems natural).  The only case where a special scan type
> really seems to be needed is if you want block-level sampling, the query
> would otherwise use a seqscan, *and* the sampling percentage is pretty low
> --- if you'd be reading every second or third block anyway, you're likely
> better off with a plain seqscan so that the kernel sees sequential access
> and does prefetching.  The current API doesn't seem to make it possible to
> switch between seqscan and read-only-selected-blocks based on the sampling
> percentage, but I think that could be an interesting optimization.
> (Another bet that's been missed is having the samplescan logic request
> prefetching when it is doing selective block reads.  The current API can't
> support that AFAICS, since there's no expectation that nextblock calls
> could be done asynchronously from nexttuple calls.)
>

Again, if you like it that way, then write a plugin that way. That's why
it's extensible.


> Another issue with the API as designed is the examinetuple() callback.
> Allowing sample methods to see invisible tuples is quite possibly the
> worst idea in the whole patch.  They can *not* do anything with such
> tuples, or they'll totally break reproducibility: if the tuple is
> invisible to your scan, it might well be or soon become invisible to
> everybody, whereupon it would be subject to garbage collection at the
> drop of a hat.  So if an invisible tuple affects the sample method's
> behavior at all, repeated scans in the same query would not produce
> identical results, which as mentioned before is required both by spec
> and for minimally sane query behavior.  Moreover, examining the contents
> of the tuple is unsafe (it might contain pointers to TOAST values that
> no longer exist); and even if it were safe, what's the point?  Sampling
> that pays attention to the data is the very definition of biased.  So
> if we do re-introduce an API like the current one, I'd definitely lose
> this bit and only allow sample methods to consider visible tuples.
>

That looks like it might be a valid technical point. I'll think more on
that.


> On the point of reproducibility: the tsm_system_time method is utterly
> incapable of producing identical results across repeated scans, because
> there is no reason to believe it would get exactly as far in the same
> amount of time each time.  This might be all right across queries if
> the method could refuse to work with REPEATABLE clauses (but there's
> no provision for such a restriction in the current API).  But it's not
> acceptable within a query.  Another problem with tsm_system_time is that
> its cost/rowcount estimation is based on nothing more than wishful
> thinking, and can never be based on anything more than wishful thinking,
> because planner cost units are not time-based.  Adding a comment that
> says we'll nonetheless pretend they are milliseconds isn't a solution.
> So that sampling method really has to go away and never come back,
> whatever we might ultimately salvage from this patch otherwise.
>

REPEATABLE clearly provides its meaning within the physical limits of the
approach chosen.

A time-based sample could never produce an immutable result. No further
discussion needed.

Perhaps we should document that a query using REPEATABLE is at best STABLE,
in the function-volatility sense, since the physical location of rows could
have changed within the table between executions. So really we should be
documenting that fairly hazy meaning in all cases. If you want a truly
repeatable sample, save the result into another table.
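
Something along these lines (illustrative names):

  -- a truly repeatable sample: materialise it once, query the copy
  CREATE TABLE big_table_sample AS
  SELECT * FROM big_table TABLESAMPLE SYSTEM (1) REPEATABLE (42);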


> (I'm not exactly convinced that the system or tsm_system_rows methods
> are adequately reproducible either, given that their sampling pattern
> will change when the relation block count changes.  Perhaps that's
> unavoidable, but it seems like it might be possible to define things
> in such a way that adding blocks doesn't change which existing blocks
> get sampled.)
>
> A more localized issue that I noticed is that nowhere is it documented
> what the REPEATABLE parameter value is.


So, we have a minor flaw in the docs. Thanks for the report.


> Digging in the code eventually
> reveals that the value is assignment-coerced to an int4, which I find
> rather problematic because a user might reasonably assume that the
> parameter works like setseed's parameter (float8 in the range -1 to 1).
> If he does then he'll get no errors and identical samples from say
> REPEATABLE(0.1) and REPEATABLE(0.2), which is bad.  On the other hand,
> it looks like DB2 allows integer values, so implementing it just like
> setseed might cause problems for people coming from DB2.  I'm inclined to
> suggest that we should define the parameter as being any float8 value,
> and obtain a seed from it with hashfloat8().  That way, no matter whether
> users think that usable values are fractional or integral, they'll get
> sane behavior with different supplied seeds almost always producing
> different samples.


Sounds like some good ideas in that part. The main reason for allowing real
numbers is that it's fairly common to want to specify a sample size smaller
than a 1% sample, which, as you mention above, probably wouldn't be very
much faster than a SeqScan. Not something worth following DB2 on.
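
That is, on a multi-terabyte table something like this should be possible
(illustrative name):

  -- even 1% of a huge table is a lot of blocks; fractions matter
  SELECT * FROM big_table TABLESAMPLE SYSTEM (0.01);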

Your main points about how you would do sampling differently show me that
we were right to pursue an extensible approach. It's hardly a radical
thought given FDWs, custom scans and other APIs.

PostgreSQL is lucky enough to have many users with databases larger than 1
TB. We need to understand that they quite like the idea of running a short
query that uses few resources to get an approximate answer. Just look at
all the Wiki pages and blogs on this subject. Exact query techniques are
important, but so is sampling.
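
A typical example of the kind of query I mean (illustrative name, and
assuming a roughly uniform distribution of rows across blocks):

  -- quick approximate row count from a 1% block sample
  SELECT count(*) * 100 AS approx_rows
  FROM big_table TABLESAMPLE SYSTEM (1);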

In terms of the patch, we've waited approximately a decade for TABLESAMPLE.
I don't see any reason to wait longer, based upon the arguments presented
here. I accept there may be technical points that need to be addressed; I
will look at those tomorrow.

We have a choice here. Please understand that an extension that allows
block sampling is already available and works well. This patch is not
required for some evil plan; it simply allows PostgreSQL users to gain the
benefit of some cool and useful features. The fact that both of us can cite
details of other implementations means we're overdue for this, and I'm
happy it's in 9.5.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
