Re: [HACKERS] TABLESAMPLE patch

Tomas Vondra Mon, 22 Dec 2014 04:36:36 -0800

On 22.12.2014 10:07, Petr Jelinek wrote:
> On 21/12/14 18:38, Tomas Vondra wrote:
>>
>> (1) The patch adds a new catalog, but does not bump CATVERSION.
>>
> 
> I thought this was always done by committer?


Right. Sorry for the noise.

> 
>> (2) The catalog naming (pg_tablesamplemethod) seems a bit awkward,
>>      as it squishes everything into a single chunk. That's inconsistent
>>      with naming of the other catalogs. I think pg_table_sample_method
>>      would be better.
> 
> Fair point, but perhaps pg_tablesample_method then as tablesample is 
> used as single word everywhere including the standard.

Sounds good.

>> (4) I noticed there's an interesting extension in SQL Server, which
>>      allows specifying PERCENT or ROWS, so you can say
>>
>>        SELECT * FROM table TABLESAMPLE SYSTEM (25 PERCENT);
>>
>>      or
>>
>>        SELECT * FROM table TABLESAMPLE SYSTEM (2500 ROWS);
>>
>>      That seems handy, and it'd make migration from SQL Server easier.
>>      What do you think?
> 
> Well doing it exactly this way somewhat kills the extensibility which
> was one of the main goals for me - I allow any kind of parameters for
> sampling and the handling of those depends solely on the sampling
> method. This means that in my approach if you'd want to change what you
> are limiting you'd have to write new sampling method and the query would
> then look like SELECT * FROM table TABLESAMPLE SYSTEM_ROWLIMIT (2500);
> or some such (depending on how you name the sampling method). Or SELECT
> * FROM table TABLESAMPLE SYSTEM (2500, 'ROWS'); if we chose to extend
> the SYSTEM sampling method, that would be also doable.
> 
> The reason for this is that I don't want to really limit too much what
> parameters can be send to sampling (I envision even SYSTEM_TIMED
> sampling method that will get limit as time interval for example).

Good point.

>>
>> (6) This seems slightly wrong, because of long/uint32 mismatch:
>>
>>        long seed = PG_GETARG_UINT32(1);
>>
>>      I think uint32 would be more appropriate, no?
> 
> Probably, although I need long later in the algorithm anyway.

Really? ISTM the sampler_setseed() does not really require long, uint32
would work exactly the same.

>>
>> (9) The current regression tests only use the REPEATABLE cases. I
>>      understand queries without this clause are RANDOM, but maybe we
>>      could do something like this:
>>
>>         SELECT COUNT(*) BETWEEN 5000 AND 7000 FROM (
>>           SELECT id FROM test_tablesample TABLESAMPLE SYSTEM (50)
>>         ) foo;
>>
>>       Granted, there's still a small probability of false positive, but
>>       maybe that's sufficiently small? Or is the amount of code this
>>       tests negligible?
> 
> Well, depending on fillfactor and limit it could be made quite reliable
> I think, I also want to add test with VIEW (I think v2 has a bug there)
> and perhaps some JOIN.

OK.

>>
>> (10) In the initial patch you mentioned it's possible to write custom
>>       sampling methods. Do you think a CREATE TABLESAMPLE METHOD,
>>       allowing custom methods implemented as extensions would be useful?
>>
> 
> Yes, I think so, CREATE/DROP TABLESAMPLE METHOD is on my TODO, but since
> that's just simple mechanical work with no hard problems to solve there
> I can add it later once we have agreement on the general approach of the
> patch (especially in the terms of extensibility).

OK, good to know.

Tomas


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] TABLESAMPLE patch

Reply via email to