On 2015-07-13 15:39, Tom Lane wrote:

Datum
tsm_system_rows_init(PG_FUNCTION_ARGS)
{
     TableSampleDesc *tsdesc = (TableSampleDesc *) PG_GETARG_POINTER(0);
     uint32      seed = PG_GETARG_UINT32(1);
     int32       ntuples = PG_ARGISNULL(2) ? -1 : PG_GETARG_INT32(2);

This is rather curious coding.  Why should this function check only
one of its arguments for nullness?  If it needs to defend against
any of them being null, it really needs to check all.  But in point of
fact, this function is declared STRICT, which means it's a violation of
the function call protocol if the core code ever passes a null to it,
and so this test ought to be dead code.

A similar pattern of ARGISNULL checks in declared-strict functions exists
in all the tablesample modules, not just this one, showing that this is an
overall design error not just a thinko here.  My inclination would be to
make the core code enforce non-nullness of all tablesample arguments so as
to make it unnecessary to check strictness of the tsm*init functions, but
in any case it is Not Okay to just pass nulls willy-nilly to strict C
functions.


The reason for this ugliness is the need to balance modularity against following the SQL standard error codes for BERNOULLI and SYSTEM, which came up as an issue during review (the original code did the checks before calling the sampling method's functions, but produced only a generic error code about a wrong parameter). I considered it okayish mainly because those functions are not SQL-callable and the code which calls them knows how to handle it correctly, but I understand why you don't.

I guess if we did what you proposed in another email in this thread - not exposing the API at the SQL level - we could pass the additional parameters as a List * and leave the validation completely to the function. (And maybe not allow NULLs at all.)
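
To illustrate, a minimal sketch of that idea (the type and function names here are invented for the example; the point is only that NULL handling and range checks would then live in one place, inside the sampling method):

     typedef struct SampleArgExample
     {
         Datum       value;
         bool        isnull;
     } SampleArgExample;

     static int32
     system_rows_get_ntuples_example(List *args)
     {
         SampleArgExample *arg = (SampleArgExample *) linitial(args);

         /* NULL is rejected here, by the method, with a proper errcode */
         if (arg->isnull)
             ereport(ERROR,
                     (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                      errmsg("sample size cannot be NULL")));

         return DatumGetInt32(arg->value);
     }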

Also, I find this coding pretty sloppy even without the strictness angle,
because the net effect of this way of dealing with nulls is that an
argument-must-not-be-null complaint is reported as "out of range",
which is both confusing and the wrong ERRCODE.

     if (ntuples < 1)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("invalid sample size"),
                  errhint("Sample size must be positive integer value.")));

I don't find this to be good error message style.  The secondary comment
is not a "hint", it's an ironclad statement of what you did wrong, so if
we wanted to phrase it like this it should be an errdetail not errhint.
But the whole thing is overly cute anyway because there is no reason at
all not to just say what we mean in the primary error message, eg
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("sample size must be greater than zero")));


Same as above, although now that I re-read the standard I am sure I misunderstood it the first time - it says: "If S is the null value or if S < 0 (zero) or if S > 100, then an exception condition is raised: data exception — invalid sample size."

I originally took it as a literal error message, but they just mean the status code by it.
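
In code, that condition for a percentage-taking method would be something like this sketch (ERRCODE_INVALID_TABLESAMPLE_ARGUMENT is assumed here as the errcode matching the standard's "invalid sample size" exception):

     /*
      * Enforce the standard's condition for the percentage S:
      * error out when S is NaN, S < 0, or S > 100.
      */
     if (isnan(percent) || percent < 0.0 || percent > 100.0)
         ereport(ERROR,
                 (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT),
                  errmsg("sample percentage must be between 0 and 100")));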

While we're on the subject, what's the reason for disallowing a sample
size of zero?  Seems like a reasonable edge case.

Yes, that's a bug.
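
The fix is simply to reject only negative values, e.g. (a sketch, with the message reworded along the lines you suggest):

     /* Zero is a legitimate edge case; reject only negative sample sizes */
     if (ntuples < 0)
         ereport(ERROR,
                 (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                  errmsg("sample size must not be negative")));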


     /* All blocks have been read, we're done */
     if (sampler->doneblocks > sampler->nblocks ||
         sampler->donetuples >= sampler->ntuples)
         PG_RETURN_UINT32(InvalidBlockNumber);

Okay, I lied, I *am* going to complain about this comment.  Comments that
do not accurately describe the code they're attached to are worse than
useless.


That's a copy-pasto from tsm_system_time.
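
The comment just needs to describe both exit conditions, e.g.:

     /* Done when all blocks are read or enough tuples have been returned */
     if (sampler->doneblocks > sampler->nblocks ||
         sampler->donetuples >= sampler->ntuples)
         PG_RETURN_UINT32(InvalidBlockNumber);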

In short, I'd do something more like

     uint32      r;

     /* Safety check to avoid infinite loop or zero result for small n. */
     if (n <= 1)
         return 1;

     /*
      * This should only take 2 or 3 iterations as the probability of 2 numbers
      * being relatively prime is ~61%; but just in case, we'll include a
      * CHECK_FOR_INTERRUPTS in the loop.
      */
     do {
         CHECK_FOR_INTERRUPTS();
         r = (uint32) (sampler_random_fract(randstate) * n);
     } while (r == 0 || gcd(r, n) > 1);

Note however that this coding would result in an unpredictable number
of iterations of the RNG, which might not be such a good thing if we're
trying to achieve repeatability.  It doesn't matter in the context of
this module since the RNG is not used after initialization, but it would
matter if we then went on to do Bernoulli-style sampling.  (Possibly that
could be dodged by reinitializing the RNG after the initialization steps.)
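
For reference, the gcd() helper your sketch relies on is just the classic Euclidean algorithm; a minimal version would be:

     /* Greatest common divisor by Euclid's algorithm; gcd(0, n) returns n */
     static uint32
     gcd(uint32 a, uint32 b)
     {
         while (b != 0)
         {
             uint32      c = a % b;

             a = b;
             b = c;
         }
         return a;
     }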


Bernoulli-style sampling does not need this kind of code, so it's not really an issue - unless you'd like to combine the linear probing and Bernoulli of course, but I don't see any benefit in doing that.
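
If someone did want to combine them, the reinitialization dodge you mention would suffice; something like (a sketch, assuming sampler_random_init_state is the seeding routine that pairs with sampler_random_fract, and that the sampler keeps its generator state in sampler->randstate):

     /*
      * After the variable-iteration initialization above, reseed so that
      * the sampling phase always starts from a deterministic state.
      */
     sampler_random_init_state(seed, sampler->randstate);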

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

