Please redirect discussions to [email protected].

On 2011-09-26 10:41, Roberto Cornacchia wrote:
> Lefteris, I would like to add something about the syntax (and semantics)
> of this feature.
> See below
>  
> 
>     +/*
>     + * @- Uniform Sampling.
>     + *
>     + * A new SQL operator has been added to support sampling the result
>     of a query.
>     + * The syntax for sampling is:
>     + * SELECT ... FROM ... WHERE ... SAMPLE s
>     + *
>     + * where s if is an integer greater than 1, it defines the number
>     of rows to be
>     + * in the sample. If s is a double between [0.0,1.0] the it refers
>     to the
>     + * percentage of the result to be sampled. That is if s=0.3 then
>     the sample
>     + * will be 30% the size of the query result.
>     + *
> 
> 
> It has been discussed already whether this choice or an explicit PERCENT
> is more intuitive.
> Personally, I don't care much about that.
>  
> 
>     + * SAMPLE is been treated as LIMIT, ORDER BY, etc., that means that
>     it can only
>     + * be in the outer most SELECT clause, i.e., SAMPLE cannot appear in a
>     + * subquery. 
> 
> 
> This is much more interesting to me.
> I see that SAMPLE fits in the same semantics as LIMIT. In fact, you can
> think of it as a special LIMIT.
> What I don't like about it, is that it then shares the same limitations
> (which are absurd, as the whole SQL often is, from my point of view),
> which is what you mention above about the subquery.
> You suggest a workaround here:
> 
>     However, if this is needed, then one may define a function, for
>     + * example
>     + *
>     + * CREATE FUNCTION mysample ()
>     + * RETURNS TABLE(col a,...)
>     + * BEGIN
>     + *    RETURN
>     + *      SELECT a,...
>     + *      FROM name_table
>     + *      SAMPLE 100;
>     + * end;
>     + *
>     + * and then use function mysample() for example to populate a new
>     table with
>     + * the sample. E.g.,
>     + *
>     + * INSERT INTO sample_table (SELECT * FROM mysample());
>     + *
>     + *
> 
> 
> To me, writing a function with an hard-coded table name in it is...
> weird! Like writing a function that can only do 2+2.
> (This could be much better if table identifiers were allowed as function
> parameters, a feature that I'd love to see implemented!).
> 
> To this, I much prefer a solution based on the window functions RANK()
> OVER()  and ROW_NUMBER() OVER() which luckily are implemented in
> MonetDB, and the function rand().
> 
> Your example above becomes a regular query that has no nesting-issue:
> 
> SELECT a,b,c 
> FROM (
>   SELECT a,b,c, ROW_NUMBER() OVER(ORDER BY rand()) as my_rank
>   FROM name_table
> ) as x
> where x.my_rank <= 100;
> 
> Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample.
> 
> I'm not comparing the efficiency of the two solutions here, your
> solution could well be more efficient. 
> I'm only comparing them from a usability point of view.
> 
> I noticed that my own usage of the LIMIT clause is becoming less and
> less frequent (in favour of the syntax I report here) because of the
> nesting issue.
> What I am wondering is, how much useful is to introduce a new syntax for
> the sampling that has this (important, in my view) limitation? How often
> would you need a sample in the outermost select clause only?
> 
> Cheers,
> Roberto
> 
> 
> 
> _______________________________________________
> Checkin-list mailing list
> [email protected]
> http://mail.monetdb.org/mailman/listinfo/checkin-list


-- 
Sjoerd Mullender

Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
Checkin-list mailing list
[email protected]
http://mail.monetdb.org/mailman/listinfo/checkin-list

Reply via email to