Please redirect discussions to [email protected]. On 2011-09-26 10:41, Roberto Cornacchia wrote: > Lefteris, I would like to add something about the syntax (and semantics) > of this feature. > See below > > > +/* > + * @- Uniform Sampling. > + * > + * A new SQL operator has been added to support sampling the result > of a query. > + * The syntax for sampling is: > + * SELECT ... FROM ... WHERE ... SAMPLE s > + * > + * where s if is an integer greater than 1, it defines the number > of rows to be > + * in the sample. If s is a double between [0.0,1.0] the it refers > to the > + * percentage of the result to be sampled. That is if s=0.3 then > the sample > + * will be 30% the size of the query result. > + * > > > It has been discussed already whether this choice or an explicit PERCENT > is more intuitive. > Personally, I don't care much about that. > > > + * SAMPLE is been treated as LIMIT, ORDER BY, etc., that means that > it can only > + * be in the outer most SELECT clause, i.e., SAMPLE cannot appear in a > + * subquery. > > > This is much more interesting to me. > I see that SAMPLE fits in the same semantics as LIMIT. In fact, you can > think of it as a special LIMIT. > What I don't like about it, is that it then shares the same limitations > (which are absurd, as the whole SQL often is, from my point of view), > which is what you mention above about the subquery. > You suggest a workaround here: > > However, if this is needed, then one may define a function, for > + * example > + * > + * CREATE FUNCTION mysample () > + * RETURNS TABLE(col a,...) > + * BEGIN > + * RETURN > + * SELECT a,... > + * FROM name_table > + * SAMPLE 100; > + * end; > + * > + * and then use function mysample() for example to populate a new > table with > + * the sample. E.g., > + * > + * INSERT INTO sample_table (SELECT * FROM mysample()); > + * > + * > > > To me, writing a function with an hard-coded table name in it is... > weird! Like writing a function that can only do 2+2. > (This could be much better if table identifiers were allowed as function > parameters, a feature that I'd love to see implemented!). > > To this, I much prefer a solution based on the window functions RANK() > OVER() and ROW_NUMBER() OVER() which luckily are implemented in > MonetDB, and the function rand(). > > Your example above becomes a regular query that has no nesting-issue: > > SELECT a,b,c > FROM ( > SELECT a,b,c, ROW_NUMBER() OVER(ORDER BY rand()) as my_rank > FROM name_table > ) as x > where x.my_rank <= 100; > > Replace ROW_NUMBER() with RANK() to allow duplicates in the random sample. > > I'm not comparing the efficiency of the two solutions here, your > solution could well be more efficient. > I'm only comparing them from a usability point of view. > > I noticed that my own usage of the LIMIT clause is becoming less and > less frequent (in favour of the syntax I report here) because of the > nesting issue. > What I am wondering is, how much useful is to introduce a new syntax for > the sampling that has this (important, in my view) limitation? How often > would you need a sample in the outermost select clause only? > > Cheers, > Roberto > > > > _______________________________________________ > Checkin-list mailing list > [email protected] > http://mail.monetdb.org/mailman/listinfo/checkin-list
-- Sjoerd Mullender
signature.asc
Description: OpenPGP digital signature
_______________________________________________ Checkin-list mailing list [email protected] http://mail.monetdb.org/mailman/listinfo/checkin-list
