Rohan Garg created CALCITE-5084:
-----------------------------------
Summary: Support ROWS syntax with TABLESAMPLE
Key: CALCITE-5084
URL: https://issues.apache.org/jira/browse/CALCITE-5084
Project: Calcite
Issue Type: Task
Reporter: Rohan Garg
Currently, Calcite supports a TABLESAMPLE syntax that lets users sample the
data being processed. It takes two main parameters:
1. sampling algorithm (BERNOULLI or SYSTEM)
2. sampling percentage (a value between 0 and 100 to indicate rate of sampling)
While a percentage is generally good, it is hard to choose a sensible value
when the user does not know the row count. In the case of subqueries (assuming
that the underlying system handles TABLESAMPLE with subqueries), it becomes
even more difficult to estimate the correct percentage value.
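To illustrate the point above, here is a minimal Python sketch that contrasts
percentage-based sampling with a fixed row count. It is not Calcite code and
does not reflect how any engine implements TABLESAMPLE; the function names
sample_percent and sample_rows are hypothetical, and reservoir sampling is
used only as one possible way to pick n rows without knowing the input size.

```python
import random


def sample_percent(rows, percent, rng):
    """Bernoulli-style sampling: keep each row independently
    with probability percent / 100."""
    return [r for r in rows if rng.random() * 100 < percent]


def sample_rows(rows, n, rng):
    """Fixed-size sampling via reservoir sampling (Algorithm R).
    Returns at most n rows and never needs the total row count."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < n:
            reservoir.append(row)
        else:
            # randint is inclusive on both ends
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir


rng = random.Random(0)
small = range(1_000)     # stand-in for a small table or subquery result
large = range(100_000)   # stand-in for a large one

# The same percentage yields result sizes that scale with the input.
by_percent_small = sample_percent(small, 1, rng)
by_percent_large = sample_percent(large, 1, rng)
print(len(by_percent_small), len(by_percent_large))

# The same row target yields the same result size regardless of input.
by_rows_small = sample_rows(small, 100, rng)
by_rows_large = sample_rows(large, 100, rng)
print(len(by_rows_small), len(by_rows_large))  # 100 100
```

With a percentage, the user has to know the input size up front to hit a
target number of rows; with a row count, the target is stated directly.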
The 'n ROWS' syntax is most likely not part of the SQL standard, and hence was
not included in the default Calcite grammar. However, a few systems have
implemented it in their dialects:
1. MS SQL Server:
[https://docs.microsoft.com/en-us/sql/t-sql/queries/from-transact-sql?view=sql-server-ver15#tablesample-clause]
2. Snowflake:
[https://docs.snowflake.com/en/sql-reference/constructs/sample.html]
3. Google Spanner:
[https://cloud.google.com/spanner/docs/reference/standard-sql/query-syntax#tablesample_operator]
4. Apache Spark:
[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]
So, it would be a useful addition to Calcite.
Derived from https://issues.apache.org/jira/browse/CALCITE-5074
--
This message was sent by Atlassian Jira
(v8.20.1#820001)