[
https://issues.apache.org/jira/browse/CALCITE-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jing Zhang resolved CALCITE-4865.
---------------------------------
Fix Version/s: 1.31.0
Resolution: Resolved
> Allow table functions to be polymorphic
> ---------------------------------------
>
> Key: CALCITE-4865
> URL: https://issues.apache.org/jira/browse/CALCITE-4865
> Project: Calcite
> Issue Type: Sub-task
> Components: core
> Reporter: Jing Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.31.0
>
> Attachments: image-2022-06-26-00-25-14-422.png
>
> Time Spent: 8h 10m
> Remaining Estimate: 0h
>
> h1. Background
> h2. What makes table function to be polymorphic
> A Polymorphic Table Function is introduced in SQL 2017 standard. It requires
> table function could take and return tables with row type not declared at
> design time.
> In fact, table functions in Calcite already supports this function more or
> less. For example [Tumble/Hop/Session Window table function
> |https://calcite.apache.org/docs/reference.html#table-functions] could take a
> table parameter which row type is not declared at design time, and generate a
> return table which row type is not declared, either. However, we hadn't
> thought about how to support this feature systematically in the CALCITE.
> Therefore, the current implementation misses some points and conflicts with
> the SQL standard, for example, for [Session Window table function
> |https://calcite.apache.org/docs/reference.html#table-functions]:
> {code:java}
> SELECT * FROM TABLE(
> SESSION(
> TABLE orders,
> DESCRIPTOR(rowtime),
> DESCRIPTOR(product),
> INTERVAL '20' MINUTE));
> {code}
>
> But, a desired way from SQL standard 2017 Polymorphic table functions is to
> use `PARTITION BY` clause to replace KEY DESCRIPTOR.
> {code:java}
> SELECT * FROM TABLE(
> SESSION(
> TABLE orders PARTITION BY product,
> DESCRIPTOR(rowtime),
> INTERVAL '20' MINUTE));
> {code}
> I propose to support polymorphic table function in a better way.
> h2. Row semantics and set semantics
> Table function could have generic table parameters. And input table
> parameters are classified by three characteristics:
> # Input tables have either row semantics or set semantics)
> ## Row semantics means that the the result of the function is determined
> only by looking at the current row
> ## Set semantics means that the result of the function can be determined by
> looking at the current row and some state “summarized” from previously
> processed rows. The table argument with set semantics can optionally be
> extended with either a {{PARTITION BY}} clause or an {{ORDER BY}} clause or
> both.
> # The second characteristic, which applies only to input tables with set
> semantics, is whether the function can generate a result row even if the
> input table is empty.
> # The third characteristic is whether the input table supports pass-through
> columns or not. Pass-through columns is a mechanism enabling the function to
> copy every column of an input row into columns of an output row.
> h2. Examples
> h3. Table function contains a table parameter with row semantics
> We often need to read a CSV file, generally, the first line of the file
> contains a list of column names, and subsequent lines of the file contain
> data. The data in general can be treated as a large VARCHAR. However, some of
> the fields may be numeric or datetime.
> A PTF named CSVreader is designed to read a file of comma-separated values
> and interpret this file as a table.
> CSVreader function has three parameters:
> # The first parameter, File, is the name of a file on the query author's
> system. This file must contain the comma-separated values that are to be
> converted to a table. The first line of the file contains the names of the
> resulting columns. Succeeding lines contain the data. Each line after the
> first will result in one row of output, with column names as determined by
> the first line of the input.
> # Floats is a PTF descriptor area, which should provide a list of the column
> names that are to be interpreted numerically. These columns will be output
> with the data type FLOAT.
> # Dates is a PTF descriptor area, which provides a list of the column names
> that are to be interpreted as datetimes. These columns will be output with
> the data type DATE.
> How to use this table function in query?
> For example, there is a csv file named abc.csv with the following contents:
> {code:java}
> docno,name,due_date,principle,interest
> 123,Mary,01/01/2014,234.56,345.67
> 234,Edgar,01/01/2014,654.32,543.21 {code}
>
> the query author may write a query such as the following:
> {code:java}
> SELECT *
> FROM TABLE (
> CSVreader (
> File => 'abc.csv',
> Floats => DESCRIPTOR ("principle", "interest")
> Dates => DESCRIPTOR ("due_date")
> )
> ) AS S {code}
>
> The result will be
> ||docno||name||due_date||principle||interest||
> |123|Mary|01/01/2014|234.56|345.67|
> |234|Edgar|01/01/2014|654.32|543.21|
>
> h3. Table function contains a table parameter with set semantics
> TopN takes an input table that has been sorted on a numeric column. It copies
> the first n rows through to the output table. Any additional rows are
> summarized in a single output row in which the sort column has been summed
> and all other columns are null.
> TopN function has two parameters:
> # The first parameter, Input, is the input table. This table has set
> semantics, meaning that the result depends on the set of data (since the last
> row is a summary row). The query author must order this input table on a
> single numeric column (syntax below).
> # The second parameter, Howmany, specifies how many input rows that the user
> wants to be copied into the output table; all rows after this will contribute
> to the final summary row in the output.
> How to use this function in query?
> For example, if the contents of source table orders are:
> ||region||product||sales||
> |East|A|1234.56|
> |East|B|987.65|
> |East|C|876.54|
> |East|D|765.43|
> |East|E|654.32|
> |West|E|2345.67|
> |West|D|2001.33|
> |West|C|1357.99|
> |West|B|975.35|
> |West|A|864,22 |
>
> the query author may write a query such as the following:
> {code:java}
> SELECT *
> FROM TABLE(
> topn(
> Input => TABLE orders PARTITION BY region ORDER BY sales desc,
> Howmany => 3
> )
> ){code}
>
> The result will be
> ||region||product||sales||
> |East|A|1234.56|
> |East|B|987.65|
> |East|C|876.54|
> |West|E|2345.67|
> |West|D|2001.33|
> |West|C|1357.99|
> h3.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)