[ 
https://issues.apache.org/jira/browse/CALCITE-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhang resolved CALCITE-4865.
---------------------------------
    Fix Version/s: 1.31.0
       Resolution: Resolved

> Allow table functions to be polymorphic
> ---------------------------------------
>
>                 Key: CALCITE-4865
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4865
>             Project: Calcite
>          Issue Type: Sub-task
>          Components: core
>            Reporter: Jing Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.31.0
>
>         Attachments: image-2022-06-26-00-25-14-422.png
>
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> h1. Background
> h2. What makes table function to be polymorphic
> A Polymorphic Table Function is introduced in SQL 2017 standard. It requires 
> table function could take and return tables with row type not declared at 
> design time.
> In fact, table functions in Calcite already supports this function more or 
> less. For example [Tumble/Hop/Session Window table function 
> |https://calcite.apache.org/docs/reference.html#table-functions] could take a 
> table parameter which row type is not declared at design time, and generate a 
> return table which row type is not declared, either. However, we hadn't 
> thought about how to support this feature systematically in the CALCITE.
> Therefore, the current implementation misses some points and conflicts with 
> the SQL standard, for example, for [Session Window table function 
> |https://calcite.apache.org/docs/reference.html#table-functions]:
> {code:java}
> SELECT * FROM TABLE(
>   SESSION(
>     TABLE orders,
>     DESCRIPTOR(rowtime),
>     DESCRIPTOR(product),
>     INTERVAL '20' MINUTE));
> {code}
>  
> But, a desired way from SQL standard 2017 Polymorphic table functions is to 
> use `PARTITION BY` clause to replace KEY DESCRIPTOR.
> {code:java}
> SELECT * FROM TABLE(
>   SESSION(
>     TABLE orders PARTITION BY product,
>     DESCRIPTOR(rowtime),
>     INTERVAL '20' MINUTE));
> {code}
> I propose to support polymorphic table function in a better way.
> h2. Row semantics and set semantics
> Table function could have generic table parameters. And input table 
> parameters are classified by three characteristics:
>  # Input tables have either row semantics or set semantics)
>  ## Row semantics means that the the result of the function is determined 
> only by looking at the current row
>  ## Set semantics means that the result of the function can be determined by 
> looking at the current row and some state “summarized” from previously 
> processed rows.  The table argument with set semantics can optionally be 
> extended with either a {{PARTITION BY}} clause or an {{ORDER BY}} clause or 
> both.
>  # The second characteristic, which applies only to input tables with set 
> semantics, is whether the function can generate a result row even if the 
> input table is empty.
>  # The third characteristic is whether the input table supports pass-through 
> columns or not. Pass-through columns is a mechanism enabling the function to 
> copy every column of an input row into columns of an output row.
> h2. Examples 
> h3. Table function contains a table parameter with row semantics
> We often need to read a CSV file, generally, the first line of the file 
> contains a list of column names, and subsequent lines of the file contain 
> data. The data in general can be treated as a large VARCHAR. However, some of 
> the fields may be numeric or datetime.
> A PTF named CSVreader is designed to read a file of comma-separated values 
> and interpret this file as a table.
> CSVreader function has three parameters:
>  # The first parameter, File, is the name of a file on the query author's 
> system. This file must contain the comma-separated values that are to be 
> converted to a table. The first line of the file contains the names of the 
> resulting columns. Succeeding lines contain the data. Each line after the 
> first will result in one row of output, with column names as determined by 
> the first line of the input.
>  # Floats is a PTF descriptor area, which should provide a list of the column 
> names that are to be interpreted numerically. These columns will be output 
> with the data type FLOAT.
>  # Dates is a PTF descriptor area, which provides a list of the column names 
> that are to be interpreted as datetimes. These columns will be output with 
> the data type DATE.
> How to use this table function in query?
> For example, there is a csv file named abc.csv with the following contents:
> {code:java}
> docno,name,due_date,principle,interest
> 123,Mary,01/01/2014,234.56,345.67
> 234,Edgar,01/01/2014,654.32,543.21 {code}
>  
> the query author may write a query such as the following:
> {code:java}
> SELECT *
> FROM TABLE ( 
>   CSVreader ( 
>     File => 'abc.csv',
>     Floats => DESCRIPTOR ("principle", "interest")
>     Dates => DESCRIPTOR ("due_date")
>   ) 
> ) AS S {code}
>  
> The result will be
> ||docno||name||due_date||principle||interest||
> |123|Mary|01/01/2014|234.56|345.67|
> |234|Edgar|01/01/2014|654.32|543.21|
>  
> h3. Table function contains a table parameter with set semantics
> TopN takes an input table that has been sorted on a numeric column. It copies 
> the first n rows through to the output table. Any additional rows are 
> summarized in a single output row in which the sort column has been summed 
> and all other columns are null.
> TopN function has two parameters:
>  # The first parameter, Input, is the input table. This table has set 
> semantics, meaning that the result depends on the set of data (since the last 
> row is a summary row). The query author must order this input table on a 
> single numeric column (syntax below).
>  # The second parameter, Howmany, specifies how many input rows that the user 
> wants to be copied into the output table; all rows after this will contribute 
> to the final summary row in the output.
> How to use this function in query?
> For example, if the contents of source table orders are:
> ||region||product||sales||
> |East|A|1234.56|
> |East|B|987.65|
> |East|C|876.54|
> |East|D|765.43|
> |East|E|654.32|
> |West|E|2345.67|
> |West|D|2001.33|
> |West|C|1357.99|
> |West|B|975.35|
> |West|A|864,22 |
>  
> the query author may write a query such as the following:
> {code:java}
> SELECT *
> FROM TABLE(
>   topn(
>     Input => TABLE orders PARTITION BY region ORDER BY sales desc,
>     Howmany => 3
>   )
> ){code}
>  
> The result will be
> ||region||product||sales||
> |East|A|1234.56|
> |East|B|987.65|
> |East|C|876.54|
> |West|E|2345.67|
> |West|D|2001.33|
> |West|C|1357.99|
> h3.  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to