The “system columns” feature was working when the Calcite code base was used in 
a previous project several years ago, though it may have atrophied since.

In that project, we actually had ROWTIME as a system column. It was partially 
successful, but raised some problems. Consider a streaming join. The source 
relations have ROWTIME columns, but the join would create an additional ROWTIME 
column.

(1) Should the original ROWTIME columns just disappear? It wasn’t clear.

(2) What if you wanted to assign your own ROWTIME from an expression. Is it 
sufficient to just write ‘expression AS ROWTIME’ in the SELECT clause? This a 
more profound operation than simply renaming a column. Being a system column 
implies certain data type, not null, and maybe ordering. Would you also have to 
apply an ORDER BY?

(3) Having system columns in streams conflicts with the goal of using the same 
SQL for streaming and historic (regular SQL) queries.

(4) We found ourselves relying on these columns under the covers in certain 
operators. The system had become non-relational.

So, after that experience I concluded that ROWTIME should be just a regular 
column. Just a column that we know a lot about: it is a timestamp (usually; 
although we allow other data types); it is conventionally called ROWTIME but it 
doesn’t have to be; it is not null; it is usually a sort column (but it may not 
be - the stream might be k-sorted, e.g. within 10 rows or 10 minutes of being 
sorted, or sorted within a particular warehouseId; or “sortable").

The “sortable” concept is really powerful. Consider the Orders stream and the 
Orders table. The stream is infinite, so the only sortable columns are those 
that are already sorted or are within N rows or T seconds of being sorted, or 
have some kind of guarantee in terms of watermarks. The table is finite, so 
everything is sortable. If we want to do an operation such as “GROUP BY 
FLOOR(ROWTIME TO HOUR)” it is sufficient that ROWTIME is sortable. You do not 
need ROWTIME to be a system column.

My advice is, rather than requiring “blessed” system columns, you have a 
convention for the names of event-time and processing-time columns, and make 
your operators consume and produce on those columns explicitly.

Julian



> On Feb 15, 2017, at 12:56 AM, Timo Walther <[email protected]> wrote:
> 
> Hi everyone,
> 
> we (from Flink) are currently discussing how we can express time-semantics 
> (event-time or processing-time) in a SQL query. The optimal solution would be 
> to have two system attributes that are part of every table schema/every row 
> data type. We could then access it like `SELECT * FROM MyTable ORDER BY 
> rowtime`. However, it should not be part of the result in an expansion (`*`) 
> and the user should not modify those attributes (no aliasing, read-only). I 
> had a look into SqlValidator and there are several lines that contain things 
> like `includeSystemVars` or `isSystemField` but nothing concrete. Am I right 
> that this feature is not entirely implemented yet? Which parts would you 
> touch/override to implement this feature?
> 
> Thanks in advance.
> 
> Regards,
> Timo
> 
> 
> 

Reply via email to