Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Benchao Li Sun, 20 Aug 2023 19:21:00 -0700

It sounds good to me too, that we avoid introducing the concept of "system
columns" for now.


Timo Walther <[email protected]> 于2023年8月18日周五 22:38写道：

> Great, I also like my last suggestion as it is even more elegant. I will
> update the FLIP until Monday.
>
> Regards,
> Timo
>
> On 17.08.23 13:55, Jark Wu wrote:
> > Hi Timo,
> >
> > I'm fine with your latest suggestion that introducing a flag to control
> > expanding behavior of metadata virtual columns, but not introducing
> > any concept of system/pseudo columns for now.
> >
> > Best,
> > Jark
> >
> > On Tue, 15 Aug 2023 at 23:25, Timo Walther <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> I would like to bump this thread up again.
> >>
> >> Esp. I would like to hear opinions on my latest suggestions to simply
> >> use METADATA VIRTUAL as system columns and only introduce a config
> >> option for the SELECT * behavior. Implementation-wise this means minimal
> >> effort and less new concepts.
> >>
> >> Looking forward to any kind of feedback.
> >>
> >> Thanks,
> >> Timo
> >>
> >> On 07.08.23 12:07, Timo Walther wrote:
> >>> Hi everyone,
> >>>
> >>> thanks for the various feedback and lively discussion. Sorry, for the
> >>> late reply as I was on vacation. Let me answer to some of the topics:
> >>>
> >>> 1) Systems columns should not be shown with DESCRIBE statements
> >>>
> >>> This sounds fine to me. I will update the FLIP in the next iteration.
> >>>
> >>> 2) Do you know why most SQL systems do not need any prefix with their
> >>> pseudo column?
> >>>
> >>> Because most systems do not have external catalogs or connectors. And
> >>> also the number of system columns is limited to a handful of columns.
> >>> Flink is more generic and thus more complex. And we have already the
> >>> concepts of metadata columns. We need to be careful with not
> overloading
> >>> our language.
> >>>
> >>> 3) Implementation details
> >>>
> >>>   > how to you plan to implement the "system columns", do we need to
> add
> >>> it to `RelNode` level? Or we just need to do it in the
> >>> parsing/validating phase?
> >>>   > I'm not sure that Calcite's "system column" feature is fully ready
> >>>
> >>> My plan would be to only modify the parsing/validating phase. I would
> >>> like to avoid additional complexity in planner rules and
> >>> connector/catalog interfaces. Metadata columns already support
> >>> projection push down and are passed through the stack (via Schema,
> >>> ResolvedSchema, SupportsReadableMetadata). Calcite's "system column"
> >>> feature is not fully ready yet and it would be a large effort
> >>> potentially introducing bugs in supporting it. Thus, I'm proposing to
> >>> leverage what we already have. The only part that needs to be modified
> >>> is the "expand star" method in SqlValidator and Table API.
> >>>
> >>> Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would show
> >>> $rowtime as the expand star has only a special case when in the scope
> >>> for `FROM t`. All further subqueries treat it as a regular column.
> >>>
> >>> 4) Built-in defined pseudo-column "$rowtime"
> >>>
> >>>   > Did you consider making it as a built-in defined pseudo-column
> >>> "$rowtime" which returns the time attribute value (if exists) or null
> >>> (if non-exists) for every table/query, and pseudo-column "$proctime"
> >>> always returns PROCTIME() value for each table/query
> >>>
> >>> Built-in pseudo-columns mean that connector or catalog providers need
> >>> consensus in Flink which pseudo-columns should be built-in. We should
> >>> keep the concept generic and let platform providers decide which pseudo
> >>> columns to expose. $rowtime might be obvious but others such as
> >>> $partition or $offset are tricky to get consensus as every external
> >>> connector works differently. Also a connector might want to expose
> >>> different time semantics (such as ingestion time).
> >>>
> >>> 5) Any operator can introduce system (psedo) columns.
> >>>
> >>> This is clearly out of scope for this FLIP. The implementation effort
> >>> would be huge and could introduce a lot of bugs.
> >>>
> >>> 6) "Metadata Key Prefix Constraint" which is still a little complex
> >>>
> >>> Another option could be to drop the naming pattern constraint. We could
> >>> make it configurable that METADATA VIRTUAL columns are never selected
> by
> >>> default in SELECT * or visible in DESCRIBE. This would further simplify
> >>> the FLIP and esp lower the impact on the planner and all interfaces.
> >>>
> >>> What do you think about this? We could introduce a flag:
> >>>
> >>> table.expand-metadata-columns (better name to be defined)
> >>>
> >>> This way we don't need to introduce the concept of system columns yet,
> >>> but can still offer similar functionality with minimal overhead in the
> >>> code base.
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>>
> >>>
> >>>
> >>> On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote:
> >>>> Looks like both kinds of system columns can converge.
> >>>> We can say that any operator can introduce system (psedo) columns.
> >>>>
> >>>> cc Eugene who is also interested in the subject.
> >>>>
> >>>> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <[email protected]>
> wrote:
> >>>>
> >>>>> Hi Timo,
> >>>>>
> >>>>> Thanks for starting the discussion! System columns are no doubt a
> >>>>> good boost on Flink SQL’s usability, and I see the feedbacks are
> >>>>> mainly concerns about the accessibility of system columns.
> >>>>>
> >>>>> I think most of the concerns could be solved by clarifying the
> >>>>> ownership of the system columns. Different from databases like
> >>>>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the
> >>>>> data/metadata from external systems. That means Flink could
> >>>>> have 2 kinds of system columns (take ROWID for example):
> >>>>>
> >>>>> 1. system columns provided by external systems via catalogs, such
> >>>>>       as ROWID from the original system.
> >>>>> 2. system columns generated by Flink, such as ROWID generated by
> >>>>>       Flink itself.
> >>>>>
> >>>>> IIUC, the FLIP is proposing the 1st approach: the catalog defines
> what
> >>>>> system columns to provide, and Flink treats them as normal columns
> >>>>> with a special naming pattern.
> >>>>>
> >>>>> On the other hand, Jark is proposing the 2nd one: the system columns
> >>>>> are defined and owned by Flink, and can be inferred from external
> >>>>> systems. Therefore, system columns should be predefined by Flink,
> >>>>> and optionally implemented by the catalogs.
> >>>>>
> >>>>> Personally, I’m in favor of the 2nd approach, because it makes the
> >>>>> system columns very accessible and more aligned across the catalogs.
> >>>>>
> >>>>> BTW, I second Alexey that systems columns should not be shown with
> >>>>> DESCRIBE statements.
> >>>>>
> >>>>> WDYT? Thanks!
> >>>>>
> >>>>> Best,
> >>>>> Paul Lam
> >>>>>
> >>>>>> 2023年7月31日 23:54，Jark Wu <[email protected]> 写道：
> >>>>>>
> >>>>>> Hi Timo,
> >>>>>>
> >>>>>> Thanks for your proposal. I think this is a nice feature for users
> >>>>>> and I
> >>>>>> prefer option 3.
> >>>>>>
> >>>>>> I only have one concern about the concept of pseudo-column or
> >>>>>> system-column,
> >>>>>> because this is the first time we introduce it in Flink SQL. The
> >>>>>> confusion is similar to the
> >>>>>> question of Benchao and Sergey about the propagation of
> pseudo-column.
> >>>>>>
> >>>>>>   From my understanding, a pseudo-column can be get from an
> arbitrary
> >>>>> query,
> >>>>>> just similar to
> >>>>>> ROWNUM in Oracle[1], such as :
> >>>>>>
> >>>>>> SELECT *
> >>>>>> FROM (SELECT * FROM employees ORDER BY employee_id)
> >>>>>> WHERE ROWNUM < 11;
> >>>>>>
> >>>>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be got
> >>>>>> from
> >>>>>> the physical table
> >>>>>> and can't be got from queries even if the query propagates the
> rowtime
> >>>>>> attribute. There was also
> >>>>>> a discussion about adding a pseudo-column "_proctime" [2] to make
> >>>>>> lookup
> >>>>>> join easier to use
> >>>>>> which can be got from arbitrary queries. That "_proctime" may
> conflict
> >>>>> with
> >>>>>> the proposed
> >>>>>> pseudo-column concept.
> >>>>>>
> >>>>>> Did you consider making it as a built-in defined pseudo-column
> >>>>>> "$rowtime"
> >>>>>> which returns the
> >>>>>> time attribute value (if exists) or null (if non-exists) for every
> >>>>>> table/query, and pseudo-column
> >>>>>> "$proctime" always returns PROCTIME() value for each table/query. In
> >>>>>> this
> >>>>>> way, catalogs only need
> >>>>>> to provide a default rowtime attribute and users can get it in the
> >> same
> >>>>>> way. And we don't need
> >>>>>> to introduce the contract interface of "Metadata Key Prefix
> >> Constraint"
> >>>>>> which is still a little complex
> >>>>>> for users and devs to understand.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>> [1]:
> >>>>>>
> >>>>>
> >>
> https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255
> >>>>>> [2]:
> https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy <
> >>>>>> [email protected]> wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> >>>>>>>> Am I right that it will show `$rowtime` in output ?
> >>>>>>>
> >>>>>>>
> >>>>>>> Yes, all explicitly selected columns become a part of the result
> (and
> >>>>>>> intermediate) schema, and hence propagate.
> >>>>>>>
> >>>>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy <
> >>>>>>> [email protected]> wrote:
> >>>>>>>
> >>>>>>>> Thank you, Timo, for starting this FLIP!
> >>>>>>>>
> >>>>>>>> I propose the following change:
> >>>>>>>>
> >>>>>>>> Remove the requirement that DESCRIBE need to show system columns.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Some concrete vendor specific catalog implementations might prefer
> >>>>>>>> this
> >>>>>>>> approach.
> >>>>>>>> Usually the same system columns are available on all (or family)
> of
> >>>>>>>> tables, and it can be easily captured in the documentation.
> >>>>>>>>
> >>>>>>>> For example, BigQuery does exactly this: there, pseudo-columns do
> >> not
> >>>>>>> show
> >>>>>>>> up in the table schema in any place, but can be accessed via
> >>>>>>>> reference.
> >>>>>>>>
> >>>>>>>> So I propose we:
> >>>>>>>> a) Either we say that DESCRIBE doesn't show system columns,
> >>>>>>>> b) Or leave this vendor-specific / or configurable via flag (if
> >>>>> needed).
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Alexey
> >>>>>>>>
> >>>>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin <
> >> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Timo,
> >>>>>>>>>
> >>>>>>>>> Thanks for the FLIP.
> >>>>>>>>> I also tend to think that Option 3 is better.
> >>>>>>>>>
> >>>>>>>>> I would be also interested in a question mentioned by Benchao Li.
> >>>>>>>>> And a similar question about nested queries like
> >>>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);`
> >>>>>>>>> Am I right that it will show `$rowtime` in output ?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <[email protected]
> >
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Timo,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the FLIP, I also like the idea and option 3 sounds
> >>>>>>>>>> good to
> >>>>>>>>> me.
> >>>>>>>>>>
> >>>>>>>>>> I would like to discuss a case which is not mentioned in the
> >>>>>>>>>> current
> >>>>>>>>> FLIP.
> >>>>>>>>>> How are the "System column"s expressed in intermediate result,
> >> e.g.
> >>>>>>>>> Join?
> >>>>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not include
> >>>>> "system
> >>>>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT
> >>>>>>>>>> t1.$rowtime,
> >>>>>>> *
> >>>>>>>>>> FROM t1 JOIN t2`, it should also be valid.
> >>>>>>>>>> Then the question is how to you plan to implement the "system
> >>>>>>> columns",
> >>>>>>>>> do
> >>>>>>>>>> we need to add it to `RelNode` level? Or we just need to do it
> >>>>>>>>>> in the
> >>>>>>>>>> parsing/validating phase?
> >>>>>>>>>> I'm not sure that Calcite's "system column" feature is fully
> ready
> >>>>> for
> >>>>>>>>> this
> >>>>>>>>>> since the code about this part is imported from the earlier
> >> project
> >>>>>>>>> before
> >>>>>>>>>> it gets into Apache, and has not been considered much in the
> past
> >>>>>>>>>> development.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Jing Ge <[email protected]> 于2023年7月26日周三 00:01写
> >>>>>>>>>> 道：
> >>>>>>>>>>
> >>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for your proposal. It is a very pragmatic feature. Among
> >>>>>>>>>>> all
> >>>>>>>>>> options
> >>>>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to ask
> >> some
> >>>>>>>>>>> questions to understand your thoughts.
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I did some research on pseudo columns, just out of
> >>>>>>>>>>> curiosity, do
> >>>>>>>>> you
> >>>>>>>>>>> know why most SQL systems do not need any prefix with their
> >> pseudo
> >>>>>>>>>> column?
> >>>>>>>>>>> 2. Some platform providers will use ${variable_name} to define
> >>>>>>>>>>> their
> >>>>>>>>> own
> >>>>>>>>>>> configurations and allow them to be embedded into SQL scripts.
> >>>>>>>>>>> Will
> >>>>>>>>> there
> >>>>>>>>>>> be any conflict with option 3?
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Jing
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf
> >>>>>>>>>>> <[email protected]
> >>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>
> >>>>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Konstantin
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther <
> >>>>>>>>>>>> [email protected]
> >>>>>>>>>>>>> :
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would like to start a discussion about introducing the
> >> concept
> >>>>>>>>> of
> >>>>>>>>>>>>> "System Columns" in SQL and Table API.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The subject sounds bigger than it actually is. Luckily, Flink
> >>>>>>> SQL
> >>>>>>>>>>>>> already exposes the concept of metadata columns. And this
> >>>>>>>>> proposal is
> >>>>>>>>>>>>> just a slight adjustment for how metadata columns can be used
> >> as
> >>>>>>>>>> system
> >>>>>>>>>>>>> columns.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The biggest problem of metadata columns currently is that a
> >>>>>>>>> catalog
> >>>>>>>>>>>>> implementation can't provide them by default because they
> would
> >>>>>>>>>> affect
> >>>>>>>>>>>>> `SELECT *` when adding another one.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Looking forward to your feedback on FLIP-348:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> https://twitter.com/snntrable
> >>>>>>>>>>>> https://github.com/knaufk
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Benchao Li
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best regards,
> >>>>>>>>> Sergey
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
>

-- 

Best,
Benchao Li

Re: [DISCUSS] FLIP-348: Support System Columns in SQL and Table API

Reply via email to