It sounds good to me too, that we avoid introducing the concept of "system columns" for now.
Timo Walther <twal...@apache.org> 于2023年8月18日周五 22:38写道: > Great, I also like my last suggestion as it is even more elegant. I will > update the FLIP until Monday. > > Regards, > Timo > > On 17.08.23 13:55, Jark Wu wrote: > > Hi Timo, > > > > I'm fine with your latest suggestion that introducing a flag to control > > expanding behavior of metadata virtual columns, but not introducing > > any concept of system/pseudo columns for now. > > > > Best, > > Jark > > > > On Tue, 15 Aug 2023 at 23:25, Timo Walther <twal...@apache.org> wrote: > > > >> Hi everyone, > >> > >> I would like to bump this thread up again. > >> > >> Esp. I would like to hear opinions on my latest suggestions to simply > >> use METADATA VIRTUAL as system columns and only introduce a config > >> option for the SELECT * behavior. Implementation-wise this means minimal > >> effort and less new concepts. > >> > >> Looking forward to any kind of feedback. > >> > >> Thanks, > >> Timo > >> > >> On 07.08.23 12:07, Timo Walther wrote: > >>> Hi everyone, > >>> > >>> thanks for the various feedback and lively discussion. Sorry, for the > >>> late reply as I was on vacation. Let me answer to some of the topics: > >>> > >>> 1) Systems columns should not be shown with DESCRIBE statements > >>> > >>> This sounds fine to me. I will update the FLIP in the next iteration. > >>> > >>> 2) Do you know why most SQL systems do not need any prefix with their > >>> pseudo column? > >>> > >>> Because most systems do not have external catalogs or connectors. And > >>> also the number of system columns is limited to a handful of columns. > >>> Flink is more generic and thus more complex. And we have already the > >>> concepts of metadata columns. We need to be careful with not > overloading > >>> our language. > >>> > >>> 3) Implementation details > >>> > >>> > how to you plan to implement the "system columns", do we need to > add > >>> it to `RelNode` level? Or we just need to do it in the > >>> parsing/validating phase? > >>> > I'm not sure that Calcite's "system column" feature is fully ready > >>> > >>> My plan would be to only modify the parsing/validating phase. I would > >>> like to avoid additional complexity in planner rules and > >>> connector/catalog interfaces. Metadata columns already support > >>> projection push down and are passed through the stack (via Schema, > >>> ResolvedSchema, SupportsReadableMetadata). Calcite's "system column" > >>> feature is not fully ready yet and it would be a large effort > >>> potentially introducing bugs in supporting it. Thus, I'm proposing to > >>> leverage what we already have. The only part that needs to be modified > >>> is the "expand star" method in SqlValidator and Table API. > >>> > >>> Queries such as `SELECT * FROM (SELECT $rowtime, * FROM t);` would show > >>> $rowtime as the expand star has only a special case when in the scope > >>> for `FROM t`. All further subqueries treat it as a regular column. > >>> > >>> 4) Built-in defined pseudo-column "$rowtime" > >>> > >>> > Did you consider making it as a built-in defined pseudo-column > >>> "$rowtime" which returns the time attribute value (if exists) or null > >>> (if non-exists) for every table/query, and pseudo-column "$proctime" > >>> always returns PROCTIME() value for each table/query > >>> > >>> Built-in pseudo-columns mean that connector or catalog providers need > >>> consensus in Flink which pseudo-columns should be built-in. We should > >>> keep the concept generic and let platform providers decide which pseudo > >>> columns to expose. $rowtime might be obvious but others such as > >>> $partition or $offset are tricky to get consensus as every external > >>> connector works differently. Also a connector might want to expose > >>> different time semantics (such as ingestion time). > >>> > >>> 5) Any operator can introduce system (psedo) columns. > >>> > >>> This is clearly out of scope for this FLIP. The implementation effort > >>> would be huge and could introduce a lot of bugs. > >>> > >>> 6) "Metadata Key Prefix Constraint" which is still a little complex > >>> > >>> Another option could be to drop the naming pattern constraint. We could > >>> make it configurable that METADATA VIRTUAL columns are never selected > by > >>> default in SELECT * or visible in DESCRIBE. This would further simplify > >>> the FLIP and esp lower the impact on the planner and all interfaces. > >>> > >>> What do you think about this? We could introduce a flag: > >>> > >>> table.expand-metadata-columns (better name to be defined) > >>> > >>> This way we don't need to introduce the concept of system columns yet, > >>> but can still offer similar functionality with minimal overhead in the > >>> code base. > >>> > >>> Regards, > >>> Timo > >>> > >>> > >>> > >>> > >>> On 04.08.23 23:06, Alexey Leonov-Vendrovskiy wrote: > >>>> Looks like both kinds of system columns can converge. > >>>> We can say that any operator can introduce system (psedo) columns. > >>>> > >>>> cc Eugene who is also interested in the subject. > >>>> > >>>> On Wed, Aug 2, 2023 at 1:03 AM Paul Lam <paullin3...@gmail.com> > wrote: > >>>> > >>>>> Hi Timo, > >>>>> > >>>>> Thanks for starting the discussion! System columns are no doubt a > >>>>> good boost on Flink SQL’s usability, and I see the feedbacks are > >>>>> mainly concerns about the accessibility of system columns. > >>>>> > >>>>> I think most of the concerns could be solved by clarifying the > >>>>> ownership of the system columns. Different from databases like > >>>>> Oracle/BigQuery/PG who owns the data/metadata, Flink uses the > >>>>> data/metadata from external systems. That means Flink could > >>>>> have 2 kinds of system columns (take ROWID for example): > >>>>> > >>>>> 1. system columns provided by external systems via catalogs, such > >>>>> as ROWID from the original system. > >>>>> 2. system columns generated by Flink, such as ROWID generated by > >>>>> Flink itself. > >>>>> > >>>>> IIUC, the FLIP is proposing the 1st approach: the catalog defines > what > >>>>> system columns to provide, and Flink treats them as normal columns > >>>>> with a special naming pattern. > >>>>> > >>>>> On the other hand, Jark is proposing the 2nd one: the system columns > >>>>> are defined and owned by Flink, and can be inferred from external > >>>>> systems. Therefore, system columns should be predefined by Flink, > >>>>> and optionally implemented by the catalogs. > >>>>> > >>>>> Personally, I’m in favor of the 2nd approach, because it makes the > >>>>> system columns very accessible and more aligned across the catalogs. > >>>>> > >>>>> BTW, I second Alexey that systems columns should not be shown with > >>>>> DESCRIBE statements. > >>>>> > >>>>> WDYT? Thanks! > >>>>> > >>>>> Best, > >>>>> Paul Lam > >>>>> > >>>>>> 2023年7月31日 23:54,Jark Wu <imj...@gmail.com> 写道: > >>>>>> > >>>>>> Hi Timo, > >>>>>> > >>>>>> Thanks for your proposal. I think this is a nice feature for users > >>>>>> and I > >>>>>> prefer option 3. > >>>>>> > >>>>>> I only have one concern about the concept of pseudo-column or > >>>>>> system-column, > >>>>>> because this is the first time we introduce it in Flink SQL. The > >>>>>> confusion is similar to the > >>>>>> question of Benchao and Sergey about the propagation of > pseudo-column. > >>>>>> > >>>>>> From my understanding, a pseudo-column can be get from an > arbitrary > >>>>> query, > >>>>>> just similar to > >>>>>> ROWNUM in Oracle[1], such as : > >>>>>> > >>>>>> SELECT * > >>>>>> FROM (SELECT * FROM employees ORDER BY employee_id) > >>>>>> WHERE ROWNUM < 11; > >>>>>> > >>>>>> However, IIUC, the proposed "$rowtime" pseudo-column can only be got > >>>>>> from > >>>>>> the physical table > >>>>>> and can't be got from queries even if the query propagates the > rowtime > >>>>>> attribute. There was also > >>>>>> a discussion about adding a pseudo-column "_proctime" [2] to make > >>>>>> lookup > >>>>>> join easier to use > >>>>>> which can be got from arbitrary queries. That "_proctime" may > conflict > >>>>> with > >>>>>> the proposed > >>>>>> pseudo-column concept. > >>>>>> > >>>>>> Did you consider making it as a built-in defined pseudo-column > >>>>>> "$rowtime" > >>>>>> which returns the > >>>>>> time attribute value (if exists) or null (if non-exists) for every > >>>>>> table/query, and pseudo-column > >>>>>> "$proctime" always returns PROCTIME() value for each table/query. In > >>>>>> this > >>>>>> way, catalogs only need > >>>>>> to provide a default rowtime attribute and users can get it in the > >> same > >>>>>> way. And we don't need > >>>>>> to introduce the contract interface of "Metadata Key Prefix > >> Constraint" > >>>>>> which is still a little complex > >>>>>> for users and devs to understand. > >>>>>> > >>>>>> Best, > >>>>>> Jark > >>>>>> > >>>>>> [1]: > >>>>>> > >>>>> > >> > https://docs.oracle.com/cd/E11882_01/server.112/e41084/pseudocolumns009.htm#SQLRF00255 > >>>>>> [2]: > https://lists.apache.org/thread/7ln106qxyw8sp7ljq40hs2p1lb1gdwj5 > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Fri, 28 Jul 2023 at 06:18, Alexey Leonov-Vendrovskiy < > >>>>>> vendrov...@gmail.com> wrote: > >>>>>> > >>>>>>>> > >>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > >>>>>>>> Am I right that it will show `$rowtime` in output ? > >>>>>>> > >>>>>>> > >>>>>>> Yes, all explicitly selected columns become a part of the result > (and > >>>>>>> intermediate) schema, and hence propagate. > >>>>>>> > >>>>>>> On Thu, Jul 27, 2023 at 2:40 PM Alexey Leonov-Vendrovskiy < > >>>>>>> vendrov...@gmail.com> wrote: > >>>>>>> > >>>>>>>> Thank you, Timo, for starting this FLIP! > >>>>>>>> > >>>>>>>> I propose the following change: > >>>>>>>> > >>>>>>>> Remove the requirement that DESCRIBE need to show system columns. > >>>>>>>> > >>>>>>>> > >>>>>>>> Some concrete vendor specific catalog implementations might prefer > >>>>>>>> this > >>>>>>>> approach. > >>>>>>>> Usually the same system columns are available on all (or family) > of > >>>>>>>> tables, and it can be easily captured in the documentation. > >>>>>>>> > >>>>>>>> For example, BigQuery does exactly this: there, pseudo-columns do > >> not > >>>>>>> show > >>>>>>>> up in the table schema in any place, but can be accessed via > >>>>>>>> reference. > >>>>>>>> > >>>>>>>> So I propose we: > >>>>>>>> a) Either we say that DESCRIBE doesn't show system columns, > >>>>>>>> b) Or leave this vendor-specific / or configurable via flag (if > >>>>> needed). > >>>>>>>> > >>>>>>>> Regards, > >>>>>>>> Alexey > >>>>>>>> > >>>>>>>> On Thu, Jul 27, 2023 at 3:27 AM Sergey Nuyanzin < > >> snuyan...@gmail.com> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi Timo, > >>>>>>>>> > >>>>>>>>> Thanks for the FLIP. > >>>>>>>>> I also tend to think that Option 3 is better. > >>>>>>>>> > >>>>>>>>> I would be also interested in a question mentioned by Benchao Li. > >>>>>>>>> And a similar question about nested queries like > >>>>>>>>> `SELECT * FROM (SELECT $rowtime, * FROM t);` > >>>>>>>>> Am I right that it will show `$rowtime` in output ? > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Thu, Jul 27, 2023 at 6:58 AM Benchao Li <libenc...@apache.org > > > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Timo, > >>>>>>>>>> > >>>>>>>>>> Thanks for the FLIP, I also like the idea and option 3 sounds > >>>>>>>>>> good to > >>>>>>>>> me. > >>>>>>>>>> > >>>>>>>>>> I would like to discuss a case which is not mentioned in the > >>>>>>>>>> current > >>>>>>>>> FLIP. > >>>>>>>>>> How are the "System column"s expressed in intermediate result, > >> e.g. > >>>>>>>>> Join? > >>>>>>>>>> E.g. `SELECT * FROM t1 JOIN t2`, I guess it should not include > >>>>> "system > >>>>>>>>>> columns" from t1 and t2 as you proposed, and for `SELECT > >>>>>>>>>> t1.$rowtime, > >>>>>>> * > >>>>>>>>>> FROM t1 JOIN t2`, it should also be valid. > >>>>>>>>>> Then the question is how to you plan to implement the "system > >>>>>>> columns", > >>>>>>>>> do > >>>>>>>>>> we need to add it to `RelNode` level? Or we just need to do it > >>>>>>>>>> in the > >>>>>>>>>> parsing/validating phase? > >>>>>>>>>> I'm not sure that Calcite's "system column" feature is fully > ready > >>>>> for > >>>>>>>>> this > >>>>>>>>>> since the code about this part is imported from the earlier > >> project > >>>>>>>>> before > >>>>>>>>>> it gets into Apache, and has not been considered much in the > past > >>>>>>>>>> development. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Jing Ge <j...@ververica.com.invalid> 于2023年7月26日周三 00:01写 > >>>>>>>>>> 道: > >>>>>>>>>> > >>>>>>>>>>> Hi Timo, > >>>>>>>>>>> > >>>>>>>>>>> Thanks for your proposal. It is a very pragmatic feature. Among > >>>>>>>>>>> all > >>>>>>>>>> options > >>>>>>>>>>> in the FLIP, option 3 is one I prefer too and I'd like to ask > >> some > >>>>>>>>>>> questions to understand your thoughts. > >>>>>>>>>>> > >>>>>>>>>>> 1. I did some research on pseudo columns, just out of > >>>>>>>>>>> curiosity, do > >>>>>>>>> you > >>>>>>>>>>> know why most SQL systems do not need any prefix with their > >> pseudo > >>>>>>>>>> column? > >>>>>>>>>>> 2. Some platform providers will use ${variable_name} to define > >>>>>>>>>>> their > >>>>>>>>> own > >>>>>>>>>>> configurations and allow them to be embedded into SQL scripts. > >>>>>>>>>>> Will > >>>>>>>>> there > >>>>>>>>>>> be any conflict with option 3? > >>>>>>>>>>> > >>>>>>>>>>> Best regards, > >>>>>>>>>>> Jing > >>>>>>>>>>> > >>>>>>>>>>> On Tue, Jul 25, 2023 at 7:00 PM Konstantin Knauf > >>>>>>>>>>> <kna...@apache.org > >>>>>>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Timo, > >>>>>>>>>>>> > >>>>>>>>>>>> this makes sense to me. Option 3 seems reasonable, too. > >>>>>>>>>>>> > >>>>>>>>>>>> Cheers, > >>>>>>>>>>>> > >>>>>>>>>>>> Konstantin > >>>>>>>>>>>> > >>>>>>>>>>>> Am Di., 25. Juli 2023 um 12:53 Uhr schrieb Timo Walther < > >>>>>>>>>>>> twal...@apache.org > >>>>>>>>>>>>> : > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi everyone, > >>>>>>>>>>>>> > >>>>>>>>>>>>> I would like to start a discussion about introducing the > >> concept > >>>>>>>>> of > >>>>>>>>>>>>> "System Columns" in SQL and Table API. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The subject sounds bigger than it actually is. Luckily, Flink > >>>>>>> SQL > >>>>>>>>>>>>> already exposes the concept of metadata columns. And this > >>>>>>>>> proposal is > >>>>>>>>>>>>> just a slight adjustment for how metadata columns can be used > >> as > >>>>>>>>>> system > >>>>>>>>>>>>> columns. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The biggest problem of metadata columns currently is that a > >>>>>>>>> catalog > >>>>>>>>>>>>> implementation can't provide them by default because they > would > >>>>>>>>>> affect > >>>>>>>>>>>>> `SELECT *` when adding another one. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Looking forward to your feedback on FLIP-348: > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-348%3A+Support+System+Columns+in+SQL+and+Table+API > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>> Timo > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> https://twitter.com/snntrable > >>>>>>>>>>>> https://github.com/knaufk > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> > >>>>>>>>>> Best, > >>>>>>>>>> Benchao Li > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> -- > >>>>>>>>> Best regards, > >>>>>>>>> Sergey > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>> > >>> > >> > >> > > > > -- Best, Benchao Li