Re: [DISCUSSION] FLIP-457: Improve Table/SQL Configuration for Flink 2.0

2024-05-19 Thread Ron Liu
Hi, Lincoln

> 2. Regarding the options in HashAggCodeGenerator, since this new feature has gone
> through a couple of release cycles and could be considered for PublicEvolving now,
> cc @Ron Liu   WDYT?

Thanks for cc'ing me. +1 for making these options public now.
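
As a rough illustration of what promoting such an option involves (the option key and
class below are placeholders, not the actual HashAggCodeGenerator options), the
stability annotation on the ConfigOption field changes from Experimental to
PublicEvolving while the key and default stay the same:

import org.apache.flink.annotation.PublicEvolving;
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class ExampleAggPublicOptions {

    // Previously annotated @Experimental; promoting it only changes the
    // stability guarantee, not the key or the default value.
    @PublicEvolving
    public static final ConfigOption<Boolean> EXAMPLE_ADAPTIVE_LOCAL_HASH_AGG =
            ConfigOptions.key("table.exec.example.adaptive-local-hash-agg.enabled")
                    .booleanType()
                    .defaultValue(true)
                    .withDescription("Placeholder option used only for illustration.");
}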

Best,
Ron

Benchao Li  于2024年5月20日周一 13:08写道:

> I agree with Lincoln about the experimental features.
>
> Some of these configurations do not even have a proper implementation.
> Take 'table.exec.range-sort.enabled' as an example; there was a
> discussion[1] about it before.
>
> [1] https://lists.apache.org/thread/q5h3obx36pf9po28r0jzmwnmvtyjmwdr
>
> Lincoln Lee  于2024年5月20日周一 12:01写道:
> >
> > Hi Jane,
> >
> > Thanks for the proposal!
> >
> > +1 for the changes except for the ones annotated as experimental.
> >
> > For the options annotated as experimental,
> >
> > +1 for the moving of IncrementalAggregateRule & RelNodeBlock.
> >
> > For the rest of the options, there are some suggestions:
> >
> > 1. For the batch-related parameters, it's recommended to either delete
> > them (leaving the necessary default values in place) or leave them as
> > they are. Including:
> > FlinkRelMdRowCount
> > FlinkRexUtil
> > BatchPhysicalSortRule
> > JoinDeriveNullFilterRule
> > BatchPhysicalJoinRuleBase
> > BatchPhysicalSortMergeJoinRule
> >
> > What I understand about the history of these options is that they were
> > once used for fine-tuning TPC testing, and the current Flink planner no
> > longer relies on these internal options when testing TPC[1]. In addition,
> > these options are too obscure for SQL users, and some of them are
> > actually magic numbers.
> >
> > 2. Regarding the options in HashAggCodeGenerator, since this new feature
> > has gone
> > through a couple of release cycles and could be considered for
> > PublicEvolving now,
> > cc @Ron Liu   WDYT?
> >
> > 3. Regarding WindowEmitStrategy, IIUC it is currently unsupported on TVF
> > window, so
> > it's recommended to keep it untouched for now and follow up in
> > FLINK-29692[2]. cc @Xuyang 
> >
> > [1]
> >
> https://github.com/ververica/flink-sql-benchmark/blob/master/tools/common/flink-conf.yaml
> > [2] https://issues.apache.org/jira/browse/FLINK-29692
> >
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Yubin Li  于2024年5月17日周五 10:49写道:
> >
> > > Hi Jane,
> > >
> > > Thanks for driving this proposal!
> > >
> > > This makes sense for users, +1 for that.
> > >
> > > Best,
> > > Yubin
> > >
> > > On Thu, May 16, 2024 at 3:17 PM Jark Wu  wrote:
> > > >
> > > > Hi Jane,
> > > >
> > > > Thanks for the proposal. +1 from my side.
> > > >
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Thu, 16 May 2024 at 10:28, Xuannan Su 
> wrote:
> > > >
> > > > > Hi Jane,
> > > > >
> > > > > Thanks for driving this effort! And +1 for the proposed changes.
> > > > >
> > > > > I have one comment on the migration plan.
> > > > >
> > > > > For options to be moved to another module/package, I think we have to
> > > > > mark the old option deprecated in 1.20 for it to be removed in 2.0,
> > > > > according to the API compatibility guarantees[1]. We can introduce the
> > > > > new option in 1.20 with the same option key in the intended class.
> > > > > WDYT?
> > > > >
> > > > > Best,
> > > > > Xuannan
> > > > >
> > > > > [1]
> > > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/upgrading/#api-compatibility-guarantees
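
A minimal sketch of that migration pattern; the class and option names below are
illustrative placeholders, not actual FLIP-457 options. The new class holds the option
under the same key, and the old field is kept for one release but marked deprecated.

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class NewHomeOptions {

    // New location of the option; the key stays the same, so existing
    // configurations keep working after the move.
    public static final ConfigOption<Boolean> EXAMPLE_OPTION =
            ConfigOptions.key("table.exec.example.enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Illustrative option moved to its intended class.");
}

class OldHomeOptions {

    /** Old location, kept and deprecated in 1.20 so it can be removed in 2.0. */
    @Deprecated
    public static final ConfigOption<Boolean> EXAMPLE_OPTION = NewHomeOptions.EXAMPLE_OPTION;
}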
> > > > >
> > > > >
> > > > >
> > > > > On Wed, May 15, 2024 at 6:20 PM Jane Chan 
> > > wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'd like to start a discussion on FLIP-457: Improve Table/SQL
> > > > > Configuration
> > > > > > for Flink 2.0 [1]. This FLIP revisited all Table/SQL
> configurations
> > > to
> > > > > > improve user-friendliness and maintainability as Flink moves
> toward
> > > 2.0.
> > > > > >
> > > > > > I am looking forward to your feedback.
> > > > > >
> > > > > > Best regards,
> > > > > > Jane
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=307136992
> > > > >
> > >
>
>
>
> --
>
> Best,
> Benchao Li
>


[RESULT][VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-13 Thread Ron Liu
Hi, Dev

I'm happy to announce that FLIP-448: Introduce Pluggable Workflow Scheduler
Interface for Materialized Table[1] has been accepted with 8 approving
votes (4 binding) [2].

- Xuyang
- Feng Jin
- Lincoln Lee(binding)
- Jark Wu(binding)
- Ron Liu(binding)
- Shengkai Fang(binding)
- Keith Lee
- Ahmed Hamdy

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
[2] https://lists.apache.org/thread/8qvh3brgvo46xprv4mxq4kyhyy0tsvny

Best,
Ron


Re: Re: [VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-09 Thread Ron Liu
+1(binding)

Best,
Ron

Jark Wu  于2024年5月10日周五 09:51写道:

> +1 (binding)
>
> Best,
> Jark
>
> On Thu, 9 May 2024 at 21:27, Lincoln Lee  wrote:
>
> > +1 (binding)
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Feng Jin  于2024年5月9日周四 19:45写道:
> >
> > > +1 (non-binding)
> > >
> > >
> > > Best,
> > > Feng
> > >
> > >
> > > On Thu, May 9, 2024 at 7:37 PM Xuyang  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > >
> > > > --
> > > >
> > > > Best!
> > > > Xuyang
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At 2024-05-09 13:57:07, "Ron Liu"  wrote:
> > > > >Sorry for the re-post, just to format this email content.
> > > > >
> > > > >Hi Dev
> > > > >
> > > > >Thank you to everyone for the feedback on FLIP-448: Introduce
> > Pluggable
> > > > >Workflow Scheduler Interface for Materialized Table[1][2].
> > > > >I'd like to start a vote for it. The vote will be open for at least
> 72
> > > > >hours unless there is an objection or not enough votes.
> > > > >
> > > > >[1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > > >
> > > > >[2]
> https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > > >
> > > > >Best,
> > > > >Ron
> > > > >
> > > > >Ron Liu  于2024年5月9日周四 13:52写道:
> > > > >
> > > > >> Hi Dev, Thank you to everyone for the feedback on FLIP-448:
> > Introduce
> > > > >> Pluggable Workflow Scheduler Interface for Materialized
> Table[1][2].
> > > I'd
> > > > >> like to start a vote for it. The vote will be open for at least 72
> > > hours
> > > > >> unless there is an objection or not enough votes. [1]
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > > >>
> > > > >> [2]
> > https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> > > > >> Best, Ron
> > > > >>
> > > >
> > >
> >
>


Re: [VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Sorry for the re-post, just to format this email content.

Hi Dev

Thank you to everyone for the feedback on FLIP-448: Introduce Pluggable
Workflow Scheduler Interface for Materialized Table[1][2].
I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

[2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1

Best,
Ron

Ron Liu  于2024年5月9日周四 13:52写道:

> Hi Dev, Thank you to everyone for the feedback on FLIP-448: Introduce
> Pluggable Workflow Scheduler Interface for Materialized Table[1][2]. I'd
> like to start a vote for it. The vote will be open for at least 72 hours
> unless there is an objection or not enough votes. [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
>
> [2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
> Best, Ron
>


[VOTE] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Hi Dev, Thank you to everyone for the feedback on FLIP-448: Introduce
Pluggable Workflow Scheduler Interface for Materialized Table[1][2]. I'd
like to start a vote for it. The vote will be open for at least 72 hours
unless there is an objection or not enough votes. [1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

[2] https://lists.apache.org/thread/57xfo6p25rbrhcg01dhyok46zt6jc5q1
Best, Ron


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-08 Thread Ron Liu
Hi, Dev

Thank you all for joining this thread and giving your comments and
suggestions; they have helped improve this proposal, and I look forward to
further feedback.
If there are no further comments, I'd like to close the discussion and
start the vote one day later.

Best,
Ron

Ron Liu  于2024年5月7日周二 20:51写道:

> Hi, dev
>
> Following the recent PoC[1], and drawing on the excellent code design
> within Flink, I have made the following optimizations to the Public
> Interfaces section of FLIP:
>
> 1. I have renamed WorkflowOperation to RefreshWorkflow. This change better
> conveys its purpose. RefreshWorkflow is used to provide the necessary
> information required for creating, modifying, and deleting workflows. Using
> WorkflowOperation could mislead people into thinking it is a command
> operation, whereas in fact, it does not represent an operation but merely
> provides the essential context information for performing operations on
> workflows. The specific operations are completed within WorkflowScheduler.
> Additionally, I felt that using WorkflowOperation could potentially
> conflict with the Operation[2] interface in the Table API.
> 2. I have refined the signatures of the modifyRefreshWorkflow and
> deleteRefreshWorkflow interface methods in WorkflowScheduler. The parameter
> T refreshHandler is now provided by ModifyRefreshWorkflow and
> DeleteRefreshWorkflow, which makes the overall interface design more
> symmetrical and clean.
>
> [1] https://github.com/lsyldliu/flink/tree/FLIP-448-PoC
> [2]
> https://github.com/apache/flink/blob/29736b8c01924b7da03d4bcbfd9c812a8e5a08b4/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/operations/Operation.java
>
> Best,
> Ron
>
> Ron Liu  于2024年5月7日周二 14:30写道:
>
>> > 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>> After discussing with Xuyang offline, we need to support periodic
>> workflow and one-time workflow, they need different information, for
>> example, periodic workflow needs cron expression, one-time workflow needs
>> refresh partition, downstream cascade materialized table, etc. Therefore,
>> CreateWorkflowOperation correspondingly will have two different
>> implementation classes, which will be cleaner for both the implementer and
>> the caller.
>>
>> Best,
>> Ron
>>
>> Ron Liu  于2024年5月6日周一 20:48写道:
>>
>>> Hi, Xuyang
>>>
>>> Thanks for joining this discussion
>>>
>>> > 1. In the sequence diagram, it appears that there is a missing step
>>> for obtaining the refresh handler from the catalog during the suspend
>>> operation.
>>>
>>> Good catch
>>>
>>> > 2. The term "cascade refresh" does not seem to be mentioned in
>>> FLIP-435. The workflow it creates is marked as a "one-time workflow". This
>>> is different
>>>
>>> from a "periodic workflow," and it appears to be a one-off execution. Is
>>> this actually referring to the Refresh command in FLIP-435?
>>>
>>> The cascade refresh is a future work, we don't propose the corresponding
>>> syntax in FLIP-435. However, intuitively, it would be an extension of the
>>> Refresh command in FLIP-435.
>>>
>>> > 3. The workflow-scheduler.type has no default value; should it be set
>>> to CRON by default?
>>>
>>> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
>>> configuring the Scheduler should be an action that users are aware of, and
>>> default values should not be set.
>>>
>>> > 4. It appears that in the section on `public interfaces`, within
>>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>>
>>> `CreateWorkflowOperation`, right?
>>>
>>> Sorry, I don't get your point. Can you give more description?
>>>
>>> Best,
>>> Ron
>>>
>>> Xuyang  于2024年5月6日周一 20:26写道:
>>>
>>>> Hi, Ron.
>>>>
>>>> Thanks for driving this. After reading the entire flip, I have the
>>>> following questions:
>>>>
>>>>
>>>>
>>>>
>>>> 1. In the sequence diagram, it appears that there is a missing step for
>>>> obtaining the refresh handler from the catalog during the suspend 
>>>> operation.
>>>>
>>>>
>>>>
>>>>
>>>> 2. The term "cascade refresh

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-07 Thread Ron Liu
Hi, dev

Following the recent PoC[1], and drawing on the excellent code design
within Flink, I have made the following optimizations to the Public
Interfaces section of FLIP:

1. I have renamed WorkflowOperation to RefreshWorkflow. This change better
conveys its purpose. RefreshWorkflow is used to provide the necessary
information required for creating, modifying, and deleting workflows. Using
WorkflowOperation could mislead people into thinking it is a command
operation, whereas in fact, it does not represent an operation but merely
provides the essential context information for performing operations on
workflows. The specific operations are completed within WorkflowScheduler.
Additionally, I felt that using WorkflowOperation could potentially
conflict with the Operation[2] interface in the Table API.
2. I have refined the signatures of the modifyRefreshWorkflow and
deleteRefreshWorkflow interface methods in WorkflowScheduler. The parameter
T refreshHandler is now provided by ModifyRefreshWorkflow and
DeleteRefreshWorkflow, which makes the overall interface design more
symmetrical and clean.

[1] https://github.com/lsyldliu/flink/tree/FLIP-448-PoC
[2]
https://github.com/apache/flink/blob/29736b8c01924b7da03d4bcbfd9c812a8e5a08b4/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/operations/Operation.java
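
To make the shape of the reworked interfaces easier to follow, here is a rough,
self-contained sketch; the names and generics are paraphrased from this discussion,
and the authoritative definitions live in FLIP-448.

public class WorkflowSchedulerSketch {

    /** Opaque handle returned by a concrete scheduler, e.g. a workflow id. */
    interface RefreshHandler {}

    /** Carries the context needed to create/modify/delete a workflow; not an executable command. */
    interface RefreshWorkflow {}

    interface CreateRefreshWorkflow extends RefreshWorkflow {}

    /** Modify/delete variants carry the refresh handler of the workflow they target. */
    interface ModifyRefreshWorkflow<T extends RefreshHandler> extends RefreshWorkflow {
        T getRefreshHandler();
    }

    interface DeleteRefreshWorkflow<T extends RefreshHandler> extends RefreshWorkflow {
        T getRefreshHandler();
    }

    static class WorkflowException extends Exception {
        WorkflowException(String message) {
            super(message);
        }
    }

    /** The scheduler performs the actual operations based on the workflow context. */
    interface WorkflowScheduler<T extends RefreshHandler> {

        T createRefreshWorkflow(CreateRefreshWorkflow workflow) throws WorkflowException;

        void modifyRefreshWorkflow(ModifyRefreshWorkflow<T> workflow) throws WorkflowException;

        void deleteRefreshWorkflow(DeleteRefreshWorkflow<T> workflow) throws WorkflowException;
    }
}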

Best,
Ron

Ron Liu  于2024年5月7日周二 14:30写道:

> > 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
> After discussing with Xuyang offline, we need to support periodic workflow
> and one-time workflow, they need different information, for example,
> periodic workflow needs cron expression, one-time workflow needs refresh
> partition, downstream cascade materialized table, etc. Therefore,
> CreateWorkflowOperation correspondingly will have two different
> implementation classes, which will be cleaner for both the implementer and
> the caller.
>
> Best,
> Ron
>
> Ron Liu  于2024年5月6日周一 20:48写道:
>
>> Hi, Xuyang
>>
>> Thanks for joining this discussion
>>
>> > 1. In the sequence diagram, it appears that there is a missing step for
>> obtaining the refresh handler from the catalog during the suspend operation.
>>
>> Good catch
>>
>> > 2. The term "cascade refresh" does not seem to be mentioned in
>> FLIP-435. The workflow it creates is marked as a "one-time workflow". This
>> is different
>>
>> from a "periodic workflow," and it appears to be a one-off execution. Is
>> this actually referring to the Refresh command in FLIP-435?
>>
>> The cascade refresh is a future work, we don't propose the corresponding
>> syntax in FLIP-435. However, intuitively, it would be an extension of the
>> Refresh command in FLIP-435.
>>
>> > 3. The workflow-scheduler.type has no default value; should it be set
>> to CRON by default?
>>
>> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
>> configuring the Scheduler should be an action that users are aware of, and
>> default values should not be set.
>>
>> > 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>> Sorry, I don't get your point. Can you give more description?
>>
>> Best,
>> Ron
>>
>> Xuyang  于2024年5月6日周一 20:26写道:
>>
>>> Hi, Ron.
>>>
>>> Thanks for driving this. After reading the entire flip, I have the
>>> following questions:
>>>
>>>
>>>
>>>
>>> 1. In the sequence diagram, it appears that there is a missing step for
>>> obtaining the refresh handler from the catalog during the suspend operation.
>>>
>>>
>>>
>>>
>>> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
>>> The workflow it creates is marked as a "one-time workflow". This is
>>> different
>>>
>>> from a "periodic workflow," and it appears to be a one-off execution. Is
>>> this actually referring to the Refresh command in FLIP-435?
>>>
>>>
>>>
>>>
>>> 3. The workflow-scheduler.type has no default value; should it be set to
>>> CRON by default?
>>>
>>>
>>>
>>>
>>> 4. It appears that in the section on `public interfaces`, within
>>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>>
>>> `CreateWorkflow

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-07 Thread Ron Liu
> 4. It appears that in the section on `public interfaces`, within
`WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to

`CreateWorkflowOperation`, right?

After discussing with Xuyang offline: we need to support both periodic and
one-time workflows, and they need different information. For example, a
periodic workflow needs a cron expression, while a one-time workflow needs
the refresh partition, the downstream cascading materialized tables, etc.
Therefore, CreateWorkflowOperation will correspondingly have two different
implementation classes, which will be cleaner for both the implementer and
the caller.
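
A small illustrative sketch of what those two implementation classes might carry;
field and class names here are assumptions, not the FLIP-448 definitions.

public class CreateWorkflowSketch {

    interface CreateRefreshWorkflow {}

    /** Periodic refresh: needs a cron expression describing the refresh schedule. */
    static final class CreatePeriodicRefreshWorkflow implements CreateRefreshWorkflow {
        final String cronExpression;

        CreatePeriodicRefreshWorkflow(String cronExpression) {
            this.cronExpression = cronExpression;
        }
    }

    /** One-time refresh: needs the partition(s) to refresh and whether to cascade downstream. */
    static final class CreateOneTimeRefreshWorkflow implements CreateRefreshWorkflow {
        final java.util.Map<String, String> partitionSpec;
        final boolean cascadeToDownstreamTables;

        CreateOneTimeRefreshWorkflow(
                java.util.Map<String, String> partitionSpec, boolean cascadeToDownstreamTables) {
            this.partitionSpec = partitionSpec;
            this.cascadeToDownstreamTables = cascadeToDownstreamTables;
        }
    }
}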

Best,
Ron

Ron Liu  于2024年5月6日周一 20:48写道:

> Hi, Xuyang
>
> Thanks for joining this discussion
>
> > 1. In the sequence diagram, it appears that there is a missing step for
> obtaining the refresh handler from the catalog during the suspend operation.
>
> Good catch
>
> > 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
> The workflow it creates is marked as a "one-time workflow". This is
> different
>
> from a "periodic workflow," and it appears to be a one-off execution. Is
> this actually referring to the Refresh command in FLIP-435?
>
> The cascade refresh is a future work, we don't propose the corresponding
> syntax in FLIP-435. However, intuitively, it would be an extension of the
> Refresh command in FLIP-435.
>
> > 3. The workflow-scheduler.type has no default value; should it be set to
> CRON by default?
>
> Firstly, CRON is not a workflow scheduler. Secondly, I believe that
> configuring the Scheduler should be an action that users are aware of, and
> default values should not be set.
>
> > 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
> Sorry, I don't get your point. Can you give more description?
>
> Best,
> Ron
>
> Xuyang  于2024年5月6日周一 20:26写道:
>
>> Hi, Ron.
>>
>> Thanks for driving this. After reading the entire flip, I have the
>> following questions:
>>
>>
>>
>>
>> 1. In the sequence diagram, it appears that there is a missing step for
>> obtaining the refresh handler from the catalog during the suspend operation.
>>
>>
>>
>>
>> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
>> The workflow it creates is marked as a "one-time workflow". This is
>> different
>>
>> from a "periodic workflow," and it appears to be a one-off execution. Is
>> this actually referring to the Refresh command in FLIP-435?
>>
>>
>>
>>
>> 3. The workflow-scheduler.type has no default value; should it be set to
>> CRON by default?
>>
>>
>>
>>
>> 4. It appears that in the section on `public interfaces`, within
>> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>>
>> `CreateWorkflowOperation`, right?
>>
>>
>>
>>
>> --
>>
>> Best!
>> Xuyang
>>
>>
>>
>>
>>
>> At 2024-04-22 14:41:39, "Ron Liu"  wrote:
>> >Hi, Dev
>> >
>> >I would like to start a discussion about FLIP-448: Introduce Pluggable
>> >Workflow Scheduler Interface for Materialized Table.
>> >
>> >In FLIP-435[1], we proposed Materialized Table, which has two types of
>> data
>> >refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
>> >mode, the Materialized Table relies on a workflow scheduler to perform
>> >periodic refresh operation to achieve the desired data freshness.
>> >
>> >There are numerous open-source workflow schedulers available, with
>> popular
>> >ones including Airflow and DolphinScheduler. To enable Materialized Table
>> >to work with different workflow schedulers, we propose a pluggable
>> workflow
>> >scheduler interface for Materialized Table in this FLIP.
>> >
>> >For more details, see FLIP-448 [2]. Looking forward to your feedback.
>> >
>> >[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
>> >[2]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
>> >
>> >Best,
>> >Ron
>>
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-06 Thread Ron Liu
Hi, Xuyang

Thanks for joining this discussion

> 1. In the sequence diagram, it appears that there is a missing step for
obtaining the refresh handler from the catalog during the suspend operation.

Good catch

> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
The workflow it creates is marked as a "one-time workflow". This is
different

from a "periodic workflow," and it appears to be a one-off execution. Is
this actually referring to the Refresh command in FLIP-435?

Cascade refresh is future work; we don't propose the corresponding syntax
in FLIP-435. However, intuitively, it would be an extension of the Refresh
command in FLIP-435.

> 3. The workflow-scheduler.type has no default value; should it be set to
CRON by default?

Firstly, CRON is not a workflow scheduler. Secondly, I believe that
configuring the Scheduler should be an action that users are aware of, and
default values should not be set.

> 4. It appears that in the section on `public interfaces`, within
`WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to

`CreateWorkflowOperation`, right?

Sorry, I don't get your point. Could you describe it in more detail?

Best,
Ron

Xuyang  于2024年5月6日周一 20:26写道:

> Hi, Ron.
>
> Thanks for driving this. After reading the entire flip, I have the
> following questions:
>
>
>
>
> 1. In the sequence diagram, it appears that there is a missing step for
> obtaining the refresh handler from the catalog during the suspend operation.
>
>
>
>
> 2. The term "cascade refresh" does not seem to be mentioned in FLIP-435.
> The workflow it creates is marked as a "one-time workflow". This is
> different
>
> from a "periodic workflow," and it appears to be a one-off execution. Is
> this actually referring to the Refresh command in FLIP-435?
>
>
>
>
> 3. The workflow-scheduler.type has no default value; should it be set to
> CRON by default?
>
>
>
>
> 4. It appears that in the section on `public interfaces`, within
> `WorkflowOperation`, `CreatePeriodicWorkflowOperation` should be changed to
>
> `CreateWorkflowOperation`, right?
>
>
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> At 2024-04-22 14:41:39, "Ron Liu"  wrote:
> >Hi, Dev
> >
> >I would like to start a discussion about FLIP-448: Introduce Pluggable
> >Workflow Scheduler Interface for Materialized Table.
> >
> >In FLIP-435[1], we proposed Materialized Table, which has two types of
> data
> >refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
> >mode, the Materialized Table relies on a workflow scheduler to perform
> >periodic refresh operation to achieve the desired data freshness.
> >
> >There are numerous open-source workflow schedulers available, with popular
> >ones including Airflow and DolphinScheduler. To enable Materialized Table
> >to work with different workflow schedulers, we propose a pluggable
> workflow
> >scheduler interface for Materialized Table in this FLIP.
> >
> >For more details, see FLIP-448 [2]. Looking forward to your feedback.
> >
> >[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> >[2]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> >
> >Best,
> >Ron
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-05-04 Thread Ron Liu
Hi, Lincoln

Thanks for joining this discussion.

After rethinking, I think your suggestion makes sense. Although deleting
the workflow on the scheduler currently only requires the RefreshHandler,
if we support cascading deletion in the future, the DeleteWorkflowOperation
can provide the necessary information without the need to introduce a new
interface.

I've updated the public interface section of FLIP.

Best,
Ron

Lincoln Lee  于2024年4月30日周二 21:27写道:

> Thanks Ron for starting this FLIP! It will complete the user story for
> FLIP-435[1].
>
> Regarding the WorkflowOperation, I have a question about whether we
> should add Delete/DropWorkflowOperation as well for when the
> Materialized Table is dropped or refresh mode changed from full to
> continuous?
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines?src=contextnavpagetreemode
>
>
> Best,
> Lincoln Lee
>
>
>  于2024年4月30日周二 15:37写道:
>
> > Hello Ron, thank you for your detailed answers!
> >
> > For the Visitor pattern, I thought about it the other way around, so that
> > operations visit the scheduler, and not vice-versa :) In this way
> > operations can get the required information in order to be executed in a
> > tailored way.
> >
> > Thank you for your effort, but, as you say:
> > > furthermore, I think the current does not see the benefits of the time,
> > simpler instead of better, similar to the design of
> > CatalogModificationEvent[2] and CatalogModificationListener[3], the
> > developer only needs instanceof judgment.
> >
> > In Java, most of the time, `instanceof` is considered an anti-pattern;
> > that's why I was also thinking about a command pattern (every operation
> > defines an `execute` method). However, I also understand this part is not
> > crucial for the FLIP under discussion, and the implementation details can
> > simply wait for the PRs to come.
> >
> > > After discussing with Shengkai offline, there is no need for this REST
> > API
> > to support multiple tables to be refreshed at the same time, so it would
> be
> > more appropriate to put the materialized table identifier in the path of
> > the URL, thanks for the suggestion.
> >
> > Very good!
> >
> > Thank you!
> > On Apr 29, 2024 at 05:04 +0200, Ron Liu , wrote:
> > > Hi, Lorenzo
> > >
> > > > I have a question there: how can the gateway update the
> refreshHandler
> > in
> > > the Catalog before getting it from the scheduler?
> > >
> > > The refreshHandler in CatalogMateriazedTable is null before getting it
> > from
> > > the scheduler, you can look at the CatalogMaterializedTable.Builder[1]
> > for
> > > more details.
> > >
> > > > You have a typo here: WorkflowScheudler -> WorkflowScheduler :)
> > >
> > > Fix it now, thanks very much.
> > >
> > > > For the operations part, I still think that the FLIP would benefit
> from
> > > providing a specific pattern for operations. You could either propose a
> > > command pattern [1] or a visitor pattern (where the scheduler visits
> the
> > > operation to get relevant info) [2] for those operations at your
> choice.
> > >
> > > Thank you for your input, I find it very useful. I tried to understand
> > your
> > > thinking through code and implemented the following pseudo code using
> the
> > > visitor design pattern:
> > > 1. first defined WorkflowOperationVisitor, providing several overloaded
> > > visit methods.
> > >
> > > public interface WorkflowOperationVisitor {
> > >
> > >     <T> T visit(CreateWorkflowOperation<T> createWorkflowOperation);
> > >
> > >     void visit(ModifyWorkflowOperation operation);
> > > }
> > >
> > > 2. then in the WorkflowOperation add the accept method.
> > >
> > > @PublicEvolving
> > > public interface WorkflowOperation {
> > >
> > > void accept(WorkflowOperationVisitor visitor);
> > > }
> > >
> > >
> > > 3. in the WorkflowScheduler call the implementation class of
> > > WorkflowOperationVisitor, complete the corresponding operations.
> > >
> > > I recognize this design pattern purely from a code design point of
> view,
> > > but from the point of our specific scenario:
> > > 1. For CreateWorkflowOperation, the visit method needs to return
> > > RefreshHandler, for ModifyWorkflowOperation

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-28 Thread Ron Liu
Hi, Lorenzo

> I have a question there: how can the gateway update the refreshHandler in
the Catalog before getting it from the scheduler?

The refreshHandler in CatalogMaterializedTable is null before getting it
from the scheduler; you can look at CatalogMaterializedTable.Builder[1] for
more details.

> You have a typo here: WorkflowScheudler -> WorkflowScheduler :)

Fix it now, thanks very much.

> For the operations part, I still think that the FLIP would benefit from
providing a specific pattern for operations. You could either propose a
command pattern [1] or a visitor pattern (where the scheduler visits the
operation to get relevant info) [2] for those operations at your choice.

Thank you for your input, I find it very useful. I tried to understand your
thinking through code and implemented the following pseudocode using the
visitor design pattern:
1. First, I defined WorkflowOperationVisitor, providing several overloaded
visit methods.

public interface WorkflowOperationVisitor {

    <T> T visit(CreateWorkflowOperation<T> createWorkflowOperation);

    void visit(ModifyWorkflowOperation operation);
}

2. Then I added the accept method to WorkflowOperation.

@PublicEvolving
public interface WorkflowOperation {

    void accept(WorkflowOperationVisitor visitor);
}


3. Finally, in the WorkflowScheduler, call the implementation class of
WorkflowOperationVisitor to complete the corresponding operations.

I recognize the value of this design pattern purely from a code design
point of view, but from the point of view of our specific scenario:
1. For CreateWorkflowOperation, the visit method needs to return a
RefreshHandler, while for ModifyWorkflowOperation, such as suspend and
resume, the visit method doesn't need to return a RefreshHandler. As a
result, the visit/accept signatures currently can't be unified across the
different WorkflowOperations, so I think the visitor may not be applicable
here.

2. In addition, I think using the visitor pattern will add complexity for
the WorkflowScheduler implementer, who needs to implement one more
interface, WorkflowOperationVisitor. This interface is not for the engine
to use, so I don't see any benefit from this design at the moment.

3. Furthermore, I don't see the benefit at this time; simpler is better
here. Similar to the design of CatalogModificationEvent[2] and
CatalogModificationListener[3], the developer only needs an instanceof
check.

To summarize, I don't think there is a need to introduce a command or
visitor pattern at present.
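
For comparison, a minimal sketch of the simpler instanceof-based dispatch referred to
above, in the spirit of how CatalogModificationListener implementations inspect event
subtypes; all class names here are illustrative, not the FLIP-448 definitions.

public class SchedulerDispatchSketch {

    interface WorkflowOperation {}

    static final class CreateWorkflowOperation implements WorkflowOperation {}

    static final class ModifyWorkflowOperation implements WorkflowOperation {}

    static void handle(WorkflowOperation operation) {
        // A scheduler implementation simply branches on the concrete operation type
        // instead of implementing an extra visitor interface.
        if (operation instanceof CreateWorkflowOperation) {
            // create the workflow in the external scheduler and build a refresh handler
        } else if (operation instanceof ModifyWorkflowOperation) {
            // suspend/resume/update the workflow identified by its refresh handler
        } else {
            throw new UnsupportedOperationException(
                    "Unsupported workflow operation: " + operation.getClass().getName());
        }
    }
}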

> About the REST API, I will wait for your offline discussion :)

After discussing with Shengkai offline, there is no need for this REST API
to support refreshing multiple tables at the same time, so it would be more
appropriate to put the materialized table identifier in the path of the
URL. Thanks for the suggestion.

[1]
https://github.com/apache/flink/blob/e412402ca4dfc438e28fb990dc53ea7809430aee/flink-table/flink-table-common/src/main/java/org/apache/flink/table/catalog/CatalogMaterializedTable.java#L264
[2]
https://github.com/apache/flink/blob/b1544e4e513d2b75b350c20dbb1c17a8232c22fd/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/catalog/listener/CatalogModificationEvent.java#L28
[3]
https://github.com/apache/flink/blob/b1544e4e513d2b75b350c20dbb1c17a8232c22fd/flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/catalog/listener/CatalogModificationListener.java#L31

Best,
Ron

Ron Liu  于2024年4月28日周日 23:53写道:

> Hi, Shengkai
>
> Thanks for your feedback and suggestion, it looks very useful for this
> proposal, regarding your question I made the following optimization:
>
> > *WorkflowScheduler*
> > 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> > 2. Could you give us an example about how to configure the scheduler?
>
> 1. Added a new WorkflowException, WorkflowScheduler's related method
> signature will throw WorkflowException, when creating or modifying Workflow
> encountered an exception, so that the framework will sense and deal with it.
>
> 2. Added a new Configuration section, introduced a new Option, and gave an
> example of how to define the Scheduler in flink-conf.yaml.
>
> > *SQL Gateway*
> > 1. SqlGatewayService requires Session as the input, but the REST API
> doesn't need any Session information.
> > 2. Use "-" instead of "_" in the REST URI and camel case for fields in
> request/response
> > 3. Do we need scheduleTime and scheduleTimeFormat together?
>
> 1. If it is designed as a synchronous API, it may lead to network jitter,
> thread resource exhaustion and other problems, which I have not considered
> before. The asynchronous API, although increasing the cost of use for the
> user, is friendly to the SqlGatewayService, as well as the Client thread
> resources. In summary as discussed offline, so I also tend to think that
> all

Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-28 Thread Ron Liu
Hi, Shengkai

Thanks for your feedback and suggestions; they look very useful for this
proposal. Regarding your questions, I made the following optimizations:

> *WorkflowScheduler*
> 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> 2. Could you give us an example about how to configure the scheduler?

1. Added a new WorkflowException. WorkflowScheduler's related method
signatures will throw WorkflowException when an exception is encountered
while creating or modifying a workflow, so that the framework can detect
and handle it.

2. Added a new Configuration section, introduced a new Option, and gave an
example of how to define the Scheduler in flink-conf.yaml.
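
As a sketch of what such an option could look like; the exact key, type and description
are defined in FLIP-448, so treat this as illustrative only, including the deliberate
absence of a default value.

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class WorkflowSchedulerOptionsSketch {

    // No default value: picking a scheduler should be an explicit user decision,
    // as discussed earlier in this thread.
    public static final ConfigOption<String> WORKFLOW_SCHEDULER_TYPE =
            ConfigOptions.key("workflow-scheduler.type")
                    .stringType()
                    .noDefaultValue()
                    .withDescription("Identifier of the workflow scheduler plugin to use.");
}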

> *SQL Gateway*
> 1. SqlGatewayService requires Session as the input, but the REST API
doesn't need any Session information.
> 2. Use "-" instead of "_" in the REST URI and camel case for fields in
request/response
> 3. Do we need scheduleTime and scheduleTimeFormat together?

1. If it is designed as a synchronous API, it may lead to problems such as
network jitter and thread-resource exhaustion, which I had not considered
before. The asynchronous API, although it increases the cost of use for the
user, is friendly to the SqlGatewayService as well as to the client's
thread resources. In summary, as discussed offline, I also tend to think
that all SqlGateway APIs should be unified: all asynchronous and bound to a
session. I have updated the REST API section in the FLIP.

2. thanks for the reminder, it has been updated

3. After rethinking, I think it can indeed be simpler; there is no need to
pass in a custom time format. scheduleTime can be unified to the SQL
standard timestamp format 'yyyy-MM-dd HH:mm:ss', which is able to satisfy
the time-related needs of materialized tables.
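
A tiny sketch of how a gateway-side component could parse that format with java.time;
the class and method names are made up for illustration.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ScheduleTimeSketch {

    private static final DateTimeFormatter SQL_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    public static LocalDateTime parseScheduleTime(String scheduleTime) {
        // e.g. "2024-05-09 00:00:00" -> 2024-05-09T00:00
        return LocalDateTime.parse(scheduleTime, SQL_TIMESTAMP);
    }
}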

Based on your feedback, I have optimized and updated the FLIP related
section.

Best,
Ron


Shengkai Fang  于2024年4月28日周日 15:47写道:

> Hi, Liu.
>
> Thanks for your proposal. I have some questions about the FLIP:
>
> *WorkflowScheduler*
>
> 1. How to get the exception details if `modifyRefreshWorkflow` fails?
> 2. Could you give us an example about how to configure the scheduler?
>
> *SQL Gateway*
>
> 1. SqlGatewayService requires Session as the input, but the REST API
> doesn't need any Session information.
>
> From the perspective of a gateway developer, I tend to unify the API of the
> SQL gateway, binding all concepts to the session. On the one hand, this
> approach allows us to reduce maintenance and understanding costs, as we
> only need to maintain one set of architecture to complete basic concepts.
> On the other hand, the benefits of an asynchronous architecture are
> evident: we maintain state on the server side. If the request is a long
> connection, even in the face of network layer jitter, we can still find the
> original result through session and operation handles.
>
> Using asynchronous APIs may increase the development cost for users, but
> from a platform perspective, if a request remains in a blocking state for a
> long time, it also becomes a burden on the platform's JVM. This is because
> thread switching and maintenance require certain resources.
>
> 2. Use "-" instead of "_" in the REST URI and camel case for fields in
> request/response
>
> Please follow the Flink REST Design.
>
> 3. Do we need scheduleTime and scheduleTimeFormat together?
>
> I think we can use SQL timestamp format or ISO timestamp format. It is not
> necessary to pass time in any specific format.
>
> https://en.wikipedia.org/wiki/ISO_8601
>
> Best,
> Shengkai
>


Re: [VOTE] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-25 Thread Ron Liu
+1(binding)

Best,
Ron

Rui Fan <1996fan...@gmail.com> 于2024年4月26日周五 12:55写道:

> +1(binding)
>
> Best,
> Rui
>
> On Fri, Apr 26, 2024 at 10:26 AM Muhammet Orazov
>  wrote:
>
> > Hey Xia,
> >
> > +1 (non-binding)
> >
> > Thanks and best,
> > Muhammet
> >
> > On 2024-04-26 02:21, Xia Sun wrote:
> > > Hi everyone,
> > >
> > > I'd like to start a vote on FLIP-445: Support dynamic parallelism
> > > inference
> > > for HiveSource[1] which has been discussed in this thread [2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> > > objection or
> > > not enough votes.
> > >
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-445%3A+Support+dynamic+parallelism+inference+for+HiveSource
> > > [2] https://lists.apache.org/thread/4k1qx6lodhbkknkhjyl0lq9bx8fcpjvn
> > >
> > >
> > > Best,
> > > Xia
> >
>


Re: [DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-25 Thread Ron Liu
ided, we don't get the complete information and how does this
interface show the information about the background? So I don't think it is
necessary.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines

Best,
Ron



Feng Jin  于2024年4月25日周四 00:46写道:

> Hi Ron
>
> Thank you for initiating this FLIP.
>
> My current questions are as follows:
>
> 1. From my current understanding, the workflow handle should not be bound
> to the Dynamic Table. Therefore, if the workflow is modified, does it mean
> that the scheduling information corresponding to the Dynamic Table will be
> lost?
>
> 2. Regarding the status information of the workflow, I am wondering if it
> is necessary to provide an interface to display the backend scheduling
> information? This would make it more convenient to view the execution
> status of backend jobs.
>
>
> Best,
> Feng
>
>
> On Wed, Apr 24, 2024 at 3:24 PM 
> wrote:
>
> > Hello Ron Liu! Thank you for your FLIP!
> >
> > Here are my considerations:
> >
> > 1.
> > About the Operations interfaces, how can they be empty?
> > Should not they provide at least a `run` or `execute` method (similar to
> > the command pattern)?
> > In this way, their implementation can wrap all the implementations
> details
> > of particular schedulers, and the scheduler can simply execute the
> command.
> > In general, I think a simple sequence diagram showcasing the interaction
> > between the interfaces would be awesome to better understand the concept.
> >
> > 2.
> > What about the RefreshHandler, I cannot find a definition of its
> interface
> > here.
> > Is it out of scope for this FLIP?
> >
> > 3.
> > For the SqlGatewayService arguments:
> >
> > boolean isPeriodic,
> > @Nullable String scheduleTime,
> > @Nullable String scheduleTimeFormat,
> >
> > If it is periodic, where is the period?
> > For the scheduleTime and format, why not simply pass an instance of
> > LocalDateTime or similar? The gateway should not have the responsibility
> to
> > parse the time.
> >
> > 4.
> > For the REST API:
> > wouldn't it be better (more REST) to move the `mt_identifier` to the URL?
> > E.g.: v3/materialized_tables/<mt_identifier>/refresh
> >
> > Thank you!
> > On Apr 22, 2024 at 08:42 +0200, Ron Liu , wrote:
> > > Hi, Dev
> > >
> > > I would like to start a discussion about FLIP-448: Introduce Pluggable
> > > Workflow Scheduler Interface for Materialized Table.
> > >
> > > In FLIP-435[1], we proposed Materialized Table, which has two types of
> > data
> > > refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
> > > mode, the Materialized Table relies on a workflow scheduler to perform
> > > periodic refresh operation to achieve the desired data freshness.
> > >
> > > There are numerous open-source workflow schedulers available, with
> > popular
> > > ones including Airflow and DolphinScheduler. To enable Materialized
> Table
> > > to work with different workflow schedulers, we propose a pluggable
> > workflow
> > > scheduler interface for Materialized Table in this FLIP.
> > >
> > > For more details, see FLIP-448 [2]. Looking forward to your feedback.
> > >
> > > [1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > >
> > > Best,
> > > Ron
> >
>


[DISCUSS] FLIP-448: Introduce Pluggable Workflow Scheduler Interface for Materialized Table

2024-04-22 Thread Ron Liu
Hi, Dev

I would like to start a discussion about FLIP-448: Introduce Pluggable
Workflow Scheduler Interface for Materialized Table.

In FLIP-435[1], we proposed Materialized Table, which has two types of data
refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
mode, the Materialized Table relies on a workflow scheduler to perform
periodic refresh operation to achieve the desired data freshness.

There are numerous open-source workflow schedulers available, with popular
ones including Airflow and DolphinScheduler. To enable Materialized Table
to work with different workflow schedulers, we propose a pluggable workflow
scheduler interface for Materialized Table in this FLIP.

For more details, see FLIP-448 [2]. Looking forward to your feedback.

[1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table

Best,
Ron


[RESULT][VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-22 Thread Ron Liu
Hi, Dev

I'm happy to announce that FLIP-435: Introduce a New Materialized Table for
Simplifying Data Pipelines[1] has been accepted with 13 approving votes (8
binding) [2]

- Ron Liu(binding)
- Feng Jin
- Rui Fan(binding)
- Yuepeng Pan
- Ahmed Hamdy
- Ferenc Csaky
- Lincoln Lee(binding)
- Leonard Xu(binding)
- Jark Wu(binding)
- Yun Tang(binding)
- Jinsong Li(binding)
- Zhongqiang Gong
- Martijn Visser(binding)

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
[2] https://lists.apache.org/thread/woj27nsmx5xd7p87ryfo8h6gx37n3wlx

Best,
Ron


Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-18 Thread Ron Liu
Hi, Xia

Thanks for updating, looks good to me.

Best,
Ron

Xia Sun  于2024年4月18日周四 19:11写道:

> Hi Ron,
> Yes, presenting it in a table might be more intuitive. I have already added
> the table in the "Public Interfaces | New Config Option" chapter of FLIP.
> PTAL~
>
> Ron Liu  于2024年4月18日周四 18:10写道:
>
> > Hi, Xia
> >
> > Thanks for your reply.
> >
> > > That means, in terms
> > of priority, `table.exec.hive.infer-source-parallelism` >
> > `table.exec.hive.infer-source-parallelism.mode`.
> >
> > I still have some confusion, if the
> > `table.exec.hive.infer-source-parallelism`
> > >`table.exec.hive.infer-source-parallelism.mode`, currently
> > `table.exec.hive.infer-source-parallelism` default value is true, that
> > means always static parallelism inference work? Or perhaps after this
> FLIP,
> > we changed the default behavior of
> > `table.exec.hive.infer-source-parallelism` to indicate dynamic
> parallelism
> > inference when enabled.
> > I think you should list the various behaviors of these two options that
> > coexist in FLIP by a table, only then users can know how the dynamic and
> > static parallelism inference work.
> >
> > Best,
> > Ron
> >
> > Xia Sun  于2024年4月18日周四 16:33写道:
> >
> > > Hi Ron and Lijie,
> > > Thanks for joining the discussion and sharing your suggestions.
> > >
> > > > the InferMode class should also be introduced in the Public
> Interfaces
> > > > section!
> > >
> > >
> > > Thanks for the reminder, I have now added the InferMode class to the
> > Public
> > > Interfaces section as well.
> > >
> > > > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked
> > through
> > > > the code that the default value is 1000?
> > >
> > >
> > > I have checked and the default value of
> > > `table.exec.hive.infer-source-parallelism.max` is indeed 1000. This has
> > > been corrected in the FLIP.
> > >
> > > > how are`table.exec.hive.infer-source-parallelism` and
> > > > `table.exec.hive.infer-source-parallelism.mode` compatible?
> > >
> > >
> > > This is indeed a critical point. The current plan is to deprecate
> > > `table.exec.hive.infer-source-parallelism` but still utilize it as the
> > main
> > > switch for enabling automatic parallelism inference. That means, in
> terms
> > > of priority, `table.exec.hive.infer-source-parallelism` >
> > > `table.exec.hive.infer-source-parallelism.mode`. In future versions, if
> > > `table.exec.hive.infer-source-parallelism` is removed, this logic will
> > also
> > > need to be revised, leaving only
> > > `table.exec.hive.infer-source-parallelism.mode` as the basis for
> deciding
> > > whether to enable parallelism inference. I have also added this
> > description
> > > to the FLIP.
> > >
> > >
> > > > In FLIP-367 it is supported to be able to set the Source's
> parallelism
> > > > individually, if in the future HiveSource also supports this feature,
> > > > however, the default value of
> > > > `table.exec.hive.infer-source-parallelism.mode` is
> `InferMode.DYNAMIC`,
> > > at
> > > > this point will the parallelism be dynamically derived or will the
> > > manually
> > > > set parallelism take effect, and who has the higher priority?
> > >
> > >
> > > From my understanding, 'manually set parallelism' has the higher
> > priority,
> > > just like one of the preconditions for the effectiveness of dynamic
> > > parallelism inference in the AdaptiveBatchScheduler is that the
> vertex's
> > > parallelism isn't set. I believe whether it's static inference or
> dynamic
> > > inference, the manually set parallelism by the user should be
> respected.
> > >
> > > > The `InferMode.NONE` option.
> > >
> > > Currently, 'adding InferMode.NONE' seems to be the prevailing opinion.
> I
> > > will add InferMode.NONE as one of the Enum options in InferMode class.
> > >
> > > Best,
> > > Xia
> > >
> > > Lijie Wang  于2024年4月18日周四 13:50写道:
> > >
> > > > Thanks for driving the discussion.
> > > >
> > > > +1 for the proposal and +1 for the `InferMode.NONE` option.
> > > >
> > > > Best,
> > > > Lijie
> > > >
> > > > Ron liu  于2024年4月1

Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-18 Thread Ron Liu
Hi, Xia

Thanks for your reply.

> That means, in terms
of priority, `table.exec.hive.infer-source-parallelism` >
`table.exec.hive.infer-source-parallelism.mode`.

I still have some confusion. If `table.exec.hive.infer-source-parallelism`
has higher priority than `table.exec.hive.infer-source-parallelism.mode`,
and `table.exec.hive.infer-source-parallelism` currently defaults to true,
does that mean static parallelism inference always takes effect? Or perhaps
after this FLIP we change the default behavior of
`table.exec.hive.infer-source-parallelism` to indicate dynamic parallelism
inference when enabled.
I think you should list in the FLIP, in a table, the various behaviors when
these two options coexist; only then can users know how dynamic and static
parallelism inference work.

Best,
Ron

Xia Sun  于2024年4月18日周四 16:33写道:

> Hi Ron and Lijie,
> Thanks for joining the discussion and sharing your suggestions.
>
> > the InferMode class should also be introduced in the Public Interfaces
> > section!
>
>
> Thanks for the reminder, I have now added the InferMode class to the Public
> Interfaces section as well.
>
> > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked through
> > the code that the default value is 1000?
>
>
> I have checked and the default value of
> `table.exec.hive.infer-source-parallelism.max` is indeed 1000. This has
> been corrected in the FLIP.
>
> > how are`table.exec.hive.infer-source-parallelism` and
> > `table.exec.hive.infer-source-parallelism.mode` compatible?
>
>
> This is indeed a critical point. The current plan is to deprecate
> `table.exec.hive.infer-source-parallelism` but still utilize it as the main
> switch for enabling automatic parallelism inference. That means, in terms
> of priority, `table.exec.hive.infer-source-parallelism` >
> `table.exec.hive.infer-source-parallelism.mode`. In future versions, if
> `table.exec.hive.infer-source-parallelism` is removed, this logic will also
> need to be revised, leaving only
> `table.exec.hive.infer-source-parallelism.mode` as the basis for deciding
> whether to enable parallelism inference. I have also added this description
> to the FLIP.
>
>
> > In FLIP-367 it is supported to be able to set the Source's parallelism
> > individually, if in the future HiveSource also supports this feature,
> > however, the default value of
> > `table.exec.hive.infer-source-parallelism.mode` is `InferMode.DYNAMIC`,
> at
> > this point will the parallelism be dynamically derived or will the
> manually
> > set parallelism take effect, and who has the higher priority?
>
>
> From my understanding, 'manually set parallelism' has the higher priority,
> just like one of the preconditions for the effectiveness of dynamic
> parallelism inference in the AdaptiveBatchScheduler is that the vertex's
> parallelism isn't set. I believe whether it's static inference or dynamic
> inference, the manually set parallelism by the user should be respected.
>
> > The `InferMode.NONE` option.
>
> Currently, 'adding InferMode.NONE' seems to be the prevailing opinion. I
> will add InferMode.NONE as one of the Enum options in InferMode class.
>
> Best,
> Xia
>
> Lijie Wang  于2024年4月18日周四 13:50写道:
>
> > Thanks for driving the discussion.
> >
> > +1 for the proposal and +1 for the `InferMode.NONE` option.
> >
> > Best,
> > Lijie
> >
> > Ron liu  于2024年4月18日周四 11:36写道:
> >
> > > Hi, Xia
> > >
> > > Thanks for driving this FLIP.
> > >
> > > This proposal looks good to me overall. However, I have the following
> > minor
> > > questions:
> > >
> > > 1. FLIP introduced `table.exec.hive.infer-source-parallelism.mode` as a
> > new
> > > parameter, and the value is the enum class `InferMode`, I think the
> > > InferMode class should also be introduced in the Public Interfaces
> > section!
> > > 2. You mentioned in FLIP that the default value of
> > > `table.exec.hive.infer-source-parallelism.max` is 1024, I checked
> through
> > > the code that the default value is 1000?
> > > 3. I also agree with Muhammet's idea that there is no need to introduce
> > the
> > > option `table.exec.hive.infer-source-parallelism.enabled`, and that
> > > expanding the InferMode values will fulfill the need. There is another
> > > issue to consider here though, how are
> > > `table.exec.hive.infer-source-parallelism` and
> > > `table.exec.hive.infer-source-parallelism.mode` compatible?
> > > 4. In FLIP-367 it is supported to be able to set the Source's
> parallelism
> > > individually, if in the future HiveSource also su

Re: [DISCUSS] FLIP-445: Support dynamic parallelism inference for HiveSource

2024-04-17 Thread Ron liu
Hi, Xia

Thanks for driving this FLIP.

This proposal looks good to me overall. However, I have the following minor
questions:

1. The FLIP introduces `table.exec.hive.infer-source-parallelism.mode` as a
new parameter whose value is the enum class `InferMode`; I think the
InferMode class should also be introduced in the Public Interfaces section
(see the sketch after this list).
2. You mentioned in the FLIP that the default value of
`table.exec.hive.infer-source-parallelism.max` is 1024, but I checked the
code and the default value is 1000?
3. I also agree with Muhammet's idea that there is no need to introduce the
option `table.exec.hive.infer-source-parallelism.enabled`, and that
expanding the InferMode values will fulfill the need. There is another
issue to consider here, though: how are
`table.exec.hive.infer-source-parallelism` and
`table.exec.hive.infer-source-parallelism.mode` kept compatible?
4. FLIP-367 supports setting a source's parallelism individually. If
HiveSource also supports this feature in the future, given that the default
value of `table.exec.hive.infer-source-parallelism.mode` is
`InferMode.DYNAMIC`, will the parallelism be dynamically derived or will
the manually set parallelism take effect, and which has the higher
priority?
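
A rough sketch of the pieces questions 1 and 3 refer to; the enum constants and the
default value are assumptions based on this thread, and the authoritative definitions
belong to FLIP-445.

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class HiveInferParallelismSketch {

    /** Inference mode; NONE would disable parallelism inference entirely, as proposed above. */
    public enum InferMode {
        STATIC,
        DYNAMIC,
        NONE
    }

    public static final ConfigOption<InferMode> INFER_SOURCE_PARALLELISM_MODE =
            ConfigOptions.key("table.exec.hive.infer-source-parallelism.mode")
                    .enumType(InferMode.class)
                    .defaultValue(InferMode.DYNAMIC)
                    .withDescription(
                            "Which mode HiveSource uses to infer source parallelism.");
}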

Best,
Ron

Xia Sun  于2024年4月17日周三 12:08写道:

> Hi Jeyhun, Muhammet,
> Thanks for all the feedback!
>
> > Could you please mention the default values for the new configurations
> > (e.g., table.exec.hive.infer-source-parallelism.mode,
> > table.exec.hive.infer-source-parallelism.enabled,
> > etc) ?
>
>
> Thanks for your suggestion. I have supplemented the explanation regarding
> the default values.
>
> > Since we are introducing the mode as a configuration option,
> > could it make sense to have `InferMode.NONE` option also?
> > The `NONE` option would disable the inference.
>
>
> This is a good idea. Looking ahead, it could eliminate the need for
> introducing
> a new configuration option. I haven't identified any potential
> compatibility issues
> as yet. If there are no further ideas from others, I'll go ahead and update
> the FLIP to introduce InferMode.NONE.
>
> Best,
> Xia
>
> Muhammet Orazov  于2024年4月17日周三 10:31写道:
>
> > Hello Xia,
> >
> > Thanks for the FLIP!
> >
> > Since we are introducing the mode as a configuration option,
> > could it make sense to have `InferMode.NONE` option also?
> > The `NONE` option would disable the inference.
> >
> > This way we deprecate the `table.exec.hive.infer-source-parallelism`
> > and no additional `table.exec.hive.infer-source-parallelism.enabled`
> > option is required.
> >
> > What do you think?
> >
> > Best,
> > Muhammet
> >
> > On 2024-04-16 07:07, Xia Sun wrote:
> > > Hi everyone,
> > > I would like to start a discussion on FLIP-445: Support dynamic
> > > parallelism
> > > inference for HiveSource[1].
> > >
> > > FLIP-379[2] has introduced dynamic source parallelism inference for
> > > batch
> > > jobs, which can utilize runtime information to more accurately decide
> > > the
> > > source parallelism. As a follow-up task, we plan to implement the
> > > dynamic
> > > parallelism inference interface for HiveSource, and also switch the
> > > default
> > > static parallelism inference to dynamic parallelism inference.
> > >
> > > Looking forward to your feedback and suggestions, thanks.
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-445%3A+Support+dynamic+parallelism+inference+for+HiveSource
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-379%3A+Dynamic+source+parallelism+inference+for+batch+jobs
> > >
> > > Best regards,
> > > Xia
> >
>


Re: [VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-17 Thread Ron liu
+1(binding)

Best,
Ron

Ron liu  于2024年4月17日周三 14:27写道:

> Hi Dev,
>
> Thank you to everyone for the feedback on FLIP-435: Introduce a New
> Materialized Table for Simplifying Data Pipelines[1][2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or not enough votes.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
> [2] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
>
> Best,
> Ron
>


[VOTE] FLIP-435: Introduce a New Materialized Table for Simplifying Data Pipelines

2024-04-17 Thread Ron liu
Hi Dev,

Thank you to everyone for the feedback on FLIP-435: Introduce a New
Materialized Table for Simplifying Data Pipelines[1][2].

I'd like to start a vote for it. The vote will be open for at least 72
hours unless there is an objection or not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
[2] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs

Best,
Ron


Re: [ANNOUNCE] New Apache Flink Committer - Zakelly Lan

2024-04-14 Thread Ron liu
Congratulations!

Best,
Ron

Yuan Mei  于2024年4月15日周一 10:51写道:

> Hi everyone,
>
> On behalf of the PMC, I'm happy to let you know that Zakelly Lan has become
> a new Flink Committer!
>
> Zakelly has been continuously contributing to the Flink project since 2020,
> with a focus area on Checkpointing, State as well as frocksdb (the default
> on-disk state db).
>
> He leads several FLIPs to improve checkpoints and state APIs, including
> File Merging for Checkpoints and configuration/API reorganizations. He is
> also one of the main contributors to the recent efforts of "disaggregated
> state management for Flink 2.0" and drives the entire discussion in the
> mailing thread, demonstrating outstanding technical depth and breadth of
> knowledge.
>
> Beyond his technical contributions, Zakelly is passionate about helping the
> community in numerous ways. He spent quite some time setting up the Flink
> Speed Center and rebuilding the benchmark pipeline after the original one
> was out of lease. He helps build frocksdb and tests for the upcoming
> frocksdb release (bump rocksdb from 6.20.3->8.10).
>
> Please join me in congratulating Zakelly for becoming an Apache Flink
> committer!
>
> Best,
> Yuan (on behalf of the Flink PMC)
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-12 Thread Ron liu
FLIP-282 [1] has also introduced the Update & Delete API for modifying tables.

1.
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=235838061
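
For example (a sketch only; `user_orders` is a made-up table, and this assumes
a connector that implements the new row-level modification API), a batch job
can then issue statements such as:

    UPDATE user_orders SET status = 'CANCELLED' WHERE order_id = 1001;
    DELETE FROM user_orders WHERE create_time < DATE '2023-01-01';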

Best,
Ron

Ron liu  于2024年4月12日周五 19:49写道:

> Hi, jgrier
>
> Thanks for your insightful input.
>
> First of all, very much agree with you that it is a right direction that
> we should strive towards making Flink SQL more user-friendly, including
> simplifying the job execution parameters, execution modes, data processing
> pipeline definitions and maintenance, and so on.
> The goal of this proposal is also to simplify the data processing pipeline
> by proposing a new Dynamic Table that combines Dynamic Table + Continuous Query,
> so that users can focus more on the business itself. Our goal is also not
> to create new business scenarios, it's just that the current Table can't
> support this goal, so we need to propose a new type of Dynamic Table.
>
> In the traditional Hive warehouse and Lakehouse scenario, the common
> requirement from users begins with ingesting DB data such as MySQL and logs
> in real-time into the ODS layer of the data warehouse. Then, defining a
> series of ETL jobs to process and layer the raw data, with the general data
> flow being ODS -> DWD -> DWS -> ADS, ultimately serving different users.
>
> During the business process, the following scenarios may need to modify
> the data:
> 1. Creating partitioned tables and manually backfilling certain historical
> partitions to correct data, meaning overwriting partitions is necessary.
> 2. Deleting a set of rows for regulatory compliance, updating a set of
> rows for data correction, such as deleting sensitive user information in a
> GDPR scenario.
> 3. With changes in business requirements, adding some columns is necessary,
> but without refreshing historical partition data, so the new
> columns would only apply to the latest partitions.
>
> In the SQL standard regarding the definition of View, there are the
> following restrictions:
> 1. Partitioned view is not supported.
> 2. Modification of the data generated by views is not supported.
> 3. Alteration of a View's schema, such as adding columns, is not
> supported.
>
> Please correct me if my understanding is wrong.
>
> A materialized view represents the result of a select query and serves
> as an optimization technique mainly for query rewriting and
> computation acceleration, so it shares the same limitations as a View. If
> we use a materialized view, it can't meet our needs directly; we would have
> to extend its semantics, which conflicts with the standard. If we use a
> table, we don't have these concerns. Also, assuming we extended the
> materialized view semantics to allow modification, it would no longer be
> able to support query rewriting.
>
> Our proposal is indeed similar to the ability of materialized view, but
> considering the following two factors: firstly, we should try to follow the
> standard as much as possible without conflicting with it, and secondly,
> materialized view does not directly satisfy the scenario of modifying data,
> so using Table would be more appropriate.
>
> Although materialized view is also one of the candidates, it is not the most
> suitable option.
>
>
> > I'm actually against all of the other proposed names so I rank them
> equally
> last.  I don't think we need yet another new concept for this.  I think
> that will just add to users' confusion and learning curve which is already
> substantial with Flink.  We need to make things easier rather than harder.
>
> Also, just to clarify (sorry if my previous statement was not that
> accurate): this is not a new concept, it is just a new type of table,
> similar in capability to a materialized view, but it simplifies the data
> processing pipeline, which is also aligned with the long-term vision of
> Flink SQL.
>
>
> Best,
> Ron
>
>
> Jamie Grier  于2024年4月11日周四 05:59写道:
>
>> Sorry for coming very late to this thread.  I have not contributed much to
>> Flink publicly for quite some time but I have been involved with Flink,
>> daily, for years now and I'm keenly interested in where we take Flink SQL
>> going forward.
>>
>> Thanks for the proposal!!  I think it's definitely a step in the right
>> direction and I'm thrilled this is happening.  The end state I have in
>> mind
>> is that we get rid of execution modes as something users have to think
>> about and instead make sure the SQL a user writes completely describes
>> their intent.  In the case of this proposal the intent a user has is that
>> the system continually maintains an object (whatever we decide to call it)
>> that is the result of their qu

Re: [ANNOUNCE] New Apache Flink PMC Member - Lincoln Lee

2024-04-12 Thread Ron liu
Congratulations, Lincoln!

Best,
Ron

Junrui Lee  于2024年4月12日周五 18:54写道:

> Congratulations, Lincoln!
>
> Best,
> Junrui
>
> Aleksandr Pilipenko  于2024年4月12日周五 18:29写道:
>
> > Congratulations, Lincoln!
> >
> > Best Regards
> > Aleksandr
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Jing Ge

2024-04-12 Thread Ron liu
Congratulations, Jing!

Best,
Ron

Junrui Lee  于2024年4月12日周五 18:54写道:

> Congratulations, Jing!
>
> Best,
> Junrui
>
> Aleksandr Pilipenko  于2024年4月12日周五 18:28写道:
>
> > Congratulations, Jing!
> >
> > Best Regards,
> > Aleksandr
> >
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-12 Thread Ron liu
jie (Becket) Qin
> > > >
> > > >
> > > >
> > > > On Tue, Apr 9, 2024 at 5:46 AM Lincoln Lee 
> > > wrote:
> > > >
> > > > > Thanks Ron and Timo for your proposal!
> > > > >
> > > > > Here is my ranking:
> > > > >
> > > > > 1. Derived table -> extend the persistent semantics of derived
> table
> > in
> > > > SQL
> > > > >standard, with a strong association with query, and has industry
> > > > > precedents
> > > > >such as Google Looker.
> > > > >
> > > > > 2. Live Table ->  an alternative for 'dynamic table'
> > > > >
> > > > > 3. Materialized Table -> combination of the Materialized View and
> > > Table,
> > > > > but
> > > > > still a table which accept data changes
> > > > >
> > > > > 4. Materialized View -> need to extend understanding of the view to
> > > > accept
> > > > > data changes
> > > > >
> > > > > The reason for not adding 'Refresh Table' is I don't want to tell
> the
> > > > user
> > > > > to 'refresh a refresh table'.
> > > > >
> > > > >
> > > > > Best,
> > > > > Lincoln Lee
> > > > >
> > > > >
> > > > > Ron liu  于2024年4月9日周二 20:11写道:
> > > > >
> > > > > > Hi, Dev
> > > > > >
> > > > > > My rankings are:
> > > > > >
> > > > > > 1. Derived Table
> > > > > > 2. Materialized Table
> > > > > > 3. Live Table
> > > > > > 4. Materialized View
> > > > > >
> > > > > > Best,
> > > > > > Ron
> > > > > >
> > > > > >
> > > > > >
> > > > > > Ron liu  于2024年4月9日周二 20:07写道:
> > > > > >
> > > > > > > Hi, Dev
> > > > > > >
> > > > > > > After several rounds of discussion, there is currently no
> > consensus
> > > > on
> > > > > > the
> > > > > > > name of the new concept. Timo has proposed that we decide the
> > name
> > > > > > through
> > > > > > > a vote. This is a good solution when there is no clear
> > preference,
> > > so
> > > > > we
> > > > > > > will adopt this approach.
> > > > > > >
> > > > > > > Regarding the name of the new concept, there are currently five
> > > > > > candidates:
> > > > > > > 1. Derived Table -> taken by SQL standard
> > > > > > > 2. Materialized Table -> similar to SQL materialized view but a
> > > table
> > > > > > > 3. Live Table -> similar to dynamic tables
> > > > > > > 4. Refresh Table -> states what it does
> > > > > > > 5. Materialized View -> needs to extend the standard to support
> > > > > modifying
> > > > > > > data
> > > > > > >
> > > > > > > For the above five candidates, everyone can give your rankings
> > > based
> > > > on
> > > > > > > your preferences. You can choose up to five options or only
> > choose
> > > > some
> > > > > > of
> > > > > > > them.
> > > > > > > We will use a scoring rule, where the* first rank gets 5
> points,
> > > > second
> > > > > > > rank gets 4 points, third rank gets 3 points, fourth rank gets
> 2
> > > > > points,
> > > > > > > and fifth rank gets 1 point*.
> > > > > > > After the voting closes, I will score all the candidates based
> on
> > > > > > > everyone's votes, and the candidate with the highest score will
> > be
> > > > > chosen
> > > > > > > as the name for the new concept.
> > > > > > >
> > > > > > > The voting will last up to 72 hours and is expected to close
> this
> > > > > Friday.
> > > > > > > I look forward to everyone voting on the name in this thread.
> Of
> > > > > course,
&

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

My rankings are:

1. Derived Table
2. Materialized Table
3. Live Table
4. Materialized View

Best,
Ron



Ron liu  于2024年4月9日周二 20:07写道:

> Hi, Dev
>
> After several rounds of discussion, there is currently no consensus on the
> name of the new concept. Timo has proposed that we decide the name through
> a vote. This is a good solution when there is no clear preference, so we
> will adopt this approach.
>
> Regarding the name of the new concept, there are currently five candidates:
> 1. Derived Table -> taken by SQL standard
> 2. Materialized Table -> similar to SQL materialized view but a table
> 3. Live Table -> similar to dynamic tables
> 4. Refresh Table -> states what it does
> 5. Materialized View -> needs to extend the standard to support modifying
> data
>
> For the above five candidates, everyone can give your rankings based on
> your preferences. You can choose up to five options or only choose some of
> them.
> We will use a scoring rule, where the* first rank gets 5 points, second
> rank gets 4 points, third rank gets 3 points, fourth rank gets 2 points,
> and fifth rank gets 1 point*.
> After the voting closes, I will score all the candidates based on
> everyone's votes, and the candidate with the highest score will be chosen
> as the name for the new concept.
>
> The voting will last up to 72 hours and is expected to close this Friday.
> I look forward to everyone voting on the name in this thread. Of course, we
> also welcome new input regarding the name.
>
> Best,
> Ron
>
> Ron liu  于2024年4月9日周二 19:49写道:
>
>> Hi, Dev
>>
>> Sorry for my previous statement was not quite accurate. We will hold a
>> vote for the name within this thread.
>>
>> Best,
>> Ron
>>
>>
>> Ron liu  于2024年4月9日周二 19:29写道:
>>
>>> Hi, Timo
>>>
>>> Thanks for your reply.
>>>
>>> I agree with you that sometimes naming is more difficult. When no one
>>> has a clear preference, voting on the name is a good solution, so I'll send
>>> a separate email for the vote, clarify the rules for the vote, then let
>>> everyone vote.
>>>
>>> One other point to confirm, in your ranking there is an option for
>>> Materialized View, does it stand for the UPDATING Materialized View that
>>> you mentioned earlier in the discussion? If using Materialized View I think
>>> it is needed to extend it.
>>>
>>> Best,
>>> Ron
>>>
>>> Timo Walther  于2024年4月9日周二 17:20写道:
>>>
>>>> Hi Ron,
>>>>
>>>> yes naming is hard. But it will have large impact on trainings,
>>>> presentations, and the mental model of users. Maybe the easiest is to
>>>> collect ranking by everyone with some short justification:
>>>>
>>>>
>>>> My ranking (from good to not so good):
>>>>
>>>> 1. Refresh Table -> states what it does
>>>> 2. Materialized Table -> similar to SQL materialized view but a table
>>>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>>>> tables?
>>>> 4. Materialized View -> a bit broader than standard but still very
>>>> similar
>>>> 5. Derived table -> taken by standard
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>>
>>>> On 07.04.24 11:34, Ron liu wrote:
>>>> > Hi, Dev
>>>> >
>>>> > This is a summary letter. After several rounds of discussion, there
>>>> is a
>>>> > strong consensus about the FLIP proposal and the issues it aims to
>>>> address.
>>>> > The current point of disagreement is the naming of the new concept. I
>>>> have
>>>> > summarized the candidates as follows:
>>>> >
>>>> > 1. Derived Table (Inspired by Google Lookers)
>>>> >  - Pros: Google Lookers has introduced this concept, which is
>>>> designed
>>>> > for building Looker's automated modeling, aligning with our purpose
>>>> for the
>>>> > stream-batch automatic pipeline.
>>>> >
>>>> >  - Cons: The SQL standard uses derived table term extensively,
>>>> vendors
>>>> > adopt this for simply referring to a table within a subclause.
>>>> >
>>>> > 2. Materialized Table: It means materialize the query result to table,
>>>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
>>>&

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

After several rounds of discussion, there is currently no consensus on the
name of the new concept. Timo has proposed that we decide the name through
a vote. This is a good solution when there is no clear preference, so we
will adopt this approach.

Regarding the name of the new concept, there are currently five candidates:
1. Derived Table -> taken by SQL standard
2. Materialized Table -> similar to SQL materialized view but a table
3. Live Table -> similar to dynamic tables
4. Refresh Table -> states what it does
5. Materialized View -> needs to extend the standard to support modifying
data

For the above five candidates, everyone can give your rankings based on
your preferences. You can choose up to five options or only choose some of
them.
We will use a scoring rule, where the *first rank gets 5 points, second
rank gets 4 points, third rank gets 3 points, fourth rank gets 2 points,
and fifth rank gets 1 point*.
After the voting closes, I will score all the candidates based on
everyone's votes, and the candidate with the highest score will be chosen
as the name for the new concept.

The voting will last up to 72 hours and is expected to close this Friday. I
look forward to everyone voting on the name in this thread. Of course, we
also welcome new input regarding the name.

Best,
Ron

Ron liu  于2024年4月9日周二 19:49写道:

> Hi, Dev
>
> Sorry for my previous statement was not quite accurate. We will hold a
> vote for the name within this thread.
>
> Best,
> Ron
>
>
> Ron liu  于2024年4月9日周二 19:29写道:
>
>> Hi, Timo
>>
>> Thanks for your reply.
>>
>> I agree with you that sometimes naming is more difficult. When no one has
>> a clear preference, voting on the name is a good solution, so I'll send a
>> separate email for the vote, clarify the rules for the vote, then let
>> everyone vote.
>>
>> One other point to confirm, in your ranking there is an option for
>> Materialized View, does it stand for the UPDATING Materialized View that
>> you mentioned earlier in the discussion? If using Materialized View I think
>> it is needed to extend it.
>>
>> Best,
>> Ron
>>
>> Timo Walther  于2024年4月9日周二 17:20写道:
>>
>>> Hi Ron,
>>>
>>> yes naming is hard. But it will have large impact on trainings,
>>> presentations, and the mental model of users. Maybe the easiest is to
>>> collect ranking by everyone with some short justification:
>>>
>>>
>>> My ranking (from good to not so good):
>>>
>>> 1. Refresh Table -> states what it does
>>> 2. Materialized Table -> similar to SQL materialized view but a table
>>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>>> tables?
>>> 4. Materialized View -> a bit broader than standard but still very
>>> similar
>>> 5. Derived table -> taken by standard
>>>
>>> Regards,
>>> Timo
>>>
>>>
>>>
>>> On 07.04.24 11:34, Ron liu wrote:
>>> > Hi, Dev
>>> >
>>> > This is a summary letter. After several rounds of discussion, there is
>>> a
>>> > strong consensus about the FLIP proposal and the issues it aims to
>>> address.
>>> > The current point of disagreement is the naming of the new concept. I
>>> have
>>> > summarized the candidates as follows:
>>> >
>>> > 1. Derived Table (Inspired by Google Lookers)
>>> >  - Pros: Google Lookers has introduced this concept, which is
>>> designed
>>> > for building Looker's automated modeling, aligning with our purpose
>>> for the
>>> > stream-batch automatic pipeline.
>>> >
>>> >  - Cons: The SQL standard uses derived table term extensively,
>>> vendors
>>> > adopt this for simply referring to a table within a subclause.
>>> >
>>> > 2. Materialized Table: It means materialize the query result to table,
>>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
>>> > Dynamic Table's predecessor is also called Materialized Table.
>>> >
>>> > 3. Updating Table (From Timo)
>>> >
>>> > 4. Updating Materialized View (From Timo)
>>> >
>>> > 5. Refresh/Live Table (From Martijn)
>>> >
>>> > As Martijn said, naming is a headache, looking forward to more valuable
>>> > input from everyone.
>>> >
>>> > [1]
>>> >
>>> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
>>> > [2]
>>&

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Dev

Sorry, my previous statement was not quite accurate. We will hold a vote
for the name within this thread.

Best,
Ron


Ron liu  于2024年4月9日周二 19:29写道:

> Hi, Timo
>
> Thanks for your reply.
>
> I agree with you that sometimes naming is more difficult. When no one has
> a clear preference, voting on the name is a good solution, so I'll send a
> separate email for the vote, clarify the rules for the vote, then let
> everyone vote.
>
> One other point to confirm, in your ranking there is an option for
> Materialized View, does it stand for the UPDATING Materialized View that
> you mentioned earlier in the discussion? If using Materialized View I think
> it is needed to extend it.
>
> Best,
> Ron
>
> Timo Walther  于2024年4月9日周二 17:20写道:
>
>> Hi Ron,
>>
>> yes naming is hard. But it will have large impact on trainings,
>> presentations, and the mental model of users. Maybe the easiest is to
>> collect ranking by everyone with some short justification:
>>
>>
>> My ranking (from good to not so good):
>>
>> 1. Refresh Table -> states what it does
>> 2. Materialized Table -> similar to SQL materialized view but a table
>> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
>> tables?
>> 4. Materialized View -> a bit broader than standard but still very similar
>> 5. Derived table -> taken by standard
>>
>> Regards,
>> Timo
>>
>>
>>
>> On 07.04.24 11:34, Ron liu wrote:
>> > Hi, Dev
>> >
>> > This is a summary letter. After several rounds of discussion, there is a
>> > strong consensus about the FLIP proposal and the issues it aims to
>> address.
>> > The current point of disagreement is the naming of the new concept. I
>> have
>> > summarized the candidates as follows:
>> >
>> > 1. Derived Table (Inspired by Google Lookers)
>> >  - Pros: Google Lookers has introduced this concept, which is
>> designed
>> > for building Looker's automated modeling, aligning with our purpose for
>> the
>> > stream-batch automatic pipeline.
>> >
>> >  - Cons: The SQL standard uses derived table term extensively,
>> vendors
>> > adopt this for simply referring to a table within a subclause.
>> >
>> > 2. Materialized Table: It means materialize the query result to table,
>> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
>> > Dynamic Table's predecessor is also called Materialized Table.
>> >
>> > 3. Updating Table (From Timo)
>> >
>> > 4. Updating Materialized View (From Timo)
>> >
>> > 5. Refresh/Live Table (From Martijn)
>> >
>> > As Martijn said, naming is a headache, looking forward to more valuable
>> > input from everyone.
>> >
>> > [1]
>> >
>> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
>> > [2]
>> https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
>> > [3]
>> >
>> https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
>> >
>> > Best,
>> > Ron
>> >
>> > Ron liu  于2024年4月7日周日 15:55写道:
>> >
>> >> Hi, Lorenzo
>> >>
>> >> Thank you for your insightful input.
>> >>
>> >>>>> I think the 2 above twisted the materialized view concept to more
>> than
>> >> just an optimization for accessing pre-computed aggregates/filters.
>> >> I think that concept (at least in my mind) is now adherent to the
>> >> semantics of the words themselves ("materialized" and "view") than on
>> its
>> >> implementations in DBMs, as just a view on raw data that, hopefully, is
>> >> constantly updated with fresh results.
>> >> That's why I understand Timo's et al. objections.
>> >>
>> >> Your understanding of Materialized Views is correct. However, in our
>> >> scenario, an important feature is the support for Update & Delete
>> >> operations, which the current Materialized Views cannot fulfill. As we
>> >> discussed with Timo before, if Materialized Views needs to support data
>> >> modifications, it would require an extension of new keywords, such as
>> >> CREATING xxx (UPDATING) MATERIALIZED VIEW.
>> >>
>> >>>>> Still, I don't understand why we need another type of special table.
>

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-09 Thread Ron liu
Hi, Timo

Thanks for your reply.

I agree with you that naming is sometimes the harder part. When no one has a
clear preference, voting on the name is a good solution, so I'll send a
separate email for the vote, clarify the voting rules, and then let
everyone vote.

One other point to confirm: in your ranking there is an option for
Materialized View. Does it stand for the UPDATING Materialized View that
you mentioned earlier in the discussion? If we use Materialized View, I think
it would need to be extended.

Best,
Ron

Timo Walther  于2024年4月9日周二 17:20写道:

> Hi Ron,
>
> yes naming is hard. But it will have large impact on trainings,
> presentations, and the mental model of users. Maybe the easiest is to
> collect ranking by everyone with some short justification:
>
>
> My ranking (from good to not so good):
>
> 1. Refresh Table -> states what it does
> 2. Materialized Table -> similar to SQL materialized view but a table
> 3. Live Table -> nice buzzword, but maybe still too close to dynamic
> tables?
> 4. Materialized View -> a bit broader than standard but still very similar
> 5. Derived table -> taken by standard
>
> Regards,
> Timo
>
>
>
> On 07.04.24 11:34, Ron liu wrote:
> > Hi, Dev
> >
> > This is a summary letter. After several rounds of discussion, there is a
> > strong consensus about the FLIP proposal and the issues it aims to
> address.
> > The current point of disagreement is the naming of the new concept. I
> have
> > summarized the candidates as follows:
> >
> > 1. Derived Table (Inspired by Google Lookers)
> >  - Pros: Google Lookers has introduced this concept, which is
> designed
> > for building Looker's automated modeling, aligning with our purpose for
> the
> > stream-batch automatic pipeline.
> >
> >  - Cons: The SQL standard uses derived table term extensively,
> vendors
> > adopt this for simply referring to a table within a subclause.
> >
> > 2. Materialized Table: It means materialize the query result to table,
> > similar to Db2 MQT (Materialized Query Tables). In addition, Snowflake
> > Dynamic Table's predecessor is also called Materialized Table.
> >
> > 3. Updating Table (From Timo)
> >
> > 4. Updating Materialized View (From Timo)
> >
> > 5. Refresh/Live Table (From Martijn)
> >
> > As Martijn said, naming is a headache, looking forward to more valuable
> > input from everyone.
> >
> > [1]
> >
> https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
> > [2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
> > [3]
> >
> https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
> >
> > Best,
> > Ron
> >
> > Ron liu  于2024年4月7日周日 15:55写道:
> >
> >> Hi, Lorenzo
> >>
> >> Thank you for your insightful input.
> >>
> >>>>> I think the 2 above twisted the materialized view concept to more
> than
> >> just an optimization for accessing pre-computed aggregates/filters.
> >> I think that concept (at least in my mind) is now adherent to the
> >> semantics of the words themselves ("materialized" and "view") than on
> its
> >> implementations in DBMs, as just a view on raw data that, hopefully, is
> >> constantly updated with fresh results.
> >> That's why I understand Timo's et al. objections.
> >>
> >> Your understanding of Materialized Views is correct. However, in our
> >> scenario, an important feature is the support for Update & Delete
> >> operations, which the current Materialized Views cannot fulfill. As we
> >> discussed with Timo before, if Materialized Views needs to support data
> >> modifications, it would require an extension of new keywords, such as
> >> CREATING xxx (UPDATING) MATERIALIZED VIEW.
> >>
> >>>>> Still, I don't understand why we need another type of special table.
> >> Could you dive deep into the reasons why not simply adding the FRESHNESS
> >> parameter to standard tables?
> >>
> >> Firstly, I need to emphasize that we cannot achieve the design goal of
> >> FLIP through the CREATE TABLE syntax combined with a FRESHNESS
> parameter.
> >> The proposal of this FLIP is to use Dynamic Table + Continuous Query,
> and
> >> combine it with FRESHNESS to realize a streaming-batch unification.
> >> However, CREATE TABLE is merely a metadata operation and cannot
> >> automatically start a backgr

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-07 Thread Ron liu
Hi, Dev

This is a summary letter. After several rounds of discussion, there is a
strong consensus about the FLIP proposal and the issues it aims to address.
The current point of disagreement is the naming of the new concept. I have
summarized the candidates as follows:

1. Derived Table (inspired by Google Looker)
- Pros: Google Looker has introduced this concept, designed
for building Looker's automated modeling, which aligns with our purpose for the
stream-batch automatic pipeline.

- Cons: The SQL standard uses the term "derived table" extensively; vendors
adopt it simply to refer to a table within a subclause.

2. Materialized Table: it means materializing the query result into a table,
similar to Db2 MQT (Materialized Query Tables; see the sketch after the
references below). In addition, Snowflake Dynamic Table's predecessor was also
called Materialized Table.

3. Updating Table (From Timo)

4. Updating Materialized View (From Timo)

5. Refresh/Live Table (From Martijn)

As Martijn said, naming is a headache, looking forward to more valuable
input from everyone.

[1]
https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
[2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
[3]
https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables
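
To make candidate 2 concrete: if I recall Db2's MQT syntax correctly, a
materialized query table is declared roughly like this (a sketch from memory;
table and column names are made up):

    CREATE TABLE region_sales AS
      (SELECT region, SUM(amount) AS total_amount
       FROM sales
       GROUP BY region)
      DATA INITIALLY DEFERRED REFRESH DEFERRED;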

Best,
Ron

Ron liu  于2024年4月7日周日 15:55写道:

> Hi, Lorenzo
>
> Thank you for your insightful input.
>
> >>> I think the 2 above twisted the materialized view concept to more than
> just an optimization for accessing pre-computed aggregates/filters.
> I think that concept (at least in my mind) is now adherent to the
> semantics of the words themselves ("materialized" and "view") than on its
> implementations in DBMs, as just a view on raw data that, hopefully, is
> constantly updated with fresh results.
> That's why I understand Timo's et al. objections.
>
> Your understanding of Materialized Views is correct. However, in our
> scenario, an important feature is the support for Update & Delete
> operations, which the current Materialized Views cannot fulfill. As we
> discussed with Timo before, if Materialized Views needs to support data
> modifications, it would require an extension of new keywords, such as
> CREATING xxx (UPDATING) MATERIALIZED VIEW.
>
> >>> Still, I don't understand why we need another type of special table.
> Could you dive deep into the reasons why not simply adding the FRESHNESS
> parameter to standard tables?
>
> Firstly, I need to emphasize that we cannot achieve the design goal of
> FLIP through the CREATE TABLE syntax combined with a FRESHNESS parameter.
> The proposal of this FLIP is to use Dynamic Table + Continuous Query, and
> combine it with FRESHNESS to realize a streaming-batch unification.
> However, CREATE TABLE is merely a metadata operation and cannot
> automatically start a background refresh job. To achieve the design goal of
> FLIP with standard tables, it would require extending the CTAS[1] syntax to
> introduce the FRESHNESS keyword. We considered this design initially, but
> it has following problems:
>
> 1. Distinguishing a table created through CTAS as a standard table or as a
> "special" standard table with an ongoing background refresh job using the
> FRESHNESS keyword is very obscure for users.
> 2. It intrudes on the semantics of the CTAS syntax. Currently, tables
> created using CTAS only add table metadata to the Catalog and do not record
> attributes such as query. There are also no ongoing background refresh
> jobs, and the data writing operation happens only once at table creation.
> 3. For the framework, when we perform a certain kind of ALTER TABLE
> operation, how should the behavior be distinguished between a table created
> with FRESHNESS specified and one created without it? This would also cause
> confusion.
>
> In terms of the design goal of combining Dynamic Table + Continuous Query,
> the FLIP proposal cannot be realized by only extending the current standard
> tables, so a new kind of dynamic table needs to be introduced as a
> first-level concept.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement
>
> Best,
> Ron
>
>  于2024年4月3日周三 22:25写道:
>
>> Hello everybody!
>> Thanks for the FLIP as it looks amazing (and I think the prove is this
>> deep discussion it is provoking :))
>>
>> I have a couple of comments to add to this:
>>
>> Even though I get the reason why you rejected MATERIALIZED VIEW, I still
>> like it a lot, and I would like to provide pointers on how the materialized
>> view concept twisted in last years:
>>
>> • Materialize DB (https://materiali

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-07 Thread Ron liu
BLE [1] which relates with this
> > proposal.
> >
> > I do have concerns about using CREATE DYNAMIC TABLE, specifically about
> > confusing the users who are familiar with Snowflake's approach where you
> > can't change the content via DML statements, while that is something that
> > would work in this proposal. Naming is hard of course, but I would
> probably
> > prefer something like CREATE CONTINUOUS TABLE, CREATE REFRESH TABLE or
> > CREATE LIVE TABLE.
> >
> > Best regards,
> >
> > Martijn
> >
> > [1]
> >
> https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table.html
> >
> > On Wed, Apr 3, 2024 at 5:19 AM Ron liu  wrote:
> >
> > > Hi, dev
> > >
> > > After offline discussion with Becket Qin, Lincoln Lee and Jark Wu, we
> have
> > > improved some parts of the FLIP.
> > >
> > > 1. Add Full Refresh Mode section to clarify the semantics of full
> refresh
> > > mode.
> > > 2. Add Future Improvement section explaining why query statement does
> not
> > > support references to temporary view and possible solutions.
> > > 3. The Future Improvement section explains a possible future solution
> for
> > > dynamic table to support the modification of query statements to meet
> the
> > > common field-level schema evolution requirements of the lakehouse.
> > > 4. The Refresh section emphasizes that the Refresh command and the
> > > background refresh job can be executed in parallel, with no
> restrictions at
> > > the framework level.
> > > 5. Convert RefreshHandler into a plug-in interface to support various
> > > workflow schedulers.
> > >
> > > Best,
> > > Ron
> > >
> > > Ron liu  于2024年4月2日周二 10:28写道:
> > >
> > > > > Hi, Venkata krishnan
> > > > >
> > > > > Thank you for your involvement and suggestions, and hope that the
> design
> > > > > goals of this FLIP will be helpful to your business.
> > > > >
> > > > > > > > >>> 1. In the proposed FLIP, given the example for the
> dynamic table, do
> > > > > the
> > > > > data sources always come from a single lake storage such as Paimon
> or
> > > does
> > > > > the same proposal solve for 2 disparate storage systems like Kafka
> and
> > > > > Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> > > > > Basically the lambda architecture that is mentioned in the FLIP as
> well.
> > > > > I'm wondering if it is possible to switch b/w sources based on the
> > > > > execution mode, for eg: if it is backfill operation, switch to a
> data
> > > lake
> > > > > storage system like Iceberg, otherwise an event streaming system
> like
> > > > > Kafka.
> > > > >
> > > > > Dynamic table is a design abstraction at the framework level and
> is not
> > > > > tied to the physical implementation of the connector. If a
> connector
> > > > > supports a combination of Kafka and lake storage, this works fine.
> > > > >
> > > > > > > > >>> 2. What happens in the context of a bootstrap (batch) +
> nearline
> > > update
> > > > > (streaming) case that are stateful applications? What I mean by
> that is,
> > > > > will the state from the batch application be transferred to the
> nearline
> > > > > application after the bootstrap execution is complete?
> > > > >
> > > > > I think this is another orthogonal thing, something that FLIP-327
> tries
> > > to
> > > > > address, not directly related to Dynamic Table.
> > > > >
> > > > > [1]
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
> > > > >
> > > > > Best,
> > > > > Ron
> > > > >
> > > > > Venkatakrishnan Sowrirajan  于2024年3月30日周六
> 07:06写道:
> > > > >
> > > > > >> Ron and Lincoln,
> > > > > >>
> > > > > >> Great proposal and interesting discussion for adding support
> for dynamic
> > > > > >> tables within Flink.
> > > > > >>
> > > > > >> At LinkedIn, we a

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-02 Thread Ron liu
Hi, dev

After offline discussion with Becket Qin, Lincoln Lee and Jark Wu,  we have
improved some parts of the FLIP.

1. Add Full Refresh Mode section to clarify the semantics of full refresh
mode.
2. Add Future Improvement section explaining why query statement does not
support references to temporary view and possible solutions.
3. The Future Improvement section explains a possible future solution for
dynamic table to support the modification of query statements to meet the
common field-level schema evolution requirements of the lakehouse.
4. The Refresh section emphasizes that the Refresh command and the
background refresh job can be executed in parallel, with no restrictions at
the framework level.
5. Convert RefreshHandler into a plug-in interface to support various
workflow schedulers.

Best,
Ron

Ron liu  于2024年4月2日周二 10:28写道:

> Hi, Venkata krishnan
>
> Thank you for your involvement and suggestions, and hope that the design
> goals of this FLIP will be helpful to your business.
>
> >>> 1. In the proposed FLIP, given the example for the dynamic table, do
> the
> data sources always come from a single lake storage such as Paimon or does
> the same proposal solve for 2 disparate storage systems like Kafka and
> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> Basically the lambda architecture that is mentioned in the FLIP as well.
> I'm wondering if it is possible to switch b/w sources based on the
> execution mode, for eg: if it is backfill operation, switch to a data lake
> storage system like Iceberg, otherwise an event streaming system like
> Kafka.
>
> Dynamic table is a design abstraction at the framework level and is not
> tied to the physical implementation of the connector. If a connector
> supports a combination of Kafka and lake storage, this works fine.
>
> >>> 2. What happens in the context of a bootstrap (batch) + nearline update
> (streaming) case that are stateful applications? What I mean by that is,
> will the state from the batch application be transferred to the nearline
> application after the bootstrap execution is complete?
>
> I think this is another orthogonal thing, something that FLIP-327 tries to
> address, not directly related to Dynamic Table.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data
>
> Best,
> Ron
>
> Venkatakrishnan Sowrirajan  于2024年3月30日周六 07:06写道:
>
>> Ron and Lincoln,
>>
>> Great proposal and interesting discussion for adding support for dynamic
>> tables within Flink.
>>
>> At LinkedIn, we are also trying to solve compute/storage convergence for
>> similar problems discussed as part of this FLIP, specifically periodic
>> backfill, bootstrap + nearline update use cases using single
>> implementation
>> of business logic (single script).
>>
>> Few clarifying questions:
>>
>> 1. In the proposed FLIP, given the example for the dynamic table, do the
>> data sources always come from a single lake storage such as Paimon or does
>> the same proposal solve for 2 disparate storage systems like Kafka and
>> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
>> Basically the lambda architecture that is mentioned in the FLIP as well.
>> I'm wondering if it is possible to switch b/w sources based on the
>> execution mode, for eg: if it is backfill operation, switch to a data lake
>> storage system like Iceberg, otherwise an event streaming system like
>> Kafka.
>> 2. What happens in the context of a bootstrap (batch) + nearline update
>> (streaming) case that are stateful applications? What I mean by that is,
>> will the state from the batch application be transferred to the nearline
>> application after the bootstrap execution is complete?
>>
>> Regards
>> Venkata krishnan
>>
>>
>> On Mon, Mar 25, 2024 at 8:03 PM Ron liu  wrote:
>>
>> > Hi, Timo
>> >
>> > Thanks for your quick response, and your suggestion.
>> >
>> > Yes, this discussion has turned into confirming whether it's a special
>> > table or a special MV.
>> >
>> > 1. The key problem with MVs is that they don't support modification, so
>> I
>> > prefer it to be a special table. Although the periodic refresh behavior
>> is
>> > more characteristic of an MV, since we are already a special table,
>> > supporting periodic refresh behavior is quite natural, similar to
>> Snowflake
>> > dynamic tables.
>> >
>> > 2. Regarding the keyword UPDATING, since the current Regular Table is a
>> > Dyn

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-04-01 Thread Ron liu
Hi, Venkata krishnan

Thank you for your involvement and suggestions, and hope that the design
goals of this FLIP will be helpful to your business.

>>> 1. In the proposed FLIP, given the example for the dynamic table, do the
data sources always come from a single lake storage such as Paimon or does
the same proposal solve for 2 disparate storage systems like Kafka and
Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
Basically the lambda architecture that is mentioned in the FLIP as well.
I'm wondering if it is possible to switch b/w sources based on the
execution mode, for eg: if it is backfill operation, switch to a data lake
storage system like Iceberg, otherwise an event streaming system like Kafka.

Dynamic table is a design abstraction at the framework level and is not
tied to the physical implementation of the connector. If a connector
supports a combination of Kafka and lake storage, this works fine.

>>> 2. What happens in the context of a bootstrap (batch) + nearline update
(streaming) case that are stateful applications? What I mean by that is,
will the state from the batch application be transferred to the nearline
application after the bootstrap execution is complete?

I think this is another orthogonal thing, something that FLIP-327 tries to
address, not directly related to Dynamic Table.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data

Best,
Ron

Venkatakrishnan Sowrirajan  于2024年3月30日周六 07:06写道:

> Ron and Lincoln,
>
> Great proposal and interesting discussion for adding support for dynamic
> tables within Flink.
>
> At LinkedIn, we are also trying to solve compute/storage convergence for
> similar problems discussed as part of this FLIP, specifically periodic
> backfill, bootstrap + nearline update use cases using single implementation
> of business logic (single script).
>
> Few clarifying questions:
>
> 1. In the proposed FLIP, given the example for the dynamic table, do the
> data sources always come from a single lake storage such as Paimon or does
> the same proposal solve for 2 disparate storage systems like Kafka and
> Iceberg where Kafka events are ETLed to Iceberg similar to Paimon?
> Basically the lambda architecture that is mentioned in the FLIP as well.
> I'm wondering if it is possible to switch b/w sources based on the
> execution mode, for eg: if it is backfill operation, switch to a data lake
> storage system like Iceberg, otherwise an event streaming system like
> Kafka.
> 2. What happens in the context of a bootstrap (batch) + nearline update
> (streaming) case that are stateful applications? What I mean by that is,
> will the state from the batch application be transferred to the nearline
> application after the bootstrap execution is complete?
>
> Regards
> Venkata krishnan
>
>
> On Mon, Mar 25, 2024 at 8:03 PM Ron liu  wrote:
>
> > Hi, Timo
> >
> > Thanks for your quick response, and your suggestion.
> >
> > Yes, this discussion has turned into confirming whether it's a special
> > table or a special MV.
> >
> > 1. The key problem with MVs is that they don't support modification, so I
> > prefer it to be a special table. Although the periodic refresh behavior
> is
> > more characteristic of an MV, since we are already a special table,
> > supporting periodic refresh behavior is quite natural, similar to
> Snowflake
> > dynamic tables.
> >
> > 2. Regarding the keyword UPDATING, since the current Regular Table is a
> > Dynamic Table, which implies support for updating through Continuous
> Query,
> > I think it is redundant to add the keyword UPDATING. In addition,
> UPDATING
> > can not reflect the Continuous Query part, can not express the purpose we
> > want to simplify the data pipeline through Dynamic Table + Continuous
> > Query.
> >
> > 3. From the perspective of the SQL standard definition, I can understand
> > your concerns about Derived Table, but is it possible to make a slight
> > adjustment to meet our needs? Additionally, as Lincoln mentioned, the
> > Google Looker platform has introduced Persistent Derived Table, and there
> > are precedents in the industry; could Derived Table be a candidate?
> >
> > Of course, look forward to your better suggestions.
> >
> > Best,
> > Ron
> >
> >
> >
> > Timo Walther  于2024年3月25日周一 18:49写道:
> >
> > > After thinking about this more, this discussion boils down to whether
> > > this is a special table or a special materialized view. In both cases,
> > > we would need to add a special keyword:
> > >
> > > Either
> >

Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top Level Project

2024-04-01 Thread Ron liu
Congratulations!

Best,
Ron

Jeyhun Karimov  于2024年4月1日周一 18:12写道:

> Congratulations!
>
> Regards,
> Jeyhun
>
> On Mon, Apr 1, 2024 at 7:43 AM Guowei Ma  wrote:
>
> > Congratulations!
> > Best,
> > Guowei
> >
> >
> > On Mon, Apr 1, 2024 at 11:15 AM Feng Jin  wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Feng Jin
> > >
> > > On Mon, Apr 1, 2024 at 10:51 AM weijie guo 
> > > wrote:
> > >
> > >> Congratulations!
> > >>
> > >> Best regards,
> > >>
> > >> Weijie
> > >>
> > >>
> > >> Hang Ruan  于2024年4月1日周一 09:49写道:
> > >>
> > >> > Congratulations!
> > >> >
> > >> > Best,
> > >> > Hang
> > >> >
> > >> > Lincoln Lee  于2024年3月31日周日 00:10写道:
> > >> >
> > >> > > Congratulations!
> > >> > >
> > >> > > Best,
> > >> > > Lincoln Lee
> > >> > >
> > >> > >
> > >> > > Jark Wu  于2024年3月30日周六 22:13写道:
> > >> > >
> > >> > > > Congratulations!
> > >> > > >
> > >> > > > Best,
> > >> > > > Jark
> > >> > > >
> > >> > > > On Fri, 29 Mar 2024 at 12:08, Yun Tang 
> wrote:
> > >> > > >
> > >> > > > > Congratulations to all Paimon guys!
> > >> > > > >
> > >> > > > > Glad to see a Flink sub-project has been graduated to an
> Apache
> > >> > > top-level
> > >> > > > > project.
> > >> > > > >
> > >> > > > > Best
> > >> > > > > Yun Tang
> > >> > > > >
> > >> > > > > 
> > >> > > > > From: Hangxiang Yu 
> > >> > > > > Sent: Friday, March 29, 2024 10:32
> > >> > > > > To: dev@flink.apache.org 
> > >> > > > > Subject: Re: Re: [ANNOUNCE] Apache Paimon is graduated to Top
> > >> Level
> > >> > > > Project
> > >> > > > >
> > >> > > > > Congratulations!
> > >> > > > >
> > >> > > > > On Fri, Mar 29, 2024 at 10:27 AM Benchao Li <
> > libenc...@apache.org
> > >> >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > > > Congratulations!
> > >> > > > > >
> > >> > > > > > Zakelly Lan  于2024年3月29日周五 10:25写道:
> > >> > > > > > >
> > >> > > > > > > Congratulations!
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Zakelly
> > >> > > > > > >
> > >> > > > > > > On Thu, Mar 28, 2024 at 10:13 PM Jing Ge
> > >> > >  > >> > > > >
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Congrats!
> > >> > > > > > > >
> > >> > > > > > > > Best regards,
> > >> > > > > > > > Jing
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Mar 28, 2024 at 1:27 PM Feifan Wang <
> > >> > zoltar9...@163.com>
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Congratulations!——
> > >> > > > > > > > >
> > >> > > > > > > > > Best regards,
> > >> > > > > > > > >
> > >> > > > > > > > > Feifan Wang
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > At 2024-03-28 20:02:43, "Yanfei Lei" <
> > fredia...@gmail.com
> > >> >
> > >> > > > wrote:
> > >> > > > > > > > > >Congratulations!
> > >> > > > > > > > > >
> > >> > > > > > > > > >Best,
> > >> > > > > > > > > >Yanfei
> > >> > > > > > > > > >
> > >> > > > > > > > > >Zhanghao Chen 
> > 于2024年3月28日周四
> > >> > > > 19:59写道:
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Congratulations!
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Best,
> > >> > > > > > > > > >> Zhanghao Chen
> > >> > > > > > > > > >> 
> > >> > > > > > > > > >> From: Yu Li 
> > >> > > > > > > > > >> Sent: Thursday, March 28, 2024 15:55
> > >> > > > > > > > > >> To: d...@paimon.apache.org 
> > >> > > > > > > > > >> Cc: dev ; user <
> > >> > u...@flink.apache.org
> > >> > > >
> > >> > > > > > > > > >> Subject: Re: [ANNOUNCE] Apache Paimon is graduated
> to
> > >> Top
> > >> > > > Level
> > >> > > > > > > > Project
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> CC the Flink user and dev mailing list.
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Paimon originated within the Flink community,
> > initially
> > >> > > known
> > >> > > > as
> > >> > > > > > Flink
> > >> > > > > > > > > >> Table Store, and all our incubating mentors are
> > >> members of
> > >> > > the
> > >> > > > > > Flink
> > >> > > > > > > > > >> Project Management Committee. I am confident that
> the
> > >> > bonds
> > >> > > of
> > >> > > > > > > > > >> enduring friendship and close collaboration will
> > >> continue
> > >> > to
> > >> > > > > > unite the
> > >> > > > > > > > > >> two communities.
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> And congratulations all!
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> Best Regards,
> > >> > > > > > > > > >> Yu
> > >> > > > > > > > > >>
> > >> > > > > > > > > >> On Wed, 27 Mar 2024 at 20:35, Guojun Li <
> > >> > > > > gjli.schna...@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > Congratulations!
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > Best,
> > >> > > > > > > > > >> > Guojun
> > >> > > > > > > > > >> >
> > >> > > > > > > > > >> > On Wed, Mar 27, 2024 at 5:24 PM wulin <
> > >> > > ouyangwu...@163.com>
> > >> > > > > > wrote:
> > >> > > > > > > > > >> >
> > >> > > > > > > 

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-25 Thread Ron liu
Hi, Timo

Thanks for your quick response, and your suggestion.

Yes, this discussion has turned into confirming whether it's a special
table or a special MV.

1. The key problem with MVs is that they don't support modification, so I
prefer it to be a special table. Although the periodic refresh behavior is
more characteristic of an MV, since it is already a special table,
supporting periodic refresh behavior is quite natural, similar to Snowflake
dynamic tables.

2. Regarding the keyword UPDATING: since the current regular table is already a
Dynamic Table, which implies support for updating through a Continuous Query,
I think adding the keyword UPDATING is redundant. In addition, UPDATING does
not reflect the Continuous Query part, so it cannot express the purpose we are
after: simplifying the data pipeline through Dynamic Table + Continuous Query.

3. From the perspective of the SQL standard definition, I can understand
your concerns about Derived Table, but is it possible to make a slight
adjustment to meet our needs? Additionally, as Lincoln mentioned, the
Google Looker platform has introduced Persistent Derived Table, so there
are precedents in the industry; could Derived Table be a candidate (see the
sketch below)?
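
To make the comparison concrete, the candidate shapes under discussion would
look roughly like this (illustrative only, not the FLIP's final syntax; table
names and the freshness value are made up):

    -- Timo's suggestion: a keyword on the table itself
    CREATE UPDATING TABLE dws_orders ...

    -- the Derived Table candidate, combined with a freshness definition
    CREATE DERIVED TABLE dws_orders
      FRESHNESS = INTERVAL '3' MINUTE
      AS SELECT o.order_id, o.user_id, o.amount
         FROM dwd_orders AS o;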

Of course, I look forward to your better suggestions.

Best,
Ron



Timo Walther  于2024年3月25日周一 18:49写道:

> After thinking about this more, this discussion boils down to whether
> this is a special table or a special materialized view. In both cases,
> we would need to add a special keyword:
>
> Either
>
> CREATE UPDATING TABLE
>
> or
>
> CREATE UPDATING MATERIALIZED VIEW
>
> I still feel that the periodic refreshing behavior is closer to a MV. If
> we add a special keyword to MV, the optimizer would know that the data
> cannot be used for query optimizations.
>
> I will ask more people for their opinion.
>
> Regards,
> Timo
>
>
> On 25.03.24 10:45, Timo Walther wrote:
> > Hi Ron and Lincoln,
> >
> > thanks for the quick response and the very insightful discussion.
> >
> >  > we might limit future opportunities to optimize queries
> >  > through automatic materialization rewriting by allowing data
> >  > modifications, thus losing the potential for such optimizations.
> >
> > This argument makes a lot of sense to me. Due to the updates, the system
> > is not in full control of the persisted data. However, the system is
> > still in full control of the job that powers the refresh. So if the
> > system manages all updating pipelines, it could still leverage automatic
> > materialization rewriting but without leveraging the data at rest (only
> > the data in flight).
> >
> >  > we are considering another candidate, Derived Table, the term 'derive'
> >  > suggests a query, and 'table' retains modifiability. This approach
> >  > would not disrupt our current concept of a dynamic table
> >
> > I did some research on this term. The SQL standard uses the term
> > "derived table" extensively (defined in section 4.17.3). Thus, a lot of
> > vendors adopt this for simply referring to a table within a subclause:
> >
> > https://dev.mysql.com/doc/refman/8.0/en/derived-tables.html
> >
> >
> https://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc32300.1600/doc/html/san1390612291252.html
> >
> >
> https://www.c-sharpcorner.com/article/derived-tables-vs-common-table-expressions/
> >
> >
> https://stackoverflow.com/questions/26529804/what-are-the-derived-tables-in-my-explain-statement
> >
> > https://www.sqlservercentral.com/articles/sql-derived-tables
> >
> > Esp. the latter example is interesting, SQL Server allows things like
> > this on derived tables:
> >
> > UPDATE T SET Name='Timo' FROM (SELECT * FROM Product) AS T
> >
> > SELECT * FROM Product;
> >
> > Btw also Snowflake's dynamic table state:
> >
> >  > Because the content of a dynamic table is fully determined
> >  > by the given query, the content cannot be changed by using DML.
> >  > You don’t insert, update, or delete the rows in a dynamic table.
> >
> > So a new term makes a lot of sense.
> >
> > How about using `UPDATING`?
> >
> > CREATE UPDATING TABLE
> >
> > This reflects that modifications can be made and from an
> > English-language perspective you can PAUSE or RESUME the UPDATING.
> > Thus, a user can define UPDATING interval and mode?
> >
> > Looking forward to your thoughts.
> >
> > Regards,
> > Timo
> >
> >
> > On 25.03.24 07:09, Ron liu wrote:
> >> Hi, Ahmed
> >>
> >> Thanks for your feedback.
> >>
> >> Regarding your question:
&g

Re: Support minibatch for TopNFunction

2024-03-25 Thread Ron liu
Hi, Roman

Thanks for your proposal. Intuitively, this optimization should be very useful
for reducing the message amplification of TopN operators. After briefly looking
at your Google doc, I have the following questions:

1. Could you describe in detail how the record amplification of the TopN
operator is reduced, similar to Minibatch Join[1]? From the figure in the
current Motivation part I cannot understand how it works. (See also the
example query sketched below.)
2. TopN currently has multiple implementation functions, including
AppendOnlyFirstNFunction, AppendOnlyTopNFunction, FastTop1Function,
RetractableTopNFunction, and UpdatableTopNFunction. Could you elaborate on
which of these patterns the minibatch optimization applies to?
3. Is it possible to provide the PoC code?
4. Finally, we need a formal FLIP document on the wiki[2].

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#minibatch-regular-joins
[2]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
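
For reference, a minimal sketch of the kind of rank-emitting Top-N query where
this amplification shows up (table and column names are purely hypothetical):

  -- Hypothetical Top-100 query that also emits the rank: a single input row
  -- can shift up to 100 ranks, producing up to 100 -U and 100 +U records.
  SELECT category, product_id, sales, row_num
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) AS row_num
    FROM product_sales
  )
  WHERE row_num <= 100;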

Best,
Ron

Roman Boyko  于2024年3月24日周日 01:14写道:

> Hi Flink Community,
>
> I tried to describe my idea about minibatch for TopNFunction in this doc -
>
> https://docs.google.com/document/d/1YPHwxKfiGSUOUOa6bc68fIJHO_UojTwZEC29VVEa-Uk/edit?usp=sharing
>
> Looking forward to your feedback, thank you
>
> On Tue, 19 Mar 2024 at 12:24, Roman Boyko  wrote:
>
> > Hello Flink Community,
> >
> > The same problem with record amplification as described in FLIP-415:
> Introduce
> > a new join operator to support minibatch[1] exists for most of
> > implementations of AbstractTopNFunction. Especially when the rank is
> > provided to output. For example, when calculating Top100 with rank
> output,
> > every input record might produce 100 -U records and 100 +U records.
> >
> > According to my POC (which is similar to FLIP-415) the record
> > amplification could be significantly reduced by using input or output
> > buffer.
> >
> > What do you think if we implement such optimization for TopNFunctions?
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-415
> > %3A+Introduce+a+new+join+operator+to+support+minibatch
> >
> > --
> > Best regards,
> > Roman Boyko
> > e.: ro.v.bo...@gmail.com
> > m.: +79059592443
> > telegram: @rboyko
> >
>
>
> --
> Best regards,
> Roman Boyko
> e.: ro.v.bo...@gmail.com
> m.: +79059592443
> telegram: @rboyko
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-25 Thread Ron liu
Hi, Ahmed

Thanks for your feedback.

Regarding your question:

> I want to iterate on Timo's comments regarding the confusion between
"Dynamic Table" and current Flink "Table". Should the refactoring of the
system happen in 2.0, should we rename it in this Flip ( as the suggestions
in the thread ) and address the holistic changes in a separate Flip for 2.0?

Lincoln proposed a new concept in his reply to Timo: Derived Table, which is a
combination of Dynamic Table + Continuous Query. The use of Derived Table
would not conflict with existing concepts. What do you think?

> I feel confused about how it fits together with other components; the
examples provided feel like a standalone ETL job. Could you provide in the
FLIP an example where the table is further used in subsequent queries
(especially in batch mode)?

Thanks for your suggestion. I have added an example of how to use Dynamic Table
to the FLIP's user story section: a Dynamic Table can be referenced by
downstream Dynamic Tables and can also serve OLAP queries.
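
As a rough illustration only (hypothetical table names, and not necessarily the
exact syntax from the FLIP), the idea is that one Dynamic Table can be defined
on top of another and can also be queried ad hoc:

  -- a downstream Dynamic Table built on top of another Dynamic Table
  CREATE DYNAMIC TABLE dws_category_sales
  FRESHNESS = INTERVAL '10' MINUTE
  AS SELECT category, SUM(amount) AS total_amount
     FROM dwd_orders
     GROUP BY category;

  -- an ad-hoc OLAP query against the same Dynamic Table
  SELECT * FROM dws_category_sales WHERE category = 'books';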

Best,
Ron

Ron liu  于2024年3月23日周六 10:35写道:

> Hi, Feng
>
> Thanks for your feedback.
>
> > Although currently we restrict users from modifying the query, I wonder
> if
> we can provide a better way to help users rebuild it without affecting
> downstream OLAP queries.
>
> Considering the problem of data consistency, in the first step we strictly
> limit the semantics and do not support modifying the query. This is really a
> good question; one of my ideas is to introduce a syntax similar to SWAP [1],
> which supports exchanging two Dynamic Tables.
>
> > From the documentation, the definitions SQL and job information are
> stored in the Catalog. Does this mean that if a system needs to adapt to
> Dynamic Tables, it also needs to store Flink's job information in the
> corresponding system?
> For example, does MySQL's Catalog need to store flink job information as
> well?
>
> Yes, currently we need to rely on Catalog to store refresh job information.
>
> > Users still need to consider how much memory is being used, how large
> the concurrency is, which type of state backend is being used, and may need
> to set TTL expiration.
>
> Similar to the current practice, job parameters can be set via the Flink
> conf or SET commands.
>
> > When we submit a refresh command, can we help users detect if there are
> any
> running jobs and automatically stop them before executing the refresh
> command? Then wait for it to complete before restarting the background
> streaming job?
>
> Purely from a technical implementation point of view, your proposal is
> doable, but it would be more costly. Also, I think data consistency itself
> is the responsibility of the user, just as it is for a Regular Table today,
> so the behavior stays consistent and no additional guarantees are made at
> the engine level.
>
> Best,
> Ron
>
>
> Ahmed Hamdy  于2024年3月22日周五 23:50写道:
>
>> Hi Ron,
>> Sorry for joining the discussion late, thanks for the effort.
>>
>> I think the base idea is great, however I have a couple of comments:
>> - I want to iterate on Timo's comments regarding the confusion between
>> "Dynamic Table" and current Flink "Table". Should the refactoring of the
>> system happen in 2.0, should we rename it in this Flip ( as the
>> suggestions
>> in the thread ) and address the holistic changes in a separate Flip for
>> 2.0?
>> - I feel confused about how it fits together with other components; the
>> examples provided feel like a standalone ETL job. Could you provide in the
>> FLIP an example where the table is further used in subsequent queries
>> (especially in batch mode)?
>> - I really like the standard of keeping the unified batch and streaming
>> approach
>> Best Regards
>> Ahmed Hamdy
>>
>>
>> On Fri, 22 Mar 2024 at 12:07, Lincoln Lee  wrote:
>>
>> > Hi Timo,
>> >
>> > Thanks for your thoughtful inputs!
>> >
>> > Yes, expanding the MATERIALIZED VIEW(MV) could achieve the same
>> function,
>> > but our primary concern is that by using a view, we might limit future
>> > opportunities
>> > to optimize queries through automatic materialization rewriting [1],
>> > leveraging
>> > the support for MV by physical storage. This is because we would be
>> > breaking
>> > the intuitive semantics of a materialized view (a materialized view
>> > represents
>> > the result of a query) by allowing data modifications, thus losing the
>> > potential
>> > for such optimizations.
>> >
>> > With these considerations in mind, we were i

Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-22 Thread Ron liu
> > > >>> the catalog or somewhere else.
> > > >>>
> > > >>>
> > > >>> [1]
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-134%3A+Batch+execution+for+the+DataStream+API
> > > >>> [2] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
> > > >>> [3]
> https://docs.snowflake.com/en/user-guide/dynamic-tables-refresh
> > > >>>
> > > >>> Best
> > > >>> Yun Tang
> > > >>>
> > > >>>
> > > >>> 
> > > >>> From: Lincoln Lee 
> > > >>> Sent: Thursday, March 14, 2024 14:35
> > > >>> To: dev@flink.apache.org 
> > > >>> Subject: Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for
> > > >>> Simplifying Data Pipelines
> > > >>>
> > > >>> Hi Jing,
> > > >>>
> > > >>> Thanks for your attention to this flip! I'll try to answer the
> > > following
> > > >>> questions.
> > > >>>
> > > >>>> 1. How to define query of dynamic table?
> > > >>>> Use flink sql or introducing new syntax?
> > > >>>> If use flink sql, how to handle the difference in SQL between
> > > streaming
> > > >>> and
> > > >>>> batch processing?
> > > >>>> For example, a query including window aggregate based on
> processing
> > > >> time?
> > > >>>> or a query including global order by?
> > > >>>
> > > >>> Similar to `CREATE TABLE AS query`, here the `query` also uses
> Flink
> > > sql
> > > >>> and
> > > >>>
> > > >>> doesn't introduce a totally new syntax.
> > > >>> We will not change the status respect to
> > > >>>
> > > >>> the difference in functionality of flink sql itself on streaming
> and
> > > >>> batch, for example,
> > > >>>
> > > >>> the proctime window agg on streaming and global sort on batch that
> > you
> > > >>> mentioned,
> > > >>>
> > > >>> in fact, do not work properly in the
> > > >>> other mode, so when the user modifies the
> > > >>>
> > > >>> refresh mode of a dynamic table that is not supported, we will
> throw
> > an
> > > >>> exception.
> > > >>>
> > > >>>> 2. Whether modify the query of dynamic table is allowed?
> > > >>>> Or we could only refresh a dynamic table based on the initial
> query?
> > > >>>
> > > >>> Yes, in the current design, the query definition of the
> > > >>> dynamic table is not allowed
> > > >>>
> > > >>>   to be modified, and you can only refresh the data based on the
> > > >>> initial definition.
> > > >>>
> > > >>>> 3. How to use dynamic table?
> > > >>>> The dynamic table seems to be similar to the materialized view.
> > Will
> > > >> we
> > > >>> do
> > > >>>> something like materialized view rewriting during the
> optimization?
> > > >>>
> > > >>> It's true that dynamic table and materialized view
> > > >>> are similar in some ways, but as Ron
> > > >>>
> > > >>> explains
> > > >>> there are differences. In terms of optimization, automated
> > > >>> materialization discovery
> > > >>>
> > > >>> similar to that supported by calcite is also a potential
> possibility,
> > > >>> perhaps with the
> > > >>>
> > > >>> addition of automated rewriting in the future.
> > > >>>
> > > >>>
> > > >>>
> > > >>> Best,
> > > >>> Lincoln Lee
> > > >>>
> > > >>>
> > > >>> Ron liu  于2024年3月14日周四 14:01写道:
> > > >>>
> > > >>>> Hi, Timo
> > > >>>>
> > > >>>> Sorry for later response,  thanks for your feedback.

Re: [ANNOUNCE] Donation Flink CDC into Apache Flink has Completed

2024-03-20 Thread Ron liu
Congratulations!

Best,
Ron

Jark Wu  于2024年3月21日周四 10:46写道:

> Congratulations and welcome!
>
> Best,
> Jark
>
> On Thu, 21 Mar 2024 at 10:35, Rui Fan <1996fan...@gmail.com> wrote:
>
> > Congratulations!
> >
> > Best,
> > Rui
> >
> > On Thu, Mar 21, 2024 at 10:25 AM Hang Ruan 
> wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Hang
> > >
> > > Lincoln Lee  于2024年3月21日周四 09:54写道:
> > >
> > >>
> > >> Congrats, thanks for the great work!
> > >>
> > >>
> > >> Best,
> > >> Lincoln Lee
> > >>
> > >>
> > >> Peter Huang  于2024年3月20日周三 22:48写道:
> > >>
> > >>> Congratulations
> > >>>
> > >>>
> > >>> Best Regards
> > >>> Peter Huang
> > >>>
> > >>> On Wed, Mar 20, 2024 at 6:56 AM Huajie Wang 
> > wrote:
> > >>>
> > 
> >  Congratulations
> > 
> > 
> > 
> >  Best,
> >  Huajie Wang
> > 
> > 
> > 
> >  Leonard Xu  于2024年3月20日周三 21:36写道:
> > 
> > > Hi devs and users,
> > >
> > > We are thrilled to announce that the donation of Flink CDC as a
> > > sub-project of Apache Flink has completed. We invite you to explore
> > the new
> > > resources available:
> > >
> > > - GitHub Repository: https://github.com/apache/flink-cdc
> > > - Flink CDC Documentation:
> > > https://nightlies.apache.org/flink/flink-cdc-docs-stable
> > >
> > > After Flink community accepted this donation[1], we have completed
> > > software copyright signing, code repo migration, code cleanup,
> > website
> > > migration, CI migration and github issues migration etc.
> > > Here I am particularly grateful to Hang Ruan, Zhongqaing Gong,
> > > Qingsheng Ren, Jiabao Sun, LvYanquan, loserwang1024 and other
> > contributors
> > > for their contributions and help during this process!
> > >
> > >
> > > For all previous contributors: The contribution process has
> slightly
> > > changed to align with the main Flink project. To report bugs or
> > suggest new
> > > features, please open tickets in
> > > Apache Jira (https://issues.apache.org/jira).  Note that we will
> no
> > > longer accept GitHub issues for these purposes.
> > >
> > >
> > > Welcome to explore the new repository and documentation. Your
> > feedback
> > > and contributions are invaluable as we continue to improve Flink
> CDC.
> > >
> > > Thanks everyone for your support and happy exploring Flink CDC!
> > >
> > > Best,
> > > Leonard
> > > [1]
> https://lists.apache.org/thread/cw29fhsp99243yfo95xrkw82s5s418ob
> > >
> > >
> >
>


Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Ron liu
Congratulations

Best,
Ron

Yanfei Lei  于2024年3月18日周一 20:01写道:

> Congrats, thanks for the great work!
>
> Sergey Nuyanzin  于2024年3月18日周一 19:30写道:
> >
> > Congratulations, thanks release managers and everyone involved for the
> great work!
> >
> > On Mon, Mar 18, 2024 at 12:15 PM Benchao Li 
> wrote:
> >>
> >> Congratulations! And thanks to all release managers and everyone
> >> involved in this release!
> >>
> >> Yubin Li  于2024年3月18日周一 18:11写道:
> >> >
> >> > Congratulations!
> >> >
> >> > Thanks to release managers and everyone involved.
> >> >
> >> > On Mon, Mar 18, 2024 at 5:55 PM Hangxiang Yu 
> wrote:
> >> > >
> >> > > Congratulations!
> >> > > Thanks release managers and all involved!
> >> > >
> >> > > On Mon, Mar 18, 2024 at 5:23 PM Hang Ruan 
> wrote:
> >> > >
> >> > > > Congratulations!
> >> > > >
> >> > > > Best,
> >> > > > Hang
> >> > > >
> >> > > > Paul Lam  于2024年3月18日周一 17:18写道:
> >> > > >
> >> > > > > Congrats! Thanks to everyone involved!
> >> > > > >
> >> > > > > Best,
> >> > > > > Paul Lam
> >> > > > >
> >> > > > > > 2024年3月18日 16:37,Samrat Deb  写道:
> >> > > > > >
> >> > > > > > Congratulations !
> >> > > > > >
> >> > > > > > On Mon, 18 Mar 2024 at 2:07 PM, Jingsong Li <
> jingsongl...@gmail.com>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > >> Congratulations!
> >> > > > > >>
> >> > > > > >> On Mon, Mar 18, 2024 at 4:30 PM Rui Fan <
> 1996fan...@gmail.com> wrote:
> >> > > > > >>>
> >> > > > > >>> Congratulations, thanks for the great work!
> >> > > > > >>>
> >> > > > > >>> Best,
> >> > > > > >>> Rui
> >> > > > > >>>
> >> > > > > >>> On Mon, Mar 18, 2024 at 4:26 PM Lincoln Lee <
> lincoln.8...@gmail.com>
> >> > > > > >> wrote:
> >> > > > > 
> >> > > > >  The Apache Flink community is very happy to announce the
> release of
> >> > > > >> Apache Flink 1.19.0, which is the first release for the
> Apache Flink
> >> > > > > 1.19
> >> > > > > >> series.
> >> > > > > 
> >> > > > >  Apache Flink® is an open-source stream processing
> framework for
> >> > > > > >> distributed, high-performing, always-available, and accurate
> data
> >> > > > > streaming
> >> > > > > >> applications.
> >> > > > > 
> >> > > > >  The release is available for download at:
> >> > > > >  https://flink.apache.org/downloads.html
> >> > > > > 
> >> > > > >  Please check out the release blog post for an overview of
> the
> >> > > > > >> improvements for this release:
> >> > > > > 
> >> > > > > >>
> >> > > > >
> >> > > >
> https://flink.apache.org/2024/03/18/announcing-the-release-of-apache-flink-1.19/
> >> > > > > 
> >> > > > >  The full release notes are available in Jira:
> >> > > > > 
> >> > > > > >>
> >> > > > >
> >> > > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> >> > > > > 
> >> > > > >  We would like to thank all contributors of the Apache Flink
> >> > > > community
> >> > > > > >> who made this release possible!
> >> > > > > 
> >> > > > > 
> >> > > > >  Best,
> >> > > > >  Yun, Jing, Martijn and Lincoln
> >> > > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best,
> >> > > Hangxiang.
> >>
> >>
> >>
> >> --
> >>
> >> Best,
> >> Benchao Li
> >
> >
> >
> > --
> > Best regards,
> > Sergey
>
>
>
> --
> Best,
> Yanfei
>


Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-14 Thread Ron liu
some other questions.
> 1. How to define query of dynamic table?
> Use flink sql or introducing new syntax?
> If use flink sql, how to handle the difference in SQL between streaming and
> batch processing?
> For example, a query including window aggregate based on processing time?
> or a query including global order by?
>
> 2. Whether modify the query of dynamic table is allowed?
> Or we could only refresh a dynamic table based on initial query?
>
> 3. How to use dynamic table?
> The dynamic table seems to be similar with materialized view.  Will we do
> something like materialized view rewriting during the optimization?
>
> Best,
> Jing Zhang
>
>
> Timo Walther  于2024年3月13日周三 01:24写道:
>
> > Hi Lincoln & Ron,
> >
> > thanks for proposing this FLIP. I think a design similar to what you
> > propose has been in the heads of many people, however, I'm wondering how
> > this will fit into the bigger picture.
> >
> > I haven't deeply reviewed the FLIP yet, but would like to ask some
> > initial questions:
> >
> > Flink introduced the concept of Dynamic Tables many years ago. How
> > does the term "Dynamic Table" fit into Flink's regular tables and also
> > how does it relate to Table API?
> >
> > I fear that adding the DYNAMIC TABLE keyword could cause confusion for
> > users, because a term for regular CREATE TABLE (that can be "kind of
> > dynamic" as well and is backed by a changelog) is then missing. Also
> > given that we call our connectors for those tables, DynamicTableSource
> > and DynamicTableSink.
> >
> > In general, I find it contradicting that a TABLE can be "paused" or
> > "resumed". From an English language perspective, this does sound
> > incorrect. In my opinion (without much research yet), a continuous
> > updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW
> > (which users are familiar with?) or a new concept such as a CREATE TASK
> > (that can be paused and resumed?).
> >
> > How do you envision re-adding the functionality of a statement set, that
> > fans out to multiple tables? This is a very important use case for data
> > pipelines.
> >
> > Since the early days of Flink SQL, we were discussing `SELECT STREAM *
> > FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT,
> > into other keywords DYNAMIC TABLE and FRESHNESS. But the core
> > functionality is still there. I'm wondering if we should widen the scope
> > (maybe not part of this FLIP but a new FLIP) to follow the standard more
> > closely. Making `SELECT * FROM t` bounded by default and use new syntax
> > for the dynamic behavior. Flink 2.0 would be the perfect time for this,
> > however, it would require careful discussions. What do you think?
> >
> > Regards,
> > Timo
> >
> >
> > On 11.03.24 08:23, Ron liu wrote:
> > > Hi, Dev
> > >
> > >
> > > Lincoln Lee and I would like to start a discussion about FLIP-435:
> > > Introduce a  New Dynamic Table for Simplifying Data Pipelines.
> > >
> > >
> > > This FLIP is designed to simplify the development of data processing
> > > pipelines. With Dynamic Tables with uniform SQL statements and
> > > freshness, users can define batch and streaming transformations to
> > > data in the same way, accelerate ETL pipeline development, and manage
> > > task scheduling automatically.
> > >
> > >
> > > For more details, see FLIP-435 [1]. Looking forward to your feedback.
> > >
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Dynamic+Table+for+Simplifying+Data+Pipelines
> > >
> > > Best,
> > >
> > > Lincoln & Ron
> > >
> >
> >
>


Re: [VOTE] Release 1.19.0, release candidate #2

2024-03-11 Thread Ron liu
+1 (non binding)

quickly verified:
- verified that source distribution does not contain binaries
- verified checksums
- built source code successfully


Best,
Ron

Jeyhun Karimov  于2024年3月12日周二 01:00写道:

> +1 (non binding)
>
> - verified that source distribution does not contain binaries
> - verified signatures and checksums
> - built source code successfully
>
> Regards,
> Jeyhun
>
>
> On Mon, Mar 11, 2024 at 3:08 PM Samrat Deb  wrote:
>
> > +1 (non binding)
> >
> > - verified signatures and checksums
> > - ASF headers are present in all expected file
> > - No unexpected binaries files found in the source
> > - Build successful locally
> > - tested basic word count example
> >
> >
> >
> >
> > Bests,
> > Samrat
> >
> > On Mon, 11 Mar 2024 at 7:33 PM, Ahmed Hamdy 
> wrote:
> >
> > > Hi Lincoln
> > > +1 (non-binding) from me
> > >
> > > - Verified Checksums & Signatures
> > > - Verified Source dists don't contain binaries
> > > - Built source successfully
> > > - reviewed web PR
> > >
> > >
> > > Best Regards
> > > Ahmed Hamdy
> > >
> > >
> > > On Mon, 11 Mar 2024 at 15:18, Lincoln Lee 
> > wrote:
> > >
> > > > Hi Robin,
> > > >
> > > > Thanks for helping verify the release note [1]. FLINK-14879 should not
> > > > have been included; after confirming this, I moved all unresolved
> > > > non-blocker issues left over from 1.19.0 to 1.20.0 and reconfigured the
> > > > release note [1].
> > > >
> > > > Best,
> > > > Lincoln Lee
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > >
> > > >
> > > > Robin Moffatt  于2024年3月11日周一 19:36写道:
> > > >
> > > > > Looking at the release notes [1] it lists `DESCRIBE DATABASE`
> > > > (FLINK-14879)
> > > > > and `DESCRIBE CATALOG` (FLINK-14690).
> > > > > When I try these in 1.19 RC2 the behaviour is as in 1.18.1, i.e. it
> > is
> > > > not
> > > > > supported:
> > > > >
> > > > > ```
> > > > > [INFO] Execute statement succeed.
> > > > >
> > > > > Flink SQL> show catalogs;
> > > > > +-----------------+
> > > > > |    catalog name |
> > > > > +-----------------+
> > > > > |           c_new |
> > > > > | default_catalog |
> > > > > +-----------------+
> > > > > 2 rows in set
> > > > >
> > > > > Flink SQL> DESCRIBE CATALOG c_new;
> > > > > [ERROR] Could not execute SQL statement. Reason:
> > > > > org.apache.calcite.sql.validate.SqlValidatorException: Column
> 'c_new'
> > > not
> > > > > found in any table
> > > > >
> > > > > Flink SQL> show databases;
> > > > > +------------------+
> > > > > |    database name |
> > > > > +------------------+
> > > > > | default_database |
> > > > > +------------------+
> > > > > 1 row in set
> > > > >
> > > > > Flink SQL> DESCRIBE DATABASE default_database;
> > > > > [ERROR] Could not execute SQL statement. Reason:
> > > > > org.apache.calcite.sql.validate.SqlValidatorException: Column
> > > > > 'default_database' not found in
> > > > > any table
> > > > > ```
> > > > >
> > > > > Is this an error in the release notes, or my mistake in
> interpreting
> > > > them?
> > > > >
> > > > > thanks, Robin.
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > > >
> > > > > On Thu, 7 Mar 2024 at 10:01, Lincoln Lee 
> > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Please review and vote on the release candidate #2 for the
> version
> > > > > 1.19.0,
> > > > > > as follows:
> > > > > > [ ] +1, Approve the release
> > > > > > [ ] -1, Do not approve the release (please provide specific
> > comments)
> > > > > >
> > > > > > The complete staging area is available for your review, which
> > > includes:
> > > > > >
> > > > > > * JIRA release notes [1], and the pull request adding release
> note
> > > for
> > > > > > users [2]
> > > > > > * the official Apache source release and binary convenience
> > releases
> > > to
> > > > > be
> > > > > > deployed to dist.apache.org [3], which are signed with the key
> > with
> > > > > > fingerprint E57D30ABEE75CA06  [4],
> > > > > > * all artifacts to be deployed to the Maven Central Repository
> [5],
> > > > > > * source code tag "release-1.19.0-rc2" [6],
> > > > > > * website pull request listing the new release and adding
> > > announcement
> > > > > blog
> > > > > > post [7].
> > > > > >
> > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > majority
> > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12353282
> > > > > > [2] https://github.com/apache/flink/pull/24394
> > > > > > [3]
> https://dist.apache.org/repos/dist/dev/flink/flink-1.19.0-rc2/
> > > > > > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > > [5]
> > > > >
> > 

[DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-11 Thread Ron liu
Hi, Dev


Lincoln Lee and I would like to start a discussion about FLIP-435:
Introduce a  New Dynamic Table for Simplifying Data Pipelines.


This FLIP is designed to simplify the development of data processing
pipelines. With Dynamic Tables, defined by uniform SQL statements plus a
freshness specification, users can define batch and streaming transformations
over data in the same way, accelerate ETL pipeline development, and have
task scheduling managed automatically.
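
As a purely illustrative sketch of the direction (made-up names; the exact
syntax and semantics are specified in the FLIP itself):

  CREATE DYNAMIC TABLE dwd_orders
  FRESHNESS = INTERVAL '3' MINUTE
  AS SELECT o.order_id, o.user_id, o.amount, u.region
     FROM orders AS o
     JOIN users AS u ON o.user_id = u.id;

The intent is that the freshness expresses how up to date the table's data
should be, and the engine uses it to decide how to refresh the table.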


For more details, see FLIP-435 [1]. Looking forward to your feedback.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Dynamic+Table+for+Simplifying+Data+Pipelines

Best,

Lincoln & Ron


Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun

2024-02-25 Thread Ron liu
Congratulations, Jiabao!

Best,
Ron

Yun Tang  于2024年2月23日周五 19:59写道:

> Congratulations, Jiabao!
>
> Best
> Yun Tang
> 
> From: Weihua Hu 
> Sent: Thursday, February 22, 2024 17:29
> To: dev@flink.apache.org 
> Subject: Re: [ANNOUNCE] New Apache Flink Committer - Jiabao Sun
>
> Congratulations, Jiabao!
>
> Best,
> Weihua
>
>
> On Thu, Feb 22, 2024 at 10:34 AM Jingsong Li 
> wrote:
>
> > Congratulations! Well deserved!
> >
> > On Wed, Feb 21, 2024 at 4:36 PM Yuepeng Pan 
> wrote:
> > >
> > > Congratulations~ :)
> > >
> > > Best,
> > > Yuepeng Pan
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > 在 2024-02-21 09:52:17,"Hongshun Wang"  写道:
> > > >Congratulations, Jiabao :)
> > > >Congratulations Jiabao!
> > > >
> > > >Best,
> > > >Hongshun
> > > >Best regards,
> > > >
> > > >Weijie
> > > >
> > > >On Tue, Feb 20, 2024 at 2:19 PM Runkang He  wrote:
> > > >
> > > >> Congratulations Jiabao!
> > > >>
> > > >> Best,
> > > >> Runkang He
> > > >>
> > > >> Jane Chan  于2024年2月20日周二 14:18写道:
> > > >>
> > > >> > Congrats, Jiabao!
> > > >> >
> > > >> > Best,
> > > >> > Jane
> > > >> >
> > > >> > On Tue, Feb 20, 2024 at 10:32 AM Paul Lam 
> > wrote:
> > > >> >
> > > >> > > Congrats, Jiabao!
> > > >> > >
> > > >> > > Best,
> > > >> > > Paul Lam
> > > >> > >
> > > >> > > > 2024年2月20日 10:29,Zakelly Lan  写道:
> > > >> > > >
> > > >> > > >> Congrats! Jiabao!
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> >
>


Re: Temporal join on rolling aggregate

2024-02-25 Thread Ron liu
+1,
But I think this should be a more general requirement, that is, support for
declaring watermarks directly in a query, so that they can be declared for any
type of source, such as a table or a view (a rough sketch is below). Similar to
what Databricks provides [1], this would need a FLIP.

[1]
https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-watermark.html
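
Purely as an illustration of the idea (hypothetical syntax, not something Flink
supports today; the Databricks docs above show their own variant):

  -- hypothetical: declare a watermark on a view inside the query itself, so
  -- that the rolling-aggregate view can be used in event-time operations such
  -- as a temporal join
  SELECT *
  FROM my_rolling_agg_view
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND;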

Best,
Ron


Re: Re: [VOTE] FLIP-389: Annotate SingleThreadFetcherManager as PublicEvolving

2024-01-19 Thread Ron liu
+1(binding)

Best,
Ron

Xuyang  于2024年1月19日周五 14:00写道:

> +1 (non-binding)
>
>
> --
>
> Best!
> Xuyang
>
>
>
>
>
> 在 2024-01-19 10:16:23,"Qingsheng Ren"  写道:
> >+1 (binding)
> >
> >Thanks for the work, Hongshun!
> >
> >Best,
> >Qingsheng
> >
> >On Tue, Jan 16, 2024 at 11:21 AM Leonard Xu  wrote:
> >
> >> Thanks Hongshun for driving this !
> >>
> >> +1(binding)
> >>
> >> Best,
> >> Leonard
> >>
> >> > 2024年1月3日 下午8:04,Hongshun Wang  写道:
> >> >
> >> > Dear Flink Developers,
> >> >
> >> > Thank you for providing feedback on FLIP-389: Annotate
> >> > SingleThreadFetcherManager as PublicEvolving[1] on the discussion
> >> > thread[2]. The goal of the FLIP is as follows:
> >> >
> >> >   - To expose the SplitFetcherManager / SingleThreadFetcherManager as
> >> >   Public, allowing connector developers to easily create their own
> >> threading
> >> >   models in the SourceReaderBase by implementing addSplits(),
> >> removeSplits(),
> >> >   maybeShutdownFinishedFetchers() and other functions.
> >> >   - To hide the element queue from the connector developers and
> simplify
> >> >   the SourceReaderBase to consist of only SplitFetcherManager and
> >> >   RecordEmitter as major components.
> >> >
> >> >
> >> > Any additional questions regarding this FLIP? Looking forward to
> hearing
> >> > from you.
> >> >
> >> >
> >> > Thanks,
> >> > Hongshun Wang
> >> >
> >> >
> >> > [1]
> >> >
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278465498
> >> >
> >> > [2] https://lists.apache.org/thread/b8f509878ptwl3kmmgg95tv8sb1j5987
> >>
> >>
>