Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

2024-03-13 Thread Jing Zhang
Hi, Lincoln & Ron,

Thanks for the proposal.

I agree with the question raised by Timo.

Besides, I have some other questions.
1. How do we define the query of a dynamic table?
Do we use Flink SQL or introduce new syntax?
If Flink SQL is used, how do we handle the differences in SQL between
streaming and batch processing?
For example, a query including a window aggregate based on processing time,
or a query including a global ORDER BY?

2. Is modifying the query of a dynamic table allowed?
Or can we only refresh a dynamic table based on its initial query?

3. How is a dynamic table used?
A dynamic table seems similar to a materialized view. Will we do
something like materialized-view rewriting during optimization?

Best,
Jing Zhang


On Wed, Mar 13, 2024 at 01:24, Timo Walther wrote:

> Hi Lincoln & Ron,
>
> thanks for proposing this FLIP. I think a design similar to what you
> propose has been in the heads of many people, however, I'm wondering how
> this will fit into the bigger picture.
>
> I haven't deeply reviewed the FLIP yet, but would like to ask some
> initial questions:
>
> Flink has introduced the concept of Dynamic Tables many years ago. How
> does the term "Dynamic Table" fit into Flink's regular tables and also
> how does it relate to Table API?
>
> I fear that adding the DYNAMIC TABLE keyword could cause confusion for
> users, because a term for regular CREATE TABLE (that can be "kind of
> dynamic" as well and is backed by a changelog) is then missing. Also
> given that we call our connectors for those tables, DynamicTableSource
> and DynamicTableSink.
>
> In general, I find it contradicting that a TABLE can be "paused" or
> "resumed". From an English language perspective, this does sound
> incorrect. In my opinion (without much research yet), a continuous
> updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW
> (which users are familiar with?) or a new concept such as a CREATE TASK
> (that can be paused and resumed?).
>
> How do you envision re-adding the functionality of a statement set, that
> fans out to multiple tables? This is a very important use case for data
> pipelines.
>
> Since the early days of Flink SQL, we were discussing `SELECT STREAM *
> FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT,
> into other keywords DYNAMIC TABLE and FRESHNESS. But the core
> functionality is still there. I'm wondering if we should widen the scope
> (maybe not part of this FLIP but a new FLIP) to follow the standard more
> closely. Making `SELECT * FROM t` bounded by default and use new syntax
> for the dynamic behavior. Flink 2.0 would be the perfect time for this,
> however, it would require careful discussions. What do you think?
>
> Regards,
> Timo
>
>
> On 11.03.24 08:23, Ron liu wrote:
> > Hi, Dev
> >
> >
> > Lincoln Lee and I would like to start a discussion about FLIP-435:
> > Introduce a  New Dynamic Table for Simplifying Data Pipelines.
> >
> >
> > This FLIP is designed to simplify the development of data processing
> > pipelines. With Dynamic Tables with uniform SQL statements and
> > freshness, users can define batch and streaming transformations to
> > data in the same way, accelerate ETL pipeline development, and manage
> > task scheduling automatically.
> >
> >
> > For more details, see FLIP-435 [1]. Looking forward to your feedback.
> >
> >
> > [1]
> >
> >
> > Best,
> >
> > Lincoln & Ron
> >
>
>


Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-28 Thread Jing Zhang
Congratulations!

Best,
Jing Zhang

On Tue, Mar 28, 2023 at 22:25, Rui Fan <1996fan...@gmail.com> wrote:

> Congratulations!
>
> Best,
> Rui Fan
>
> On Tue, Mar 28, 2023 at 15:37 Guowei Ma  wrote:
>
> > Congratulations!
> >
> > Best,
> > Guowei
> >
> >
> > On Tue, Mar 28, 2023 at 12:02 PM Yuxin Tan 
> wrote:
> >
> >> Congratulations!
> >>
> >> Best,
> >> Yuxin
> >>
> >>
> >> On Tue, Mar 28, 2023 at 11:06, Guanghui Zhang wrote:
> >>
> >>> Congratulations!
> >>>
> >>> Best,
> >>> Zhang Guanghui
> >>>
> >>> On Tue, Mar 28, 2023 at 10:29, Hang Ruan wrote:
> >>>
> >>> > Congratulations!
> >>> >
> >>> > Best,
> >>> > Hang
> >>> >
> >>> > On Tue, Mar 28, 2023 at 10:27, yu zelin wrote:
> >>> >
> >>> >> Congratulations!
> >>> >>
> >>> >> Best,
> >>> >> Yu Zelin
> >>> >>
> >>> >> On Mar 27, 2023 at 17:23, Yu Li wrote:
> >>> >>
> >>> >> Dear Flinkers,
> >>> >>
> >>> >>
> >>> >>
> >>> >> As you may have noticed, we are pleased to announce that Flink Table
> >>> Store has joined the Apache Incubator as a separate project called
> Apache
> >>> Paimon(incubating) [1] [2] [3]. The new project still aims at building
> a
> >>> streaming data lake platform for high-speed data ingestion, change data
> >>> tracking and efficient real-time analytics, with the vision of
> supporting a
> >>> larger ecosystem and establishing a vibrant and neutral open source
> >>> community.
> >>> >>
> >>> >>
> >>> >>
> >>> >> We would like to thank everyone for their great support and efforts
> >>> for the Flink Table Store project, and warmly welcome everyone to join
> the
> >>> development and activities of the new project. Apache Flink will
> continue
> >>> to be one of the first-class citizens supported by Paimon, and we
> believe
> >>> that the Flink and Paimon communities will maintain close cooperation.
> >>> >>
> >>> >>
> >>> >> Dear Flinkers,
> >>> >>
> >>> >>
> >>> >> As you may have noticed, we are pleased to announce that Flink Table
> >>> >> Store has officially joined the Apache Incubator as a separate
> >>> >> project [1] [2] [3]. The new project is named Apache Paimon
> >>> >> (incubating) and remains committed to building a next-generation
> >>> >> streaming lakehouse platform supporting high-speed data ingestion,
> >>> >> streaming data subscription, and efficient real-time analytics. In
> >>> >> addition, the new project will support a richer ecosystem and build
> >>> >> a vibrant and neutral open source community.
> >>> >>
> >>> >>
> >>> >> Here we would like to thank everyone for the great support of and
> >>> >> investment in the Flink Table Store project, and we warmly welcome
> >>> >> everyone to join the development and community activities of the new
> >>> >> project. Apache Flink will continue to be one of the main compute
> >>> >> engines supported by Paimon, and we believe the Flink and Paimon
> >>> >> communities will maintain close cooperation.
> >>> >>
> >>> >>
> >>> >> Best Regards,
> >>> >> Yu (on behalf of the Apache Flink PMC and Apache Paimon PPMC)
> >>> >>
> >>> >> Best regards,
> >>> >> Yu Li (on behalf of the Apache Flink PMC and Apache Paimon PPMC)
> >>> >>
> >>> >> [1] https://paimon.apache.org/
> >>> >> [2] https://github.com/apache/incubator-paimon
> >>> >> [3]
> >>> https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal
> >>> >>
> >>> >>
> >>> >>
> >>>
> >>
>


Re: [VOTE] Release 1.16.1, release candidate #1

2023-01-29 Thread Jing Zhang
Thanks Martijn for driving this.
+1 (non-binding)

- verified signatures and checksums
- built from source code flink-1.16.1-src.tgz
- start a standalone Flink cluster, run the WordCount example, WebUI looks
good,  no suspicious output/log.
- start cluster and run some e2e sql queries using SQL Client, query result
is expected
- run some batch jobs; query results are almost as expected, except for
issue FLINK-30567 [1]. It's not a blocker issue, so it would be OK to
cherry-pick the bugfix into 1.16.2 later as well.

[1] https://issues.apache.org/jira/browse/FLINK-30567

Best,
Jing Zhang


On Mon, Jan 30, 2023 at 10:33, ConradJam wrote:

> +1 (non-binding)
>
>    - built from sources successfully
>    - ran StateMachineExample on a local cluster and a k8s cluster
>    - checked the PR
>
>
> On Fri, Jan 20, 2023 at 00:07, Martijn Visser wrote:
>
> > Hi everyone,
> > Please review and vote on the release candidate #1 for the version
> 1.16.1,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint A5F3BCE4CBE993573EC5966A65321B8382B219AF [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-1.16.1-rc1" [5],
> > * website pull request listing the new release and adding announcement
> blog
> > post [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > NOTE: The maven artifacts have been signed by Chesnay with the key with
> > fingerprint C2EED7B111D464BA
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12352344
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.16.1-rc1/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1580
> > [5] https://github.com/apache/flink/releases/tag/release-1.16.1-rc1
> > [6] https://github.com/apache/flink-web/pull/603
> >
>
>
> --
> Best
>
> ConradJam
>


Re: [VOTE] FLIP-281: Sink Supports Speculative Execution For Batch Job

2023-01-12 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

On Thu, Jan 12, 2023 at 16:39, Lijie Wang wrote:

> +1 (binding)
>
> Best,
> Lijie
>
> Martijn Visser  于2023年1月12日周四 15:56写道:
>
> > +0 (binding)
> >
> > On Tue, Jan 10, 2023 at 13:11, yuxia wrote:
> >
> > > +1 (non-binding).
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > ----- Original Message -----
> > > From: "Zhu Zhu" 
> > > To: "dev" 
> > > Sent: Tuesday, January 10, 2023, 5:50:39 PM
> > > Subject: Re: [VOTE] FLIP-281: Sink Supports Speculative Execution For Batch
> > Job
> > >
> > > +1 (binding)
> > >
> > > Thanks,
> > > Zhu
> > >
> > > On Thu, Jan 5, 2023 at 10:37, Biao Liu wrote:
> > > >
> > > > Hi Martijn,
> > > >
> > > > Sure, thanks for the reminder about the holiday period.
> > > > Looking forward to your feedback!
> > > >
> > > > Thanks,
> > > > Biao /'bɪ.aʊ/
> > > >
> > > >
> > > >
> > > > On Thu, 5 Jan 2023 at 03:07, Martijn Visser <
> martijnvis...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi Biao,
> > > > >
> > > > > To be honest, I haven't read the FLIP yet since this is still a
> > holiday
> > > > > period in Europe. I would like to read it in the next few days. Can
> > you
> > > > > keep the vote open a little longer?
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Wed, Jan 4, 2023 at 1:31 PM Biao Liu 
> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Thanks for all the feedback!
> > > > > >
> > > > > > Based on the discussion[1], we seem to have a consensus. So I'd
> > like
> > > to
> > > > > > start a vote on FLIP-281: Sink Supports Speculative Execution For
> > > Batch
> > > > > > Job[2]. The vote will last for 72 hours, unless there is an
> > > objection or
> > > > > > insufficient votes.
> > > > > >
> > > > > > [1]
> > https://lists.apache.org/thread/l05m6cf8fwkkbpnjtzbg9l2lo40oxzd1
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-281+Sink+Supports+Speculative+Execution+For+Batch+Job
> > > > > >
> > > > > > Thanks,
> > > > > > Biao /'bɪ.aʊ/
> > > > > >
> > > > >
> > >
> >
>


Re: [DISCUSS] FLIP-281 Sink Supports Speculative Execution For Batch Job

2023-01-03 Thread Jing Zhang
Hi Biao,

Thanks for explanation.

+1 for the proposal.

Best,
Jing Zhang

On Wed, Jan 4, 2023 at 12:11, Lijie Wang wrote:

> Hi Biao,
>
> Thanks for the explanation of how SinkV2  knows the right subtask
> attempt. I have no more questions, +1 for the proposal.
>
> Best,
> Lijie
>
> On Wed, Dec 28, 2022 at 17:22, Biao Liu wrote:
>
> > Thanks for all your feedback!
> >
> > To @Yuxia,
> >
> > > What is the sink expected to do to isolate data produced by speculative
> > > executions? IIUC, if the task fails over, it also generates a new
> > > attempt.
> > > Does it make a difference in isolating the data produced?
> >
> >
> > Yes, there is something different from the task failover scenario. The
> > attempt number is more necessary for speculative execution than for
> > failover, because there can be only one subtask instance running at the
> > same time in the failover scenario.
> >
> > Let's take FileSystemOutputFormat as an example. For the failover
> scenario,
> > the temporary directory to store produced data can be something like
> > "$root_dir/task-$taskNumber/". At the initialization phase, subtask
> deletes
> > and re-creates the temporary directory.
> >
> > However in the speculative execution scenario, it does not work because
> > there might be several subtasks running at the same time. These subtasks
> > might delete, re-create and write the same temporary directory at the
> > same time. The correct temporary directory should be like
> > "$root_dir/task-$taskNumber/attempt-$attemptNumber". So it's necessary to
> > expose the attempt number to the Sink implementation to do the data
> > isolation.
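To make the collision concrete, the two directory layouts described above can be sketched as a small, self-contained example. This is plain Java with made-up path names, not Flink's actual FileSystemOutputFormat code:

```java
import java.util.HashSet;
import java.util.Set;

public class AttemptIsolationSketch {

    // Failover-style layout: the task number alone identifies the directory.
    static String failoverDir(String rootDir, int taskNumber) {
        return rootDir + "/task-" + taskNumber;
    }

    // Speculative-execution-safe layout: the attempt number is appended,
    // so concurrent attempts of the same subtask never share a directory.
    static String speculativeDir(String rootDir, int taskNumber, int attemptNumber) {
        return rootDir + "/task-" + taskNumber + "/attempt-" + attemptNumber;
    }

    public static void main(String[] args) {
        // Two concurrent attempts of subtask 3, as speculative execution may spawn.
        Set<String> withoutAttempt = new HashSet<>();
        withoutAttempt.add(failoverDir("/tmp/job", 3));
        withoutAttempt.add(failoverDir("/tmp/job", 3));

        Set<String> withAttempt = new HashSet<>();
        withAttempt.add(speculativeDir("/tmp/job", 3, 0));
        withAttempt.add(speculativeDir("/tmp/job", 3, 1));

        // Without the attempt number, both attempts collide on one directory;
        // with it, each attempt gets its own.
        System.out.println(withoutAttempt.size() + " vs " + withAttempt.size()); // prints "1 vs 2"
    }
}
```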
> >
> >
> > To @Lijie,
> >
> > > I have a question about this: does SinkV2 need to do the same thing?
> >
> >
> > Actually, yes.
> >
> > Should we/users do it in the committer? If yes, how does the commiter
> know
> > > which one is the right subtask attempt?
> >
> >
> > Yes, we/users should do it in the committer.
> >
> > In the current design, the Committer of Sink V2 should get the "which one
> > is the right subtask attempt" information from the "committable data"
> > produced by the SinkWriter. Let's take the FileSink as an example: the
> > "committable data" sent to the Committer contains the full paths of the
> > files produced by the SinkWriter. Users could also pass the attempt number
> > through the "committable data" from SinkWriter to Committer.
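As an illustration of that flow, here is a minimal, Flink-free sketch in which the writer's committable carries the attempt number and the committer keeps only the data of the attempt the JM marked as finished. Class and field names are made up for the example; this is not the real Sink V2 API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CommitterSketch {

    // Simplified stand-in for Sink V2 "committable data": the writer records
    // the file it produced together with the attempt that produced it.
    record Committable(int subtaskIndex, int attemptNumber, String path) {}

    // Keep only the files written by the attempt that was marked as finished;
    // everything from the losing attempts is left for cleanup.
    static List<String> pathsToCommit(
            List<Committable> received, int subtaskIndex, int finishedAttempt) {
        return received.stream()
                .filter(c -> c.subtaskIndex() == subtaskIndex
                        && c.attemptNumber() == finishedAttempt)
                .map(Committable::path)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Committable> received = List.of(
                new Committable(3, 0, "/tmp/job/task-3/attempt-0/part-0"),
                new Committable(3, 1, "/tmp/job/task-3/attempt-1/part-0"));
        // Suppose attempt 1 of subtask 3 finished first.
        System.out.println(pathsToCommit(received, 3, 1));
        // prints [/tmp/job/task-3/attempt-1/part-0]
    }
}
```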
> >
> > In the "Rejected Alternatives -> Introduce a way to clean leaked data of
> > Sink V2" section of the FLIP document, we discussed some of the reasons
> > that we didn't provide the API like OutputFormat.
> >
> > To @Jing Zhang
> >
> > I have a question about this: Speculative execution of Committer will be
> > > disabled.
> >
> > I agree with your point and I saw the similar requirements to disable
> > speculative
> > > execution for specified operators.
> >
> > However the requirement is not supported currently. I think there
> > should be some
> > > place to describe how to support it.
> >
> >
> > In this FLIP design, the speculative execution of Committer of Sink V2
> will
> > be disabled by Flink. It's not an optional operation. Users can not
> change
> > it.
> > And as you said, "disable speculative execution for specified operators"
> is
> > not supported in the FLIP. Because it's a bit out of scope: "Sink
> Supports
> > Speculative Execution For Batch Job". I think it's better to start
> another
> > FLIP to discuss it. "Fine-grained control of enabling speculative
> execution
> > for operators" can be the title of that FLIP. And we can discuss there
> how
> > to enable or disable speculative execution for specified operators
> > including Committer and pre/post-committer of Sink V2.
> >
> > What do you think?
> >
> > Thanks,
> > Biao /'bɪ.aʊ/
> >
> >
> >
> > On Wed, 28 Dec 2022 at 11:30, Jing Zhang  wrote:
> >
> > > Hi Biao,
> > >
> > > Thanks for driving this FLIP. It's meaningful to support speculative
> > > execution of sinks.
> > >
> > > I have a question about this: Speculative execution of Committer will
> be
> > > disabled.
> > >
> > > I agree with your point and I saw the similar requirements to disable
> > > speculative execution for specified operators.
> > 

Re: [DISCUSS] FLIP-281 Sink Supports Speculative Execution For Batch Job

2022-12-27 Thread Jing Zhang
Hi Biao,

Thanks for driving this FLIP. It's meaningful to support speculative execution
of sinks.

I have a question about this: Speculative execution of Committer will be
disabled.

I agree with your point, and I have seen similar requirements to disable
speculative execution for specified operators.

However, that requirement is not currently supported. I think there should be
some place describing how to support it.

Best,
Jing Zhang

On Tue, Dec 27, 2022 at 18:51, Lijie Wang wrote:

> Hi Biao,
>
> Thanks for driving this FLIP.
> In this FLIP, it introduces "int getFinishedAttempt(int subtaskIndex)" for
> OutputFormat to know which subtask attempt is the one marked as finished by
> JM and commit the right data.
> I have a question about this: does SinkV2 need to do the same thing? Should
> we/users do it in the committer? If yes, how does the commiter know which
> one is the right subtask attempt?
>
> Best,
> Lijie
>
> On Tue, Dec 27, 2022 at 10:01, yuxia wrote:
>
> > Hi, Biao.
> > Thanks for driving this FLIP.
> > After a quick look at this FLIP, I have a question about "expose the
> > attempt number which can be used to isolate data produced by speculative
> > executions with the same subtask id".
> > What is the sink expected to do to isolate data produced by speculative
> > executions? IIUC, if the task fails over, it also generates a new attempt.
> > Does it make a difference in isolating the data produced?
> >
> > Best regards,
> > Yuxia
> >
> > ----- Original Message -----
> > From: "Biao Liu" 
> > To: "dev" 
> > Sent: Thursday, December 22, 2022, 8:16:21 PM
> > Subject: [DISCUSS] FLIP-281 Sink Supports Speculative Execution For Batch Job
> >
> > Hi everyone,
> >
> > I would like to start a discussion about making Sink support speculative
> > execution for batch jobs. This proposal is a follow up of "FLIP-168:
> > Speculative Execution For Batch Job"[1]. Speculative execution is very
> > meaningful for batch jobs. And it would be more complete after supporting
> > speculative execution of Sink. Please find more details in the FLIP
> > document
> > [2].
> >
> > Looking forward to your feedback.
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-281+Sink+Supports+Speculative+Execution+For+Batch+Job
> >
> > Thanks,
> > Biao /'bɪ.aʊ/
> >
>


[jira] [Created] (FLINK-28741) Unexpected result if insert 'false' to boolean column

2022-07-29 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-28741:
--

 Summary: Unexpected result if insert 'false' to boolean column
 Key: FLINK-28741
 URL: https://issues.apache.org/jira/browse/FLINK-28741
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Reporter: Jing Zhang


Using the Hive dialect to insert the string 'false' into a boolean column, the
result is true.

The error can be reproduced in the following ITCase.
{code:java}
@Test
public void testUnExpectedResult() throws ExecutionException, InterruptedException {
    HiveModule hiveModule = new HiveModule(hiveCatalog.getHiveVersion());
    CoreModule coreModule = CoreModule.INSTANCE;
    for (String loaded : tableEnv.listModules()) {
        tableEnv.unloadModule(loaded);
    }
    tableEnv.loadModule("hive", hiveModule);
    tableEnv.loadModule("core", coreModule);

    // create the source table
    tableEnv.executeSql(
            "CREATE TABLE test_table (params string) PARTITIONED BY (`p_date` string)");
    // prepare a row whose value is the string 'false'
    tableEnv.executeSql(
                    "insert overwrite test_table partition(p_date = '20220612') values ('false')")
            .await();
    // create the target table, which only contains one boolean column
    tableEnv.executeSql(
            "CREATE TABLE target_table (flag boolean) PARTITIONED BY (`p_date` string)");
    // insert the string value into the boolean column
    tableEnv.executeSql(
                    "insert overwrite table target_table partition(p_date = '20220724') "
                            + "SELECT params FROM test_table WHERE p_date='20220612'")
            .await();

    TableImpl flinkTable =
            (TableImpl)
                    tableEnv.sqlQuery(
                            "select flag from target_table where p_date = '20220724'");
    List<Row> results = CollectionUtil.iteratorToList(flinkTable.execute().collect());
    // fails: the actual result is [true]
    assertEquals("[false]", results.toString());
}
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] FLIP-248: Introduce dynamic partition pruning

2022-07-27 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

On Wed, Jul 27, 2022 at 16:52, Jingsong Li wrote:

> +1
>
> On Wed, Jul 27, 2022 at 3:30 PM Jark Wu  wrote:
> >
> > +1 (binding)
> >
> > Best,
> > Jark
> >
> > On Wed, 27 Jul 2022 at 13:34, Yun Gao 
> wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks for proposing the FLIP!
> > >
> > > Best,
> > > Yun Gao
> > >
> > >
> > > --
> > > From:Jing Ge 
> > > Send Time:2022 Jul. 27 (Wed.) 03:40
> > > To:undefined 
> > > Subject:Re: [VOTE] FLIP-248: Introduce dynamic partition pruning
> > >
> > > +1
> > > Thanks for driving this!
> > >
> > > On Tue, Jul 26, 2022 at 4:01 PM godfrey he 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Thanks for all the feedback so far. Based on the discussion[1] we
> seem
> > > > to have consensus, so I would like to start a vote on FLIP-248 for
> > > > which the FLIP has now also been updated[2].
> > > >
> > > > The vote will last for at least 72 hours (Jul 29th 14:00 GMT) unless
> > > > there is an objection or insufficient votes.
> > > >
> > > > [1] https://lists.apache.org/thread/v0b8pfh0o7rwtlok2mfs5s6q9w5vw8h6
> > > > [2]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > >
>


Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-26 Thread Jing Zhang
Hi, Godfrey
Thanks for updating the FLIP.
It looks good to me now.

Best,
Jing Zhang

On Tue, Jul 26, 2022 at 12:33, godfrey he wrote:

> Thanks for all the inputs, I have updated the document and POC code.
>
>
> Best,
> Godfrey
>
> On Tue, Jul 26, 2022 at 11:11, Yun Gao wrote:
> >
> > Hi,
> >
> > Thanks all for all the valuable discussion on this FLIP, +1 for
> implementing
> > dynamic partition pruning / dynamic filtering pushdown since it is a key
> optimization
> > to improve the performance on batch processing.
> >
> > Also due to introducing the speculative execution for the batch
> processing, we
> > might also need some consideration for the case with speculative
> execution enabled:
> > 1. The operator coordinator of DynamicFilteringDataCollector should
> ignore the following
> > filtering data in consider of the task might executes for multiple
> attempts.
> > 2. The DynamicFileSplitEnumerator should also implements the
> `SupportsHandleExecutionAttemptSourceEvent`
> > interface, otherwise it would throws exception when received the
> filtering data source event.
> >
> > Best,
> > Yun Gao
> >
> >
> >
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job
> >
> >
> >
> > --
> > From:Jing Ge 
> > Send Time:2022 Jul. 21 (Thu.) 18:56
> > To:dev 
> > Subject:Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning
> >
> > Hi,
> >
> > Thanks for the informative discussion! Looking forward to using dynamic
> > filtering provided by Flink.
> >
> > Best regards,
> > Jing
> >
> > On Tue, Jul 19, 2022 at 3:22 AM godfrey he  wrote:
> >
> > > Hi Jingsong, Jark, Jing,
> > >
> > > Thanks for the important inputs.
> > > Lake storage is a very important scenario, and considering the more
> > > generic and extended case, I would also like to use the "dynamic
> > > filtering" concept instead of "dynamic partition".
> > >
> > > >maybe the FLIP should also demonstrate the EXPLAIN result, which
> > > is also an API.
> > > I will add a section to describe the EXPLAIN result.
> > >
> > > >Does DPP also support streaming queries?
> > > Yes, but for bounded source.
> > >
> > > >it requires the SplitEnumerator must implements new introduced
> > > `SupportsHandleExecutionAttemptSourceEvent` interface,
> > > +1
> > >
> > > I will update the document and the poc code.
> > >
> > > Best,
> > > Godfrey
> > >
> > > On Wed, Jul 13, 2022 at 20:22, Jing Zhang wrote:
> > > >
> > > > Hi Godfrey,
> > > > Thanks for driving this discussion.
> > > > This is an important improvement for batch sql jobs.
> > > > I agree with Jingsong to expand the capability to more than just
> > > partitions.
> > > > Besides, I have two points:
> > > > 1. Based on FLIP-248[1],
> > > >
> > > > > Dynamic partition pruning mechanism can improve performance by
> avoiding
> > > > > reading large amounts of irrelevant data, and it works for both
> batch
> > > and
> > > > > streaming queries.
> > > >
> > > > Does DPP also support streaming queries?
> > > > It seems the proposed changes in the FLIP-248 does not work for
> streaming
> > > > queries,
> > > > because the dimension table might be an unbounded inputs.
> > > > Or does it require all dimension tables to be bounded inputs for
> > > streaming
> > > > jobs if the job wanna enable DPP?
> > > >
> > > > 2. I notice there are changes on SplitEnumerator for Hive source and
> File
> > > > source.
> > > > And they now depend on SourceEvent to pass PartitionData.
> > > > In FLIP-245, if enable speculative execution for sources based on
> FLIP-27
> > > > which use SourceEvent,
> > > > it requires the SplitEnumerator must implements new introduced
> > > > `SupportsHandleExecutionAttemptSourceEvent` interface,
> > > > otherwise an exception would be thrown out.
> > > > Since hive and File sources are commonly used for batch jobs, it's
> better
> > > > to take this point into consideration.
> > > >
> > > > Best,
> > > > Jing Zhang
> > > >
> > > > [1] FLIP-248:
>

Re: [DISCUSS] FLIP-248: Introduce dynamic partition pruning

2022-07-13 Thread Jing Zhang
Hi Godfrey,
Thanks for driving this discussion.
This is an important improvement for batch sql jobs.
I agree with Jingsong to expand the capability to more than just partitions.
Besides, I have two points:
1. Based on FLIP-248[1],

> Dynamic partition pruning mechanism can improve performance by avoiding
> reading large amounts of irrelevant data, and it works for both batch and
> streaming queries.

Does DPP also support streaming queries?
It seems the proposed changes in FLIP-248 do not work for streaming
queries, because the dimension table might be an unbounded input.
Or does it require all dimension tables to be bounded inputs for streaming
jobs if a job wants to enable DPP?

2. I notice there are changes to the SplitEnumerator for the Hive source and
the File source, and they now depend on SourceEvent to pass PartitionData.
In FLIP-245, if speculative execution is enabled for sources based on
FLIP-27, which use SourceEvent, the SplitEnumerator must implement the
newly introduced `SupportsHandleExecutionAttemptSourceEvent` interface;
otherwise an exception would be thrown.
Since the Hive and File sources are commonly used for batch jobs, it's better
to take this point into consideration.
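For readers unfamiliar with the mechanism, the pruning step itself can be sketched without any Flink dependencies: the enumerator receives the set of join-key partition values produced by the dimension-table scan (in Flink this arrives as a SourceEvent) and drops every split whose partition is not in that set. The classes below are illustrative only, not Flink's real interfaces:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DppSketch {

    // Simplified split: a file path plus the partition value it belongs to.
    record Split(String path, String partition) {}

    // What a dynamic split enumerator conceptually does when the dimension
    // table results arrive: keep only splits in the surviving partitions.
    static List<Split> prune(List<Split> allSplits, Set<String> relevantPartitions) {
        return allSplits.stream()
                .filter(s -> relevantPartitions.contains(s.partition()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
                new Split("/warehouse/store_returns/d_year=1999/f0", "1999"),
                new Split("/warehouse/store_returns/d_year=2000/f0", "2000"),
                new Split("/warehouse/store_returns/d_year=2000/f1", "2000"));
        // Pretend the dim-table scan reported that only d_year = 2000 survives
        // the filter, so the 1999 split is never read.
        System.out.println(prune(splits, Set.of("2000")).size()); // prints 2
    }
}
```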

Best,
Jing Zhang

[1] FLIP-248:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
[2] FLIP-245:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job


On Tue, Jul 12, 2022 at 13:16, Jark Wu wrote:

> I agree with Jingsong. DPP is a particular case of Dynamic Filter Pushdown
> in which the join key contains partition fields. Extending this FLIP to
> general filter pushdown can benefit more optimizations, and they can share
> the same interface.
>
> For example, Trino Hive Connector leverages dynamic filtering to support:
> - dynamic partition pruning for partitioned tables
> - and dynamic bucket pruning for bucket tables
> - and dynamic filter pushed into the ORC and Parquet readers to perform
> stripe
>   or row-group pruning and save on disk I/O.
>
> Therefore, +1 to extend this FLIP to Dynamic Filter Pushdown (or Dynamic
> Filtering),
> just like Trino [1].  The interfaces should also be adapted for that.
>
> Besides, maybe the FLIP should also demonstrate the EXPLAIN result, which
> is also an API.
>
> Best,
> Jark
>
> [1]: https://trino.io/docs/current/admin/dynamic-filtering.html
>
>
>
>
>
>
>
>
>
>
> On Tue, 12 Jul 2022 at 09:59, Jingsong Li  wrote:
>
> > Thanks Godfrey for driving.
> >
> > I like this FLIP.
> >
> > We can restrict this capability to more than just partitions.
> > Here are some inputs from Lake Storage.
> >
> > The format of the splits generated by Lake Storage is roughly as follows:
> > Split {
> >Path filePath;
> >Statistics[] fieldStats;
> > }
> >
> > Stats contain the min and max of each column.
> >
> > If the storage is sorted by a column, split filtering on that column will
> > be very effective, so not only the partition fields but also this column
> > is worth pushing down to the runtime filter.
> > This information can only be known by the source, so I suggest that the
> > source return which fields are worth being pushed down.
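The min/max-statistics idea in the quoted message can be sketched the same way: a split whose stats range cannot contain the pushed-down value is skipped without touching disk. Again a toy model with invented names, not an actual lake-storage format:

```java
import java.util.List;
import java.util.stream.Collectors;

public class StatsPruningSketch {

    // Per-split column statistics, as a lake storage format might expose them.
    record Stats(long min, long max) {}
    record Split(String filePath, Stats sortedColumnStats) {}

    // A split can be skipped when the runtime-filter value lies outside
    // the [min, max] range of the (sorted) column in that split.
    static List<Split> prune(List<Split> splits, long filterValue) {
        return splits.stream()
                .filter(s -> filterValue >= s.sortedColumnStats().min()
                        && filterValue <= s.sortedColumnStats().max())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Split> splits = List.of(
                new Split("f0", new Stats(0, 99)),
                new Split("f1", new Stats(100, 199)),
                new Split("f2", new Stats(200, 299)));
        // With the data sorted on the column, a point filter value of 150
        // can only live in the middle split.
        System.out.println(prune(splits, 150).size()); // prints 1
    }
}
```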
> >
> > My overall point is:
> > This FLIP can be extended to support Source Runtime Filter push-down
> > for all fields, not just dynamic partition pruning.
> >
> > What do you think?
> >
> > Best,
> > Jingsong
> >
> > On Fri, Jul 8, 2022 at 10:12 PM godfrey he  wrote:
> > >
> > > Hi all,
> > >
> > > I would like to open a discussion on FLIP-248: Introduce dynamic
> > > partition pruning.
> > >
> > >  Currently, Flink supports static partition pruning: the conditions in
> > > the WHERE clause are analyzed
> > > to determine in advance which partitions can be safely skipped in the
> > > optimization phase.
> > > Another common scenario: the partitions information is not available
> > > in the optimization phase but in the execution phase.
> > > That's the problem this FLIP is trying to solve: dynamic partition
> > > pruning, which could reduce the partition table source IO.
> > >
> > > The query pattern looks like:
> > > select * from store_returns, date_dim where sr_returned_date_sk =
> > > d_date_sk and d_year = 2000
> > >
> > > We will introduce a mechanism for detecting dynamic partition pruning
> > > patterns in optimization phase
> > > and performing partition pruning at runtime by sending the dimension
> > > table results to the SplitEnumerator
> > > of fact table via existing coordinator mechanism.
> > >
> > > You can find more details in FLIP-248 document[1].
> > > Looking forward to your any feedback.
> > >
> > > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-248%3A+Introduce+dynamic+partition+pruning
> > > [2] POC: https://github.com/godfreyhe/flink/tree/FLIP-248
> > >
> > >
> > > Best,
> > > Godfrey
> >
>


Re: [VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-13 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

On Wed, Jul 13, 2022 at 14:49, Gen Luo wrote:

> Hi Jing,
>
> I have replied in the discussion thread about the questions. Hope that
> would be helpful.
>
> Best,
> Gen
>
> On Tue, Jul 12, 2022 at 8:43 PM Jing Zhang  wrote:
>
> > Hi, Gen Luo,
> >
> > I left  two minor questions in the DISCUSS thread.
> > Sorry for jumping into the discussion so late.
> >
> > Best,
> > Jing Zhang
> >
> > On Tue, Jul 12, 2022 at 19:29, Lijie Wang wrote:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > Lijie
> > >
> > > On Tue, Jul 12, 2022 at 17:38, Zhu Zhu wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Thanks,
> > > > Zhu
> > > >
> > > > On Tue, Jul 12, 2022 at 13:46, Gen Luo wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > >
> > > > > Thanks for all the feedback so far. Based on the discussion [1], we
> > > seem
> > > > to
> > > > > have consensus. So, I would like to start a vote on FLIP-249 [2].
> > > > >
> > > > >
> > > > > The vote will last for at least 72 hours unless there is an
> objection
> > > or
> > > > > insufficient votes.
> > > > >
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/832tk3zvysg45vrqrv5tgbdqx974pm3m
> > > > > [2]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-249%3A+Flink+Web+UI+Enhancement+for+Speculative+Execution
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-13 Thread Jing Zhang
Hi Gen,

> The way the speculative executions are presented should be almost the
> same as when the job was running. Users can still find the executions
> folded in the subtask list page.

Checking every vertex and every subtask list page is a rather complicated
operation. It would be better to have an easier way to know whether a job
contained speculative executions, even after the job has finished.
Maybe this point could be taken into consideration in the next version.

Best,
Jing Zhang


On Wed, Jul 13, 2022 at 14:47, Gen Luo wrote:

> Hi Jing,
>
> Thanks for joining the discussion. It's a very good point to figure out the
> possible influence on the history server.
>
> > 1. Does the improvement also cover history server or just Web UI?
> As far as I know most Web UI components are shared between
> runtime and history server, so the improvement is expected to cover both.
>
> We will make sure the changes proposed in this FLIP do not conflict with
> the ongoing FLIP-241 which is working on the enhancement of completed
> job information.
>
> > 2. How to know whether the job contains speculative execution
> > instances after the job has finished? Do we have to check each subtask
> > of every vertex one by one?
>
> When one attempt of a subtask finishes, all other concurrent attempts
> will be canceled, but are still treated as current executions. The way the
> speculative executions are presented should be almost the same as when the
> job was running. Users can still find the executions folded in the subtask
> list page.
>
> As we mentioned in the FLIP, all changes are expected to be transparent
> to users who don't use speculative execution. And to users who do use
> speculative execution, the experience should be almost the same
> when watching a running job or a completed job in the history server.
>
> Best,
> Gen
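Gen's description of the attempt lifecycle — one attempt finishing cancels its concurrent attempts, which nevertheless stay visible as current executions — can be sketched as a small model. All class and field names below are illustrative, not Flink's actual runtime API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a subtask with speculative execution attempts.
// The names here are illustrative; Flink's real classes differ.
public class SubtaskAttempts {
    enum State { RUNNING, FINISHED, CANCELED }

    static class Attempt {
        final int attemptId;
        State state = State.RUNNING;
        Attempt(int id) { this.attemptId = id; }
    }

    final List<Attempt> currentAttempts = new ArrayList<>();

    Attempt startAttempt() {
        Attempt a = new Attempt(currentAttempts.size());
        currentAttempts.add(a);
        return a;
    }

    // When one attempt finishes, cancel the concurrent ones, but keep
    // them in the current-execution list so the UI can still show them.
    void finish(Attempt winner) {
        winner.state = State.FINISHED;
        for (Attempt a : currentAttempts) {
            if (a != winner && a.state == State.RUNNING) {
                a.state = State.CANCELED;
            }
        }
    }

    public static void main(String[] args) {
        SubtaskAttempts subtask = new SubtaskAttempts();
        Attempt original = subtask.startAttempt();
        Attempt speculative = subtask.startAttempt();
        subtask.finish(speculative);
        // Both attempts remain listed; the loser is canceled, not removed.
        assert subtask.currentAttempts.size() == 2;
        assert original.state == State.CANCELED;
        assert speculative.state == State.FINISHED;
        System.out.println("attempts kept: " + subtask.currentAttempts.size());
    }
}
```

This mirrors why the history server can show both attempts after the job finishes: cancellation changes the attempt's state, not its membership in the current executions.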
>
> On Tue, Jul 12, 2022 at 8:41 PM Jing Zhang  wrote:
>
> > Thanks for driving this discussion. It's a very helpful improvement.
> > I only have two minor questions:
> > 1. Does the improvement also cover history server or just Web UI?
> > 2. How to know whether the job contains speculative execution instances
> > after the job has finished?
> > Do we have to check each subtask of every vertex one by one?
> >
> > Best,
> > Jing Zhang
> >
> > Gen Luo  于2022年7月11日周一 22:31写道:
> >
> > > Hi, everyone.
> > >
> > > Thanks for your feedback.
> > > If there are no more concerns or comments, I will start the vote
> > tomorrow.
> > >
> > > Gen Luo  于 2022年7月11日周一 11:12写道:
> > >
> > > > Hi Lijie and Zhu,
> > > >
> > > > Thanks for the suggestion. I agree that the name "Blocked Free Slots"
> > is
> > > > more clear to users.
> > > > I'll take the suggestion and update the FLIP.
> > > >
> > > > On Fri, Jul 8, 2022 at 9:12 PM Zhu Zhu  wrote:
> > > >
> > > >> I agree that it can be more useful to show the number of slots that
> > are
> > > >> free but blocked. Currently users infer the slots in use by
> > subtracting
> > > >> available slots from the total slots. With blocked slots introduced,
> > > this
> > > >> can be achieved by subtracting available slots and blocked free
> slots
> > > >> from the total slots.
> > > >>
> > > >> Therefore, +1 to show "Blocked Free Slots" on the resource card.
> > > >>
> > > >> Thanks,
> > > >> Zhu
> > > >>
> > > >> Lijie Wang  于2022年7月8日周五 17:39写道:
> > > >> >
> > > >> > Hi Gen & Zhu,
> > > >> >
> > > >> > -> 1. Can we also show "Blocked Slots" in the resource card, so
> that
> > > >> users
> > > >> > can easily figure out how many slots are available/blocked/in-use?
> > > >> >
> > > >> > I think we should describe the "available" and "blocked" more
> > clearly.
> > > >> In
> > > >> > my opinion, I think users should be interested in the number of
> > slots
> > > in
> > > >> > the following 3 states:
> > > >> > 1. free and unblocked, I think it's OK to call this state
> > "available".
> > > >> > 2. free and blocked, I think it's not appropriate to call
> "blocked"
> > > >> > directly, because "blocked" should include both the "free and
> > blocked"
> > > >> and
> > > >

Re: [VOTE] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-12 Thread Jing Zhang
Hi, Gen Luo,

I left two minor questions in the DISCUSS thread.
Sorry for jumping into the discussion so late.

Best,
Jing Zhang

Lijie Wang  于2022年7月12日周二 19:29写道:

> +1 (non-binding)
>
> Best,
> Lijie
>
> Zhu Zhu  于2022年7月12日周二 17:38写道:
>
> > +1 (binding)
> >
> > Thanks,
> > Zhu
> >
> > Gen Luo  于2022年7月12日周二 13:46写道:
> > >
> > > Hi everyone,
> > >
> > >
> > > Thanks for all the feedback so far. Based on the discussion [1], we
> seem
> > to
> > > have consensus. So, I would like to start a vote on FLIP-249 [2].
> > >
> > >
> > > The vote will last for at least 72 hours unless there is an objection
> or
> > > insufficient votes.
> > >
> > >
> > > [1] https://lists.apache.org/thread/832tk3zvysg45vrqrv5tgbdqx974pm3m
> > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-249%3A+Flink+Web+UI+Enhancement+for+Speculative+Execution
> >
>


Re: [DISCUSS] FLIP-249: Flink Web UI Enhancement for Speculative Execution

2022-07-12 Thread Jing Zhang
Thanks for driving this discussion. It's a very helpful improvement.
I only have two minor questions:
1. Does the improvement also cover history server or just Web UI?
2. How to know whether the job contains speculative execution instances
after the job has finished?
Do we have to check each subtask of every vertex one by one?

Best,
Jing Zhang

Gen Luo  于2022年7月11日周一 22:31写道:

> Hi, everyone.
>
> Thanks for your feedback.
> If there are no more concerns or comments, I will start the vote tomorrow.
>
> Gen Luo  于 2022年7月11日周一 11:12写道:
>
> > Hi Lijie and Zhu,
> >
> > Thanks for the suggestion. I agree that the name "Blocked Free Slots" is
> > more clear to users.
> > I'll take the suggestion and update the FLIP.
> >
> > On Fri, Jul 8, 2022 at 9:12 PM Zhu Zhu  wrote:
> >
> >> I agree that it can be more useful to show the number of slots that are
> >> free but blocked. Currently users infer the slots in use by subtracting
> >> available slots from the total slots. With blocked slots introduced,
> this
> >> can be achieved by subtracting available slots and blocked free slots
> >> from the total slots.
> >>
> >> Therefore, +1 to show "Blocked Free Slots" on the resource card.
> >>
> >> Thanks,
> >> Zhu
> >>
> >> Lijie Wang  于2022年7月8日周五 17:39写道:
> >> >
> >> > Hi Gen & Zhu,
> >> >
> >> > -> 1. Can we also show "Blocked Slots" in the resource card, so that
> >> users
> >> > can easily figure out how many slots are available/blocked/in-use?
> >> >
> >> > I think we should describe the "available" and "blocked" more clearly.
> >> In
> >> > my opinion, I think users should be interested in the number of slots
> in
> >> > the following 3 states:
> >> > 1. free and unblocked, I think it's OK to call this state "available".
> >> > 2. free and blocked, I think it's not appropriate to call "blocked"
> >> > directly, because "blocked" should include both the "free and blocked"
> >> and
> >> > "in-use and blocked".
> >> > 3. in-use
> >> >
> >> > And the sum of the above 3 kinds of slots should be the total
> >> > number of slots in this cluster.
> >> >
> >> > WDYT?
> >> >
> >> > Best,
> >> > Lijie
> >> >
> >> > Gen Luo  于2022年7月8日周五 16:14写道:
> >> >
> >> > > Hi Zhu,
> >> > > Thanks for the feedback!
> >> > >
> >> > > 1.Good idea. Users should be more familiar with the slots as the
> >> resource
> >> > > units.
> >> > >
> >> > > 2.You remind me that the "speculative attempts" are execution
> attempts
> >> > > started by the SpeculativeScheduler when slow tasks are detected,
> >> while the
> >> > > current execution attempts other than the "most current" one are not
> >> really
> >> > > the speculative attempts. I agree we should modify the field name.
> >> > >
> >> > > 3.ArchivedSpeculativeExecutionVertex seems to be introduced with the
> >> > > speculative execution to handle the speculative attempts as a part
> of
> >> the
> >> > > execution history. Since this FLIP is handling the attempts in a
> >> > > more proper way, I agree that we can remove the
> >> > > ArchivedSpeculativeExecutionVertex.
> >> > >
> >> > > Thanks again and I'll update the FLIP later according to these
> >> suggestions.
> >> > >
> >> > > On Thu, Jul 7, 2022 at 4:35 PM Zhu Zhu  wrote:
> >> > >
> >> > > > Thanks for writing this FLIP and initiating the discussion, Gen,
> >> Yun and
> >> > > > Junhan!
> >> > > > It will be very useful to have these improvements on the web UI
> for
> >> > > > speculative execution users, allowing them to know what is
> >> happening.
> >> > > > I just have a few comment regarding the design details:
> >> > > >
> >> > > > 1. Can we also show "Blocked Slots" in the resource card, so that
> >> users
> >> > > > can easily figure out how many slots are available/blocked/in-use?
> >> > > > 2. I think "speculati
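The slot arithmetic discussed in the quoted thread — total = available (free and unblocked) + blocked free + in-use, with in-use slots derived by subtraction — can be sketched as follows. The class is a hypothetical model of the resource card numbers, not a Flink class:

```java
// Hypothetical model of the resource-card numbers; not Flink's actual API.
public class SlotAccounting {
    final int total;
    final int available;    // free and unblocked
    final int blockedFree;  // free but blocked

    SlotAccounting(int total, int available, int blockedFree) {
        this.total = total;
        this.available = available;
        this.blockedFree = blockedFree;
    }

    // In-use slots are derived: total minus available minus blocked-free,
    // which is exactly the subtraction users would do from the card.
    int inUse() {
        return total - available - blockedFree;
    }

    public static void main(String[] args) {
        SlotAccounting card = new SlotAccounting(10, 4, 2);
        assert card.inUse() == 4; // 10 - 4 - 2
        System.out.println("in use: " + card.inUse());
    }
}
```

Showing "Blocked Free Slots" explicitly is what makes this derivation possible without ambiguity, since "blocked" alone would mix the free-and-blocked and in-use-and-blocked cases.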

Re: [VOTE] Creating benchmark channel in Apache Flink slack

2022-07-11 Thread Jing Zhang
+ 1 (binding),
Thanks for driving this.

Best,
Jing Zhang

Yun Tang  于2022年7月12日周二 11:19写道:

> +1 (binding, I think this voting acts as voting on a FLIP).
>
>
>
> Best,
> Yun Tang
> 
> From: Alexander Fedulov 
> Sent: Tuesday, July 12, 2022 5:27
> To: dev 
> Subject: Re: [VOTE] Creating benchmark channel in Apache Flink slack
>
> +1. Thanks, Anton.
>
> Best,
> Alexander Fedulov
>
>
>
> On Mon, Jul 11, 2022 at 3:17 PM Márton Balassi 
> wrote:
>
> > +1 (binding). Thanks!
> >
> > I can help you with the Slack admin steps if needed.
> >
> > On Mon, Jul 11, 2022 at 10:55 AM godfrey he  wrote:
> >
> > > +1, Thanks for driving this!
> > >
> > > Best,
> > > Godfrey
> > >
> > > Yuan Mei  于2022年7月11日周一 16:13写道:
> > > >
> > > > +1 (binding) & thanks for the efforts!
> > > >
> > > > Best
> > > > Yuan
> > > >
> > > >
> > > >
> > > > On Mon, Jul 11, 2022 at 2:08 PM Yun Gao  >
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Thanks Anton for driving this!
> > > > >
> > > > >
> > > > > Best,
> > > > > Yun Gao
> > > > >
> > > > >
> > > > > --
> > > > > From:Anton Kalashnikov 
> > > > > Send Time:2022 Jul. 8 (Fri.) 22:59
> > > > > To:undefined 
> > > > > Subject:[VOTE] Creating benchmark channel in Apache Flink slack
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > >
> > > > > I would like to start a vote for creating a new channel in the
> > > > > Apache Flink Slack for sending benchmark results to it. This should
> > > > > help the community notice performance degradation in time.
> > > > >
> > > > > The discussion of this idea can be found here[1]. The ticket for
> this
> > > is
> > > > > here[2].
> > > > >
> > > > >
> > > > > [1]
> https://www.mail-archive.com/dev@flink.apache.org/msg58666.html
> > > > >
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-28468
> > > > >
> > > > > --
> > > > >
> > > > > Best regards,
> > > > > Anton Kalashnikov
> > >
> >
>


[RESULT][VOTE] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-08 Thread Jing Zhang
Hi everyone,
I’m happy to announce that FLIP-245[1] has been accepted, with 5 approving
votes, 3 of which are binding[2]:
- Mang Zhang
- Jiangang Liu
- Guowei Ma (binding)
- Zhu Zhu (binding)
- Jing Zhang (binding)

There is no disapproving vote.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
[2] https://lists.apache.org/thread/ns77scmf4xb3kv50hxybs57lbbnrnv9x

Best,
Jing Zhang


Re: [VOTE] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-07 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

Zhu Zhu  于2022年7月6日周三 10:45写道:

> +1 (binding)
>
> Thanks,
> Zhu
>
> Guowei Ma  于2022年7月5日周二 17:38写道:
> >
> > +1 (binding)
> > Best,
> > Guowei
> >
> >
> > On Tue, Jul 5, 2022 at 12:38 PM Jiangang Liu 
> > wrote:
> >
> > > +1 for the feature.
> > >
> > > Jing Zhang  于2022年7月5日周二 11:43写道:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start a vote for FLIP-245: Source Supports Speculative
> > > > Execution For Batch Job[1] on the discussion thread [2].
> > > >
> > > > The vote will last for at least 72 hours unless there is an
> objection or
> > > > insufficient votes.
> > > >
> > > > Best,
> > > > Jing Zhang
> > > >
> > > > [1]
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
> > > > [2] https://lists.apache.org/thread/zvc5no4yxvwkto7xxpw1vo7j1p6h0lso
> > > >
> > >
>


Re: Re: Re: [VOTE] FLIP-218: Support SELECT clause in CREATE TABLE(CTAS)

2022-07-07 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

Mang Zhang  于2022年7月7日周四 14:39写道:

> Hi Godfrey,
> Thank you very much for this vote!
>
>
> Since the titles of the two vote emails are the same, which makes it easy
> to confuse them, I will count the responses from both emails when I
> finally tally the vote results!
>
>
> the first vote email list:
> https://lists.apache.org/thread/rohf9ytqznmq3mm7lj3mjbvtsng7czsh
>
>
> the second vote email list:
> https://lists.apache.org/thread/yd0lchppk59r93dr37gfchdg8jq7mvwx
>
>
>
>
> --
>
> Best regards,
> Mang Zhang
>
>
>
>
>
> At 2022-07-07 12:26:48, "godfrey he"  wrote:
> >+1, thanks for driving this
> >
> >Best,
> >Godfrey
> >
> >Mang Zhang  于2022年7月5日周二 11:56写道:
> >>
> >> Hi everyone,
> >> I'm sorry to bother you all, but since FLIP-218[1] has been updated,
> I'm going to relaunch VOTE.
> >> The main contents of the modification are:
> >> 1. remove rtas from option name
> >> 2. no longer introduce AtomicCatalog, add javadocs description for
> Catalog interface:
> >> If Catalog needs to support the atomicity feature of CTAS, then Catalog
> must implement Serializable and make the Catalog instances can be
> serializable/deserializable using Java serialization.
> >> When atomicity support for CTAS is enabled, Planner will check if the
> Catalog instance can be serialized using the Java serialization.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
> >>
> >>
> >>
> >>
> >> --
> >>
> >> Best regards,
> >> Mang Zhang
> >>
> >>
> >>
> >>
> >>
> >> At 2022-07-05 09:19:24, "Jiangang Liu" 
> wrote:
> >> >+1 for the feature.
> >> >
> >> >Jark Wu  于2022年7月4日周一 17:33写道:
> >> >
> >> >> Hi Mang,
> >> >>
> >> >> I left a comment in the DISCUSS thread.
> >> >>
> >> >> Best,
> >> >> Jark
> >> >>
> >> >> On Mon, 4 Jul 2022 at 15:24, Rui Fan <1996fan...@gmail.com> wrote:
> >> >>
> >> >> > Hi.
> >> >> >
> >> >> > Thanks Mang for this FLIP. I think it will be useful for users.
> >> >> >
> >> >> > +1(non-binding)
> >> >> >
> >> >> > Best wishes
> >> >> > Rui Fan
> >> >> >
> >> >> > On Mon, Jul 4, 2022 at 3:01 PM Mang Zhang 
> wrote:
> >> >> >
> >> >> > > Hi everyone,
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > Thanks for all the feedback so far. Based on the discussion [1],
> we
> >> >> seem
> >> >> > > to have consensus. So, I would like to start a vote on FLIP-218
> [2].
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > The vote will last for at least 72 hours unless there is an
> objection
> >> >> or
> >> >> > > insufficient votes.
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > [1]
> https://lists.apache.org/thread/mc0lv4gptm7som02hpob1hdp3hb1ps1v
> >> >> > >
> >> >> > > [2]
> >> >> > >
> >> >> >
> >> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > --
> >> >> > >
> >> >> > > Best regards,
> >> >> > > Mang Zhang
> >> >> >
> >> >>
>
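The planner check described above — verifying that a Catalog instance can be serialized with Java serialization before enabling atomic CTAS — amounts to a round trip through `ObjectOutputStream`. A minimal standalone sketch (not the planner's actual code):

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializabilityCheck {
    // Returns true if the object can be written with Java serialization,
    // which is the property the planner would verify for a Catalog instance.
    static boolean isJavaSerializable(Object o) {
        if (!(o instanceof Serializable)) {
            return false;
        }
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (Exception e) {
            // e.g. a non-serializable field deep inside the object graph
            return false;
        }
    }

    public static void main(String[] args) {
        assert isJavaSerializable("a string is serializable");
        assert !isJavaSerializable(new Object());
        System.out.println("checks passed");
    }
}
```

Note that implementing `Serializable` alone is not enough: the try/catch is needed because a field of the instance may itself be non-serializable, which only surfaces when the write is actually attempted.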


[jira] [Created] (FLINK-28397) Source Supports Speculative Execution For Batch Job

2022-07-05 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-28397:
--

 Summary: Source Supports Speculative Execution For Batch Job
 Key: FLINK-28397
 URL: https://issues.apache.org/jira/browse/FLINK-28397
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.16.0
Reporter: Jing Zhang
Assignee: Jing Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[VOTE] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-04 Thread Jing Zhang
Hi all,

I'd like to start a vote for FLIP-245: Source Supports Speculative
Execution For Batch Job[1] on the discussion thread [2].

The vote will last for at least 72 hours unless there is an objection or
insufficient votes.

Best,
Jing Zhang

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
[2] https://lists.apache.org/thread/zvc5no4yxvwkto7xxpw1vo7j1p6h0lso


Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-04 Thread Jing Zhang
Hi everyone,

Thanks a lot for all the feedback!
I will open a vote for it since there is no more concern.

Best,
Jing Zhang

Jing Zhang  于2022年7月5日周二 11:31写道:

> Hi ZhuZhu, Jiangjie,
>
> Thanks a lot for your feedback.
>
> I agree that it's better to support most existing events.
> I have updated the FLIP to cover how to deal with the
> RequestSplitEvent/SourceEventWrapper/ReaderRegistrationEvent.
>
> The ReportedWatermarkEvent is only used in watermark alignment.
> Watermark alignment is a new feature, and still evolving.
> Moreover, most users will not use this feature in batch cases.
> So I agree not to support it in speculative execution.
>
> Best,
> Jing Zhang
>
> Becket Qin  于2022年7月5日周二 08:38写道:
>
>> Yes, that sounds reasonable to me. That said, supporting custom events
>> might still be preferable if that does not complicate the design too much.
>> It would be good to avoid having a tricky feature availability matrix when
>> we add new features to the project.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On Mon, Jul 4, 2022 at 5:09 PM Zhu Zhu  wrote:
>>
>>> Hi Jiangjie,
>>>
>>> Yes, you are right that the goals of watermark alignment and speculative
>>> execution do not conflict. For the example you gave, we can make it
>>> work by only aligning watermarks for executions that are pipelined
>>> connected (i.e. in the same execution attempt level pipelined region).
>>> Even without considering speculative execution, it looks to be a
>>> possible improvement to watermark alignment, for streaming jobs that
>>> contain embarrassingly parallel job vertices, so that a slow task
>>> does not cause unconnected tasks to be throttled.
>>>
>>> At the moment, given that it is not needed yet and to avoid further
>>> complicating things, I think it's fine to not support watermark
>>> alignment in speculative execution cases.
>>>
>>> WDYT?
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Becket Qin  于2022年7月4日周一 16:15写道:
>>> >
>>> > Hi Zhu,
>>> >
>>> > I agree that if we are talking about a single execution region with
>>> > blocking shuffle, watermark alignment may not be that helpful as the
>>> > subtasks are running independently of each other.
>>> >
>>> > That said, I don't think watermark alignment and speculative execution
>>> > necessarily conflict with each other. The idea of watermark alignment
>>> is to
>>> > make sure the jobs run efficiently, regardless of whether or why the
>>> job
>>> > has performance issues. On the other hand, the purpose of speculative
>>> > execution is to find out whether the jobs have performance issues due
>>> to
>>> > slow tasks, and fix them.
>>> >
>>> > For example, a job has one task whose watermark is always lagging
>>> behind,
>>> > therefore it causes the other tasks to be throttled. The speculative
>>> > execution identified the slow task and decided to run it in another
>>> node,
>>> > thus unblocking the other subtasks.
>>> >
>>> > Thanks,
>>> >
>>> > Jiangjie (Becket) Qin
>>> >
>>> >
>>> >
>>> > On Mon, Jul 4, 2022 at 3:31 PM Zhu Zhu  wrote:
>>> >
>>> > > I had another thought and now I think watermark alignment is actually
>>> > > conceptually conflicted with speculative execution.
>>> > > This is because the idea of watermark alignment is to limit the
>>> progress
>>> > > of all sources to be around the progress of the slowest source in the
>>> > > watermark group. However, speculative execution's goal is to solve
>>> the
>>> > > slow task problem and it never wants to limit the progress of tasks
>>> with
>>> > > the progress of the slow task.
>>> > > Therefore, I think it's fine to not support watermark alignment.
>>> Instead,
>>> > > it should throw an error if watermark alignment is enabled in the
>>> case
>>> > > that speculative execution is enabled.
>>> > >
>>> > > Thanks,
>>> > > Zhu
>>> > >
>>> > > Zhu Zhu  于2022年7月4日周一 14:34写道:
>>> > > >
>>> > > > Thanks for updating the FLIP!
>>> > > >
>>> > > > I agree that at the moment users do not need waterm
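The watermark alignment mechanism under discussion — limiting every source's progress to stay near the slowest source in the watermark group — reduces to one comparison per reader. A hedged sketch, where the method name and the drift parameter are illustrative rather than Flink's actual configuration:

```java
public class WatermarkAlignment {
    // A reader is throttled when its watermark runs ahead of the group's
    // minimum watermark by more than the allowed drift. This is the rule
    // that speculative execution would rather not apply to slow tasks.
    static boolean shouldPause(long readerWatermark,
                               long groupMinWatermark,
                               long maxAllowedDrift) {
        return readerWatermark - groupMinWatermark > maxAllowedDrift;
    }

    public static void main(String[] args) {
        long groupMin = 1_000L;
        assert !shouldPause(1_200L, groupMin, 500L); // within allowed drift
        assert shouldPause(2_000L, groupMin, 500L);  // too far ahead: pause
        System.out.println("alignment checks passed");
    }
}
```

The conflict Zhu points out falls out of this rule directly: the group minimum is dictated by the slowest reader, so alignment throttles fast tasks toward it, while speculative execution exists precisely to stop slow tasks from setting the pace.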

Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-04 Thread Jing Zhang
Hi ZhuZhu, Jiangjie,

Thanks a lot for your feedback.

I agree that it's better to support most existing events.
I have updated the FLIP to cover how to deal with the
RequestSplitEvent/SourceEventWrapper/ReaderRegistrationEvent.

The ReportedWatermarkEvent is only used in watermark alignment.
Watermark alignment is a new feature, and still evolving.
Moreover, most users will not use this feature in batch cases.
So I agree not to support it in speculative execution.

Best,
Jing Zhang

Becket Qin  于2022年7月5日周二 08:38写道:

> Yes, that sounds reasonable to me. That said, supporting custom events
> might still be preferable if that does not complicate the design too much.
> It would be good to avoid having a tricky feature availability matrix when
> we add new features to the project.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Mon, Jul 4, 2022 at 5:09 PM Zhu Zhu  wrote:
>
>> Hi Jiangjie,
>>
>> Yes, you are right that the goals of watermark alignment and speculative
>> execution do not conflict. For the example you gave, we can make it
>> work by only aligning watermarks for executions that are pipelined
>> connected (i.e. in the same execution attempt level pipelined region).
>> Even without considering speculative execution, it looks to be a
>> possible improvement to watermark alignment, for streaming jobs that
>> contain embarrassingly parallel job vertices, so that a slow task
>> does not cause unconnected tasks to be throttled.
>>
>> At the moment, given that it is not needed yet and to avoid further
>> complicating things, I think it's fine to not support watermark
>> alignment in speculative execution cases.
>>
>> WDYT?
>>
>> Thanks,
>> Zhu
>>
>> Becket Qin  于2022年7月4日周一 16:15写道:
>> >
>> > Hi Zhu,
>> >
>> > I agree that if we are talking about a single execution region with
>> > blocking shuffle, watermark alignment may not be that helpful as the
>> > subtasks are running independently of each other.
>> >
>> > That said, I don't think watermark alignment and speculative execution
>> > necessarily conflict with each other. The idea of watermark alignment
>> is to
>> > make sure the jobs run efficiently, regardless of whether or why the job
>> > has performance issues. On the other hand, the purpose of speculative
>> > execution is to find out whether the jobs have performance issues due to
>> > slow tasks, and fix them.
>> >
>> > For example, a job has one task whose watermark is always lagging
>> behind,
>> > therefore it causes the other tasks to be throttled. The speculative
>> > execution identified the slow task and decided to run it in another
>> node,
>> > thus unblocking the other subtasks.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> >
>> > On Mon, Jul 4, 2022 at 3:31 PM Zhu Zhu  wrote:
>> >
>> > > I had another thought and now I think watermark alignment is actually
>> > > conceptually conflicted with speculative execution.
>> > > This is because the idea of watermark alignment is to limit the
>> progress
>> > > of all sources to be around the progress of the slowest source in the
>> > > watermark group. However, speculative execution's goal is to solve the
>> > > slow task problem and it never wants to limit the progress of tasks
>> with
>> > > the progress of the slow task.
>> > > Therefore, I think it's fine to not support watermark alignment.
>> Instead,
>> > > it should throw an error if watermark alignment is enabled in the case
>> > > that speculative execution is enabled.
>> > >
>> > > Thanks,
>> > > Zhu
>> > >
>> > > Zhu Zhu  于2022年7月4日周一 14:34写道:
>> > > >
>> > > > Thanks for updating the FLIP!
>> > > >
>> > > > I agree that at the moment users do not need watermark alignment (in
>> > > > which case ReportedWatermarkEvent would happen) in batch cases.
>> > > > However, I think the concept of watermark alignment is not
>> conflicted
>> > > > with speculative execution. It can work with speculative execution
>> with
>> > > > a little extra effort, by sending the WatermarkAlignmentEvent to all
>> > > > the current executions of each subtask.
>> > > > Therefore, I prefer to support watermark alignment in case it will
>> be
>> > > > needed by batch jobs in the future.
>> >

Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-01 Thread Jing Zhang
Hi all,
After an offline discussion with Jiangjie (Becket) Qin, Guowei, and Zhu Zhu,
I've updated FLIP-245[1] to include:
1. Complete the fault-tolerant processing flow.
2. Support for SourceEvent because it's useful for some user-defined
sources which have a custom event protocol between reader and enumerator.
3. How to handle ReportedWatermarkEvent/ReaderRegistrationEvent messages.

Please review the FLIP-245[1] again, looking forward to your feedback.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job

Jing Zhang  于2022年7月1日周五 18:02写道:

> Hi Guowei,
> Thanks a lot for your feedback.
> Your advice is really helpful. I've updated FLIP-245[1] to include
> these parts.
> > First of all, please complete the fault-tolerant processing flow in the
> FLIP.
>
> After an execution is created and a source operator becomes ready to
> receive events,  subtaskReady is called, SpeculativeSourceCoordinator would
> store the mapping of SubtaskGateway to execution attempt in
> SpeculativeSourceCoordinatorContext.
> Then source operator registers the reader to the coordinator,
> SpeculativeSourceCoordinator would store the mapping of source reader to
> execution attempt in SpeculativeSourceCoordinatorContext.
> If the execution goes through a failover, subtaskFailed is called,
> SpeculativeSourceCoordinator would clear information about this execution,
> including source readers and SubtaskGateway.
> If all the current executions of the execution vertex have failed,
> subtaskReset is called, and SpeculativeSourceCoordinator will clear all
> information about these executions and add the splits back to the split
> enumerator of the source.
>
> > Secondly the FLIP only says that user-defined events are not supported,
> but it does not explain how to deal with the existing
> ReportedWatermarkEvent/ReaderRegistrationEvent.
>
> For ReaderRegistrationEvent:
> When source operator registers the reader to the coordinator,
> SpeculativeSourceCoordinator would also store the mapping of source reader
> to execution attempt in SpeculativeSourceCoordinatorContext. Like
> SourceCoordinator, it also needs to call SplitEnumerator#addReader to add a
> new source reader.
> Besides, in order to distinguish source readers between different
> executions, 'ReaderInfo' needs a new 'attemptId' field.
>
> For ReportedWatermarkEvent:
> ReportedWatermarkEvent is introduced in 1.15 which is used to support
> watermark alignment in streaming mode.
> Speculative execution is only enabled in batch mode. Therefore,
> SpeculativeSourceCoordinator will throw an exception if it receives a
> ReportedWatermarkEvent.
>
> Besides, after an offline discussion with Jiangjie (Becket) Qin, I've added
> support for SourceEvent because it's useful for some user-defined sources
> which have a custom event protocol between reader and enumerator.
>
> Best,
> Jing Zhang
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
>
> Guowei Ma  于2022年6月29日周三 18:06写道:
>
>> Hi, Jing
>>
>> Thanks a lot for writing this FLIP, which is very useful to Batch users.
>> Currently I have only two small questions:
>>
>> 1. First of all, please complete the fault-tolerant processing flow in the
>> FLIP. (Maybe you've already considered it, but it's better to explicitly
>> give the specific solution in the FLIP.)
>> For example, how to handle Source `Reader` in case of error. As far as I
>> know, once the reader is unavailable, it will result in the inability to
>> allocate a new split, which may be unacceptable in the case of speculative
>> execution.
>>
>> 2. Secondly, the FLIP only says that user-defined events are not supported,
>> but it does not explain how to deal with the existing
>> ReportedWatermarkEvent/ReaderRegistrationEvent. After all, in the case of
>> speculative execution, there may be two "same" tasks being executed at the
>> same time. If these events are repeated, whether they really have no
>> effect on the execution of the job still needs a clear evaluation.
>>
>> Best,
>> Guowei
>>
>>
>> On Fri, Jun 24, 2022 at 5:41 PM Jing Zhang  wrote:
>>
>> > Hi all,
>> > One major problem of Flink batch jobs is slow tasks running on hot/bad
>> > nodes, resulting in very long execution time.
>> >
>> > In order to solve this problem, FLIP-168: Speculative Execution for
>> Batch
>> > Job[1] is introduced and approved recently.
>> >
>> > Here, Zhu Zhu and I propose to support speculative execution of sources
>> as
>> >

Re: [DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-07-01 Thread Jing Zhang
Hi Guowei,
Thanks a lot for your feedback.
Your advice is really helpful. I've updated FLIP-245[1] to include
these parts.
> First of all, please complete the fault-tolerant processing flow in the
FLIP.

After an execution is created and a source operator becomes ready to
receive events,  subtaskReady is called, SpeculativeSourceCoordinator would
store the mapping of SubtaskGateway to execution attempt in
SpeculativeSourceCoordinatorContext.
Then source operator registers the reader to the coordinator,
SpeculativeSourceCoordinator would store the mapping of source reader to
execution attempt in SpeculativeSourceCoordinatorContext.
If the execution goes through a failover, subtaskFailed is called,
SpeculativeSourceCoordinator would clear information about this execution,
including source readers and SubtaskGateway.
If all the current executions of the execution vertex have failed,
subtaskReset is called, and SpeculativeSourceCoordinator will clear all
information about these executions and add the splits back to the split
enumerator of the source.

> Secondly the FLIP only says that user-defined events are not supported,
but it does not explain how to deal with the existing
ReportedWatermarkEvent/ReaderRegistrationEvent.

For ReaderRegistrationEvent:
When source operator registers the reader to the coordinator,
SpeculativeSourceCoordinator would also store the mapping of source reader
to execution attempt in SpeculativeSourceCoordinatorContext. Like
SourceCoordinator, it also needs to call SplitEnumerator#addReader to add a
new source reader.
Besides, in order to distinguish source readers between different executions,
'ReaderInfo' needs a new 'attemptId' field.

For ReportedWatermarkEvent:
ReportedWatermarkEvent is introduced in 1.15 which is used to support
watermark alignment in streaming mode.
Speculative execution is only enabled in batch mode. Therefore,
SpeculativeSourceCoordinator will throw an exception if it receives a
ReportedWatermarkEvent.

Besides, after an offline discussion with Jiangjie (Becket) Qin, I've added
support for SourceEvent because it's useful for some user-defined sources
which have a custom event protocol between reader and enumerator.

Best,
Jing Zhang

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
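The fault-tolerance flow above (subtaskReady stores the gateway mapping, reader registration stores the reader mapping, subtaskFailed clears a single attempt, and subtaskReset clears all attempts and returns splits to the enumerator) can be modeled as bookkeeping over a few maps. This is an illustrative sketch, not the actual SpeculativeSourceCoordinator:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative bookkeeping model of the coordinator context; Flink's real
// SpeculativeSourceCoordinator is considerably more involved.
public class CoordinatorContextModel {
    final Map<Integer, String> gatewayByAttempt = new HashMap<>();
    final Map<Integer, String> readerByAttempt = new HashMap<>();
    final Map<Integer, List<String>> assignedSplits = new HashMap<>();
    final List<String> splitEnumeratorPool = new ArrayList<>();

    // subtaskReady: remember the gateway for this execution attempt.
    void subtaskReady(int attemptId, String gateway) {
        gatewayByAttempt.put(attemptId, gateway);
    }

    // Reader registration: remember the reader for this attempt.
    void registerReader(int attemptId, String readerInfo) {
        readerByAttempt.put(attemptId, readerInfo);
    }

    void assignSplit(int attemptId, String split) {
        assignedSplits.computeIfAbsent(attemptId, k -> new ArrayList<>()).add(split);
    }

    // subtaskFailed: drop everything known about this single attempt.
    void subtaskFailed(int attemptId) {
        gatewayByAttempt.remove(attemptId);
        readerByAttempt.remove(attemptId);
    }

    // subtaskReset: all current attempts failed; clear their state and
    // hand the assigned splits back to the enumerator for re-assignment.
    void subtaskReset(Iterable<Integer> attemptIds) {
        for (int id : attemptIds) {
            subtaskFailed(id);
            List<String> splits = assignedSplits.remove(id);
            if (splits != null) {
                splitEnumeratorPool.addAll(splits);
            }
        }
    }

    public static void main(String[] args) {
        CoordinatorContextModel ctx = new CoordinatorContextModel();
        ctx.subtaskReady(0, "gateway-0");
        ctx.registerReader(0, "reader-0");
        ctx.assignSplit(0, "split-a");
        ctx.subtaskReset(List.of(0));
        assert ctx.gatewayByAttempt.isEmpty();
        assert ctx.splitEnumeratorPool.contains("split-a");
        System.out.println("splits returned: " + ctx.splitEnumeratorPool);
    }
}
```

Returning splits to the pool on subtaskReset is the step that addresses Guowei's concern: a failed reader must not strand its unfinished splits.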

Guowei Ma  于2022年6月29日周三 18:06写道:

> Hi, Jing
>
> Thanks a lot for writing this FLIP, which is very useful to Batch users.
> Currently I have only two small questions:
>
> 1. First of all, please complete the fault-tolerant processing flow in the
> FLIP. (Maybe you've already considered it, but it's better to explicitly
> give the specific solution in the FLIP.)
> For example, how to handle Source `Reader` in case of error. As far as I
> know, once the reader is unavailable, it will result in the inability to
> allocate a new split, which may be unacceptable in the case of speculative
> execution.
>
> 2. Secondly, the FLIP only says that user-defined events are not supported,
> but it does not explain how to deal with the existing
> ReportedWatermarkEvent/ReaderRegistrationEvent. After all, in the case of
> speculative execution, there may be two "same" tasks being executed at the
> same time. If these events are repeated, whether they really have no effect
> on the execution of the job still needs a clear evaluation.
>
> Best,
> Guowei
>
>
> On Fri, Jun 24, 2022 at 5:41 PM Jing Zhang  wrote:
>
> > Hi all,
> > One major problem of Flink batch jobs is slow tasks running on hot/bad
> > nodes, resulting in very long execution time.
> >
> > In order to solve this problem, FLIP-168: Speculative Execution for Batch
> > Job[1] is introduced and approved recently.
> >
> > Here, Zhu Zhu and I propose to support speculative execution of sources
> > as one of the follow-ups of FLIP-168. You can find more details in
> > FLIP-245 [2].
> > Looking forward to your feedback.
> >
> > Best,
> > Jing Zhang
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job#FLIP168:SpeculativeExecutionforBatchJob-NointegrationwithFlink'swebUI
> >
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job
> >
>


[DISCUSS] FLIP-245: Source Supports Speculative Execution For Batch Job

2022-06-24 Thread Jing Zhang
Hi all,
One major problem of Flink batch jobs is slow tasks running on hot/bad
nodes, resulting in very long execution time.

In order to solve this problem, FLIP-168: Speculative Execution for Batch
Job[1] is introduced and approved recently.

Here, Zhu Zhu and I propose to support speculative execution of sources as
one of the follow-ups of FLIP-168. You can find more details in FLIP-245 [2].
Looking forward to your feedback.

Best,
Jing Zhang

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job#FLIP168:SpeculativeExecutionforBatchJob-NointegrationwithFlink'swebUI

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-245%3A+Source+Supports+Speculative+Execution+For+Batch+Job


Re: [DISCUSS] Maintain a Calcite repository for Flink to accelerate the development for Flink SQL features

2022-06-22 Thread Jing Zhang
Hi Martijn,
This is really exciting news.
Thanks a lot for the effort to improve collaboration and communication with
the Calcite community.

> My take away from the discussion in the Flink community and the discussion
in the Calcite community is that I believe we should do 3 things.

Agreed on these 3 points.
About keeping up with the Calcite updates, I would like to take on this task.
Is it too late to schedule it for the 1.16 release? How about scheduling this
work for version 1.17?

Best,
Jing Zhang


Martijn Visser  于2022年6月23日周四 02:01写道:

> Hi everyone,
>
> I've recently reached out to the Calcite community to see if we could
> somehow get something done with regards to the PR that Jing Zhang had
> opened a long time ago. In that thread, I also mentioned that we had a
> discussion in the Flink community on potentially forking Calcite. I would
> recommend reading up on the thread [1]. Specifically the replies from other
> projects/PMCs (Apache Drill, Apache Dremio) are super interesting. These
> projects have forked Calcite in the past, regret that move, have reverted
> back to Calcite / are in the process of reverting and are elaborating on
> that. This thread also gained some traction on Twitter in case you're
> interested in more opinions [2].
>
> My take away from the discussion in the Flink community and the discussion
> in the Calcite community is that I believe we should do 3 things:
>
> 1. We should not fork Calcite. There might be short term benefits but long
> term pain. I think we already are suffering from enough long term pain in
> the Flink codebase that we shouldn't take a step that will increase that
> pain even more, scattered over multiple places.
> 2. I think we should try to help out the Calcite community more. Not only
> by opening new PRs for new features, but we can also help by reviewing
> those PRs, reviewing other PRs that could be relevant for Flink or propose
> improvements given our experience at Flink. As you can see in the Calcite
> thread, Timo has already expressed desire in doing so. Part of the OSS
> community is also about helping each other; if we improve Calcite, we will
> also improve Flink.
> 3. I think we need to prioritise keeping up with the Calcite updates. They
> are currently working on releasing version 1.31, while Flink is still at
> 1.26.0. We don't necessarily need to stay in sync with the latest available
> version, but I definitely think we should be at most 2 versions (and
> preferably 1 version) behind (so currently that would be 1.28 and 1.29
> soonish). Not only are we increasing our own tech debt by not updating, we
> are also limiting ourselves in adding new features in the Table/SQL space.
> As you can also see for the 1.26 release notes, there's a warning to only
> use 1.26 for development since it can corrupt your data [3]. There are
> already multiple upgrade tickets for Calcite [4] [5] [6].
>
> [1] https://lists.apache.org/thread/3lkfhwjpqwy9pfhnvwmfkwmwlfyqs45z
> [2]
>
> https://twitter.com/gunnarmorling/status/1539499415337111553?s=21=8fGk3PxScOx4FJPJWE5UeA
> [3] https://calcite.apache.org/news/2020/10/06/release-1.26.0/
> [4] https://issues.apache.org/jira/browse/FLINK-20873
> [5] https://issues.apache.org/jira/browse/FLINK-21239
> [6] https://issues.apache.org/jira/browse/FLINK-27998
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
> https://github.com/MartijnVisser
>
> Op do 5 mei 2022 om 10:34 schreef godfrey he :
>
> > Hi, Timo & Martijn,
> >
> > Sorry for the late reply, thanks for the feedback.
> >
> > I strongly agree that the best solution would be to cooperate more
> > with the Calcite community
> > and maintain all new features and bug fixes in the Calcite community,
> > without any forking.
> > It is a long-term process. I think it's difficult to change community
> > rules, because the Calcite
> > project is a neutral lib that serves multiple projects simultaneously.
> > I don't think fork calcite is the perfect solution, but rather a
> > better balance within limited resources:
> > it's possible to introduce some necessary minor features and bug fixes
> > without having to
> > upgrade to the latest version.
> >
> >
> > I investigated other projects that use Calcite [1] and found that most of
> > them do not use the latest version of Calcite. Even for the Kylin
> > community, the version based on Calcite 1.16.0 has been updated to 70 [2].
> > (Similar projects are Quark and Drill.)
> > My guess is that these projects chose a stable version (or even chose to
> > maintain a fork project) for the sake of stability.
> > When Flink does not need to introduce new syntax

Re: [VOTE] FLIP-168: Speculative Execution for Batch Job

2022-06-20 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

Yun Gao  于2022年6月20日周一 14:25写道:

> +1 (binding)
>
> Thanks for driving the FLIP!
>
> Best,
> Yun Gao
>
>
> --
> From:Guowei Ma 
> Send Time:2022 May 27 (Fri.) 11:16
> To:dev 
> Subject:Re: [VOTE] FLIP-168: Speculative Execution for Batch Job
>
> +1 (binding)
> Best,
> Guowei
>
>
> On Fri, May 27, 2022 at 12:41 AM Shqiprim Bunjaku <
> shqiprimbunj...@gmail.com>
> wrote:
>
> > +1 (non-binding)
> >
> > Best Regards
> > Shqiprim
> >
> > On Thu, May 26, 2022 at 6:22 PM rui fan <1996fan...@gmail.com> wrote:
> >
> > > Hi
> > >
> > > +1(non-binding), it’s very useful for batch job stability.
> > >
> > > Best wishes
> > > fanrui
> > >
> > > On Thu, May 26, 2022 at 15:56 Zhu Zhu  wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Thanks for the feedback on FLIP-168: Speculative Execution for Batch
> > > > Job [1] on the discussion thread [2].
> > > >
> > > > I'd like to start a vote for it. The vote will last for at least 72
> > hours
> > > > unless there is an objection or insufficient votes.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+execution+for+Batch+Job
> > > > [2] https://lists.apache.org/thread/ot352tp8t7mclzx9zfv704gcm0fwrq58
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee

2022-06-13 Thread Jing Zhang
Congratulations, Jingsong!

Best,
Jing Zhang

Leonard Xu  于2022年6月14日周二 10:54写道:

> Congratulations, Jingsong!
>
>
> Best,
> Leonard
>
> > On 2022-06-13 18:52, 刘首维 (Shouwei) wrote:
> >
> > Congratulations and well deserved, Jingsong!
> >
> > Best regards,
> > Shouwei
> >
> > On 2022-06-13 18:09, yuxia  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best regards,
> > Yuxia
> >
> > On 2022-06-13 18:12, Yun Tang  wrote:
> >
> > Congratulations, Jingsong! Well deserved.
> >
> > Best,
> > Yun Tang
> >
> > On 2022-06-13 17:39, Xingbo Huang  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > Xingbo
> >
> > On 2022-06-13, Jane Chan  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > Jane Chan
> >
> > On 2022-06-13 16:43, Shuo Cheng  wrote:
> >
> > Congratulations, Jingsong!
> >
> > On 2022-06-13, Paul Lam  wrote:
> >
> > Congrats, Jingsong! Well deserved!
> >
> > Best,
> > Paul Lam
> >
> > On 2022-06-13 16:31, Lincoln Lee  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > Lincoln Lee
> >
> > On 2022-06-13 16:29, Jark Wu  wrote:
> >
> > Congrats, Jingsong!
> >
> > Cheers,
> > Jark
> >
> > On 2022-06-13 16:16, Jiangang Liu  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > Jiangang Liu
> >
> > On 2022-06-13 16:06, Martijn Visser  wrote:
> >
> > Like everyone has mentioned, this is very well deserved.
> > Congratulations!
> >
> > On 2022-06-13 09:57, Benchao Li  wrote:
> >
> > Congratulations, Jingsong! Well deserved.
> >
> > Best,
> > Benchao Li
> >
> > On 2022-06-13 15:53, Rui Fan  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > Rui Fan
> >
> > On 2022-06-13 15:40, LuNing Wang  wrote:
> >
> > Congratulations, Jingsong!
> >
> > Best,
> > LuNing Wang
> >
> > On 2022-06-13 15:36, Ingo Bürk  wrote:
> >
> > Congrats, Jingsong!
> >
> > On 2022-06-13 08:58, Becket Qin wrote:
> >
> > Hi all,
> >
> > I'm very happy to announce that Jingsong Lee has joined the Flink PMC!
> >
> > Jingsong became a Flink committer in Feb 2020 and has been continuously
> > contributing to the project since then, mainly in Flink SQL. He has been
> > quite active in the mailing list, fixing bugs, helping verify releases,
> > reviewing patches and FLIPs. Jingsong is also devoted to pushing Flink
> > SQL to new use cases. He spent a lot of time implementing the Flink
> > connectors for Apache Iceberg. Jingsong is also the primary driver
> > behind the effort of flink-table-store, which aims to provide a
> > stream-batch unified storage for Flink dynamic tables.
> >
> > Congratulations and welcome, Jingsong!
> >
> > Cheers,
> >
> > Jiangjie (Becket) Qin
> > (On behalf of the Apache Flink PMC)


[jira] [Created] (FLINK-28003) 'pos ' field would be updated to 'POSITION' when use SqlClient

2022-06-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-28003:
--

 Summary: 'pos  ' field would be updated to 'POSITION' when use 
SqlClient
 Key: FLINK-28003
 URL: https://issues.apache.org/jira/browse/FLINK-28003
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client
Affects Versions: 1.15.0
Reporter: Jing Zhang
 Attachments: zj_test.sql

When I run the following SQL in SqlClient using 'sql-client.sh -f zj_test.sql':
{code:java}
create table if not exists db.zj_test(
pos                   int,
rank_cmd              string
)
partitioned by (
`p_date` string,
`p_hourmin` string);

INSERT OVERWRITE TABLE db.zj_test PARTITION (p_date='20220605', p_hourmin = 
'0100')
SELECT
pos ,
rank_cmd
FROM db.sourceT
where p_date = '20220605' and p_hourmin = '0100'; {code}
An error is thrown because the 'pos' field is changed to 'POSITION'.
I guess `SqlCompleter` in the sql-client module must be doing something here.

The error could be reproduced using the attached file.

 

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] FLIP-231: Introduce SupportStatisticReport to support reporting statistics from source connectors

2022-06-09 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

Martijn Visser  于2022年6月9日周四 14:58写道:

> +1 (binding)
>
> Op do 9 jun. 2022 om 04:31 schreef Jingsong Li :
>
> > +1 (binding)
> >
> > Best,
> > Jingsong
> >
> > On Tue, Jun 7, 2022 at 5:20 PM Jark Wu  wrote:
> > >
> > > +1 (binding)
> > >
> > > Best,
> > > Jark
> > >
> > > On Tue, 7 Jun 2022 at 13:44, Jing Ge  wrote:
> > >
> > > > Hi Godfrey,
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Best regards,
> > > > Jing
> > > >
> > > >
> > > > On Tue, Jun 7, 2022 at 4:42 AM godfrey he 
> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Thanks for all the feedback so far. Based on the discussion[1] we
> > seem
> > > > > to have consensus, so I would like to start a vote on FLIP-231 for
> > > > > which the FLIP has now also been updated[2].
> > > > >
> > > > > The vote will last for at least 72 hours (Jun 10th 12:00 GMT)
> unless
> > > > > there is an objection or insufficient votes.
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/88kxk7lh8bq2s2c2qrf06f3pnf9fkxj2
> > > > > [2]
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/pages/resumedraft.action?draftId=211883860=eda17eaa-43f9-4dc1-9a7d-3a9b5a4bae00;
> > > > >
> > > > > Best,
> > > > > Godfrey
> > > > >
> > > >
> >
>


Re: [VOTE] FLIP-226: Introduce Schema Evolution on Table Store

2022-05-15 Thread Jing Zhang
+1
Thanks @Jingsong for driving this topic.

Best,
Jing Zhang

Jingsong Li  于2022年5月12日周四 17:06写道:

> Hi, everyone
>
> Thanks all for your attention to FLIP-226: Introduce Schema Evolution on
> Table Store [1] and participation in the discussion in the mail thread [2].
>
> I'd like to start a vote for it. The vote will be open for at least 72
> hours unless there is an objection or not enough votes.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-226%3A+Introduce+Schema+Evolution+on+Table+Store
> [2] https://lists.apache.org/thread/sls26s8y55tfh59j2dqkgczml6km49jx
>


Re: Re: [VOTE] FLIP-216: Introduce pluggable dialect and plan for migrating Hive dialect

2022-04-28 Thread Jing Zhang
+1 (binding)

Best,
Jing Zhang

Martijn Visser  于2022年4月28日周四 20:11写道:

> +1 (Binding)
>
> On Thu, 28 Apr 2022 at 13:40, ron  wrote:
>
> > +1
> >
> > Best,
> > Ron
> >
> >
> > > -原始邮件-
> > > 发件人: "Jark Wu" 
> > > 发送时间: 2022-04-28 15:46:22 (星期四)
> > > 收件人: dev 
> > > 抄送:
> > > 主题: Re: [VOTE] FLIP-216: Introduce pluggable dialect and plan for
> > migrating Hive dialect
> > >
> > > +1 (binding)
> > >
> > > Thank Yuxia, for driving this work.
> > >
> > > Best,
> > > Jark
> > >
> > > On Thu, 28 Apr 2022 at 11:58, Jingsong Li 
> > wrote:
> > >
> > > > +1 (Binding)
> > > >
> > > > A very good step to move forward.
> > > >
> > > > Best.
> > > > Jingsong
> > > >
> > > > On Wed, Apr 27, 2022 at 9:33 PM yuxia 
> > wrote:
> > > > >
> > > > > Hi, everyone
> > > > >
> > > > > Thanks all for attention to FLIP-216: Introduce pluggable dialect
> and
> > > > plan for migrating Hive dialect [1] and participation in the
> > discussion in
> > > > the mail thread [2].
> > > > >
> > > > > I'd like to start a vote for it. The vote will be open for at least
> > 72
> > > > hours unless there is an objection or not enough votes.
> > > > >
> > > > > [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-216%3A++Introduce+pluggable+dialect+and+plan+for+migrating+Hive+dialect
> > > > > [2] https://lists.apache.org/thread/66g79w5zlod2ylyv8k065j57pjjmv1jo
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Yuxia
> > > >
> >
> >
> > --
> > Best,
> > Ron
> >
>


Re: [DISCUSS] Maintain a Calcite repository for Flink to accelerate the development for Flink SQL features

2022-04-23 Thread Jing Zhang
Hi Godfrey,
I would like to share some problems based on my past experience.
1. It's not easy to push new features through the Calcite community.
As @Martijn mentioned, https://issues.apache.org/jira/browse/CALCITE-4865 /
https://github.com/apache/calcite/pull/2606 is such an example.
I tried many ways, for example, sending review requests to the dev mailing
list and leaving comments in the JIRA and in pull requests,
and had to give up in the end. Sorry for that.
2. On the other hand, some new features in Calcite are radical.
Take https://issues.apache.org/jira/browse/CALCITE-4173, which met
strong opposition in the Calcite community:
it was merged in the end and caused unexpected problems, such as wrong
results (https://issues.apache.org/jira/browse/FLINK-24708)
and other related bugs.
3. Every time we upgrade the Calcite version, we cross multiple
versions, resulting in a slow upgrade process and
uncontrolled results, often causing unexpected problems.

Thank @Godfrey for driving this discussion in a big scope.
I think it's a good chance to review these problems and find a solution.

Best,
Jing Zhang

godfrey he  于2022年4月22日周五 21:40写道:

> Hi Chesnay,
>
> There is no bug fix version until now.
> You can find the releases in https://github.com/apache/calcite/tags
>
> Best,
> Godfrey
>
> Chesnay Schepler  于2022年4月22日周五 18:48写道:
> >
> > I find it a bit weird that the supposed only way to get a bug fix is to
> > do a big version upgrade.
> > Is Calcite not creating bugfix releases?
> >
> > On 22/04/2022 12:26, godfrey he wrote:
> > > Thanks for the feedback, guys!
> > >
> > > For Jingsong's feedback:
> > >> ## Do we have the plan to upgrade calcite to 1.31?
> > > I think we will upgrade Calcite to 1.31 only when Flink depends on
> > > some significant features of Calcite.
> > >   Such as: new syntax PTF (CALCITE-4865).
> > >
> > >> ## Is cherry-picking costly?
> > > From the experience of maintaining Calcite within our company, the cost
> > > is small.
> > > We only cherry-pick the bug fixes and needed minor features.
> > > For a major feature, we can choose to upgrade the version.
> > >
> > >> ## Is the Calcite repository costly to maintain?
> > > From the experience of @Danny Chan (a Calcite PMC member), publishing
> > > is much easier.
> > >
> > >
> > > For Chesnay's feedback:
> > > I also totally agree that a fork repository will increase the cost of
> > > maintenance.
> > >
> > > Usually, the Calcite community releases a version every three months or
> > > more. I think it's hard to get Calcite to change its release cycle
> > > because Calcite supports many compute engines.
> > >
> > >
> > > For Konstantin's feedback:
> > > Some changes in Calcite may cause hundreds of plan changes in Flink,
> > > such as: CALCITE-4173.
> > > We must check whether the change is expected, whether there is
> > > performance regression.
> > > Some of the changes are very subtle, especially in the CBO planner.
> > > This situation also occurs similarly within upgrading from 1.1x to
> 1.22.
> > > If you are not familiar with Flink planner and Calcite, it will be
> > > more difficult to upgrade.
> > >
> > >
> > > For Xintong's feedback:
> > > You are right, I will connect Yun for some help, Thanks for the
> suggestions.
> > >
> > >
> > > For Martijn's feedback:
> > > I'm also against cherry-picking many feature commits into the fork
> repository,
> > > and I also totally agree we should collaborate closely with the
> > > Calcite community.
> > > I'm just trying to find an approach which can avoid frequent Calcite
> upgrades,
> > > but easily support bug fix and minor new feature development.
> > >
> > > As for the CALCITE-4865 case, I think we should upgrade the Calcite
> > > version to support PTF.
> > >
> > > @Jing zhang, can you share some 'feeling' for CALCITE-4865 ?
> > >
> > > Best,
> > > Godfrey
> > >
> > > Martijn Visser  于2022年4月22日周五 17:31写道:
> > >> Hi everyone,
> > >>
> > >> Overall I'm against the idea of setting up a Calcite fork for the same
> > >> reasons that Chesnay has mentioned. We've talked extensively about
> doing an
> > >> upgrade of Calcite during the Flink 1.15 release period, but there
> was a
> > >> lot of pushback by the maintainers against that because of the
> required
> > >> efforts. Having our own fork will mean that there 

Re: [VOTE] FLIP-214: Support Advanced Function DDL

2022-04-21 Thread Jing Zhang
Ron,
+1 (binding)

Thanks for driving this FLIP.

Best,
Jing Zhang

Jark Wu  于2022年4月21日周四 11:31写道:

> Thanks for driving this work @Ron,
>
> +1 (binding)
>
> Best,
> Jark
>
> On Thu, 21 Apr 2022 at 10:42, Mang Zhang  wrote:
>
> > +1
> >
> > --
> >
> > Best regards,
> > Mang Zhang
> >
> >
> >
> >
> >
> > At 2022-04-20 18:28:28, "刘大龙"  wrote:
> > >Hi, everyone
> > >
> > >
> > >
> > >
> > >I'd like to start a vote on FLIP-214: Support Advanced Function DDL [1]
> > which has been discussed in [2].
> > >
> > >The vote will be open for at least 72 hours unless there is an objection
> > or not enough votes.
> > >
> > >
> > >
> > >
> > >[1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-214+Support+Advanced+Function+DDL
> > >
> > >[2] https://lists.apache.org/thread/7m5md150qgodgz1wllp5plx15j1nowx8
> > >
> > >
> > >
> > >
> > >Best,
> > >
> > >Ron
> >
>


Re: [VOTE] Release 1.15.0, release candidate #1

2022-04-12 Thread Jing Zhang
Hi, Yun Gao
There is a new bug [1] introduced in release-1.15.
It would be better to fix it in the 1.15.0 release.
I'm terribly sorry for merging the pull request so late.

[1] https://issues.apache.org/jira/browse/FLINK-26681

Best,
Jing Zhang

Johannes Moser  于2022年4月12日周二 14:53写道:

> Here is the missing link to the announcement blogpost [6]
>
> [6] https://github.com/apache/flink-web/pull/526
>
> > On 11.04.2022, at 20:12, Yun Gao  wrote:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 1.15.0, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be deployed to dist.apache.org [2], which are signed with the key with
> fingerprint CBE82BEFD827B08AFA843977EDBF922A7BC84897 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-1.15.0-rc1" [5],
> > * website pull request listing the new release and adding announcement
> blog post [6].
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> > Thanks,
> > Joe, Till and Yun Gao
> > [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12350442
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.15.0-rc1/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1494/
> > [5] https://github.com/apache/flink/releases/tag/release-1.15.0-rc1
> > [6] https://github.com/apache/flink-web/pull/526
>
>


[jira] [Created] (FLINK-27086) Add a QA about how to handle exception when use hive parser in hive dialect document

2022-04-06 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-27086:
--

 Summary: Add a QA about how to handle exception when use hive 
parser in hive dialect document
 Key: FLINK-27086
 URL: https://issues.apache.org/jira/browse/FLINK-27086
 Project: Flink
  Issue Type: Technical Debt
  Components: Documentation
Affects Versions: 1.15.0
Reporter: Jing Zhang
 Fix For: 1.15.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27020) use hive dialect in SqlClient would thrown an error based on 1.15 version

2022-04-02 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-27020:
--

 Summary: use hive dialect in SqlClient would thrown an error based 
on 1.15 version
 Key: FLINK-27020
 URL: https://issues.apache.org/jira/browse/FLINK-27020
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client
Affects Versions: 1.15.0
Reporter: Jing Zhang
 Attachments: image-2022-04-02-20-28-01-335.png

I used 1.15 rc0 and encountered a problem.
An error is thrown if I use the Hive dialect in SqlClient.
 !image-2022-04-02-20-28-01-335.png! 
And I already added flink-sql-connector-hive-2.3.6_2.12-1.15-SNAPSHOT.jar.
I note that loading and using the Hive module works fine, but using the Hive
dialect fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27019) use hive dialect in SqlClient would thrown an error based on 1.15 version

2022-04-02 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-27019:
--

 Summary: use hive dialect in SqlClient would thrown an error based 
on 1.15 version
 Key: FLINK-27019
 URL: https://issues.apache.org/jira/browse/FLINK-27019
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client
Affects Versions: 1.15.0
Reporter: Jing Zhang
 Attachments: image-2022-04-02-20-25-25-169.png

I used 1.15 rc0 and encountered a problem.
An error is thrown if I use the Hive dialect in SqlClient.
 !image-2022-04-02-20-25-25-169.png! 

And I already added flink-sql-connector-hive-2.3.6_2.12-1.15-SNAPSHOT.jar.
I note that loading and using the Hive module works fine, but using the Hive
dialect fails.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [ANNOUNCE] New PMC member: Yuan Mei

2022-03-14 Thread Jing Zhang
Congratulations, Yuan!

Best,
Jing Zhang

Jing Ge  于2022年3月14日周一 18:15写道:

> Congrats! Very well deserved!
>
> Best,
> Jing
>
> On Mon, Mar 14, 2022 at 10:34 AM Piotr Nowojski 
> wrote:
>
> > Congratulations :)
> >
> > pon., 14 mar 2022 o 09:59 Yun Tang  napisał(a):
> >
> > > Congratulations, Yuan!
> > >
> > > Best,
> > > Yun Tang
> > > 
> > > From: Zakelly Lan 
> > > Sent: Monday, March 14, 2022 16:55
> > > To: dev@flink.apache.org 
> > > Subject: Re: [ANNOUNCE] New PMC member: Yuan Mei
> > >
> > > Congratulations, Yuan!
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Mon, Mar 14, 2022 at 4:49 PM Johannes Moser 
> > wrote:
> > >
> > > > Congrats Yuan.
> > > >
> > > > > On 14.03.2022, at 09:45, Arvid Heise  wrote:
> > > > >
> > > > > Congratulations and well deserved!
> > > > >
> > > > > On Mon, Mar 14, 2022 at 9:30 AM Matthias Pohl 
> > > wrote:
> > > > >
> > > > >> Congratulations, Yuan.
> > > > >>
> > > > >> On Mon, Mar 14, 2022 at 9:25 AM Shuo Cheng 
> > > wrote:
> > > > >>
> > > > >>> Congratulations, Yuan!
> > > > >>>
> > > > >>> On Mon, Mar 14, 2022 at 4:22 PM Anton Kalashnikov <
> > > kaa@yandex.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Congratulations, Yuan!
> > > > >>>>
> > > > >>>> --
> > > > >>>>
> > > > >>>> Best regards,
> > > > >>>> Anton Kalashnikov
> > > > >>>>
> > > > >>>> 14.03.2022 09:13, Leonard Xu пишет:
> > > > >>>>> Congratulations Yuan!
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>> Leonard
> > > > >>>>>
> > > > >>>>>> 2022年3月14日 下午4:09,Yangze Guo  写道:
> > > > >>>>>>
> > > > >>>>>> Congratulations!
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Yangze Guo
> > > > >>>>>>
> > > > >>>>>> On Mon, Mar 14, 2022 at 4:08 PM Martijn Visser <
> > > > >>>> martijnvis...@apache.org> wrote:
> > > > >>>>>>> Congratulations Yuan!
> > > > >>>>>>>
> > > > >>>>>>> On Mon, 14 Mar 2022 at 09:02, Yu Li 
> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi all!
> > > > >>>>>>>>
> > > > >>>>>>>> I'm very happy to announce that Yuan Mei has joined the
> Flink
> > > PMC!
> > > > >>>>>>>>
> > > > >>>>>>>> Yuan is helping the community a lot with creating and
> > validating
> > > > >>>> releases,
> > > > >>>>>>>> contributing to FLIP discussions and good code contributions
> > to
> > > > >> the
> > > > >>>>>>>> state backend and related components.
> > > > >>>>>>>>
> > > > >>>>>>>> Congratulations and welcome, Yuan!
> > > > >>>>>>>>
> > > > >>>>>>>> Best Regards,
> > > > >>>>>>>> Yu (On behalf of the Apache Flink PMC)
> > > > >>>>>>>>
> > > > >>>> --
> > > > >>>>
> > > > >>>> Best regards,
> > > > >>>> Anton Kalashnikov
> > > > >>>>
> > > > >>>>
> > > > >>
> > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Martijn Visser

2022-03-08 Thread Jing Zhang
Congratulations, Martijn!

Best,
Jing Zhang

Zhilong Hong  于2022年3月5日周六 01:17写道:

> Congratulations, Martijn! Well deserved!
>
> Best,
> Zhilong
>
> On Sat, Mar 5, 2022 at 1:09 AM Piotr Nowojski 
> wrote:
>
> > Congratulations!
> >
> > Piotrek
> >
> > pt., 4 mar 2022 o 16:05 Aitozi  napisał(a):
> >
> > > Congratulations Martijn!
> > >
> > > Best,
> > > Aitozi
> > >
> > > Martijn Visser  于2022年3月4日周五 17:10写道:
> > >
> > > > Thank you all!
> > > >
> > > > On Fri, 4 Mar 2022 at 10:00, Niels Basjes  wrote:
> > > >
> > > > > Congratulations Martijn!
> > > > >
> > > > > On Fri, Mar 4, 2022 at 9:43 AM Johannes Moser 
> > > wrote:
> > > > >
> > > > > > Congratulations Martijn,
> > > > > >
> > > > > > Well deserved.
> > > > > >
> > > > > > > On 03.03.2022, at 16:49, Robert Metzger 
> > > wrote:
> > > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > On behalf of the PMC, I'm very happy to announce Martijn Visser
> > as
> > > a
> > > > > new
> > > > > > > Flink committer.
> > > > > > >
> > > > > > > Martijn is a very active Flink community member, driving a lot
> of
> > > > > efforts
> > > > > > > on the dev@flink mailing list. He also pushes projects such as
> > > > > replacing
> > > > > > > Google Analytics with Matomo, so that we can generate our web
> > > > analytics
> > > > > > > within the Apache Software Foundation.
> > > > > > >
> > > > > > > Please join me in congratulating Martijn for becoming a Flink
> > > > > committer!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Robert
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Best regards / Met vriendelijke groeten,
> > > > >
> > > > > Niels Basjes
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - David Morávek

2022-03-07 Thread Jing Zhang
Congratulations David!

Ryan Skraba  于2022年3月7日周一 22:18写道:

> Congratulations David!
>
> On Mon, Mar 7, 2022 at 9:54 AM Jan Lukavský  wrote:
>
> > Congratulations David!
> >
> >   Jan
> >
> > On 3/7/22 09:44, Etienne Chauchot wrote:
> > > Congrats David !
> > >
> > > Well deserved !
> > >
> > > Etienne
> > >
> > > Le 07/03/2022 à 08:47, David Morávek a écrit :
> > >> Thanks everyone!
> > >>
> > >> Best,
> > >> D.
> > >>
> > >> On Sun 6. 3. 2022 at 9:07, Yuan Mei  wrote:
> > >>
> > >>> Congratulations, David!
> > >>>
> > >>> Best Regards,
> > >>> Yuan
> > >>>
> > >>> On Sat, Mar 5, 2022 at 8:13 PM Roman Khachatryan 
> > >>> wrote:
> > >>>
> >  Congratulations, David!
> > 
> >  Regards,
> >  Roman
> > 
> >  On Fri, Mar 4, 2022 at 7:54 PM Austin Cawley-Edwards
> >   wrote:
> > > Congrats David!
> > >
> > > On Fri, Mar 4, 2022 at 12:18 PM Zhilong Hong  >
> >  wrote:
> > >> Congratulations, David!
> > >>
> > >> Best,
> > >> Zhilong
> > >>
> > >> On Sat, Mar 5, 2022 at 1:09 AM Piotr Nowojski <
> pnowoj...@apache.org
> > >
> > >> wrote:
> > >>
> > >>> Congratulations :)
> > >>>
> > >>> pt., 4 mar 2022 o 16:04 Aitozi 
> napisał(a):
> > >>>
> >  Congratulations David!
> > 
> >  Ingo Bürk  于2022年3月4日周五 22:56写道:
> > 
> > > Congrats, David!
> > >
> > > On 04.03.22 12:34, Robert Metzger wrote:
> > >> Hi everyone,
> > >>
> > >> On behalf of the PMC, I'm very happy to announce David
> > >>> Morávek
> >  as a
> > >>> new
> > >> Flink committer.
> > >>
> > >> His first contributions to Flink date back to 2019. He has
> > >>> been
> > >> increasingly active with reviews and driving major
> > >>> initiatives
> >  in
> > >> the
> > >> community. David brings valuable experience from being a
> >  committer
> > >> in
> >  the
> > >> Apache Beam project to Flink.
> > >>
> > >>
> > >> Please join me in congratulating David for becoming a Flink
> > >>> committer!
> > >> Cheers,
> > >> Robert
> > >>
> >
>


Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-07 Thread Jing Zhang
Hi Martijn,

Thanks for driving this discussion.

+1 on efforts on more hive syntax compatibility.

With the efforts on batch processing in recent versions (1.10~1.15), many
users have run batch processing jobs based on Flink.
In our team, we are trying to migrate most of the existing online batch
jobs from Hive/Spark to Flink. We hope this migration does not require
users to modify their SQL.
Although Hive is not as popular as it used to be, Hive SQL is still alive
because many users still use Hive SQL to run Spark jobs.
Therefore, compatibility with more Hive syntax is critical to this
migration work.

Best,
Jing Zhang



Martijn Visser  于2022年3月7日周一 19:23写道:

> Hi everyone,
>
> Flink currently has 4 APIs with multiple language support which can be used
> to develop applications:
>
> * DataStream API, both Java and Scala
> * Table API, both Java and Scala
> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
> * Python API
>
> Since FLIP-152 [1] the Flink SQL support has been extended to also support
> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
> more syntax compatibility issues.
>
> I would like to open a discussion on Flink directly supporting the Hive
> query syntax. I have some concerns if having a 100% Hive query syntax is
> indeed something that we should aim for in Flink.
>
> I can understand that having Hive query syntax support in Flink could help
> users due to interoperability and being able to migrate. However:
>
> - Adding full Hive query syntax support will mean that we go from 6 fully
> supported API/language combinations to 7. I think we are currently already
> struggling with maintaining the existing combinations, let alone one
> more.
> - Apache Hive is/appears to be a project that's not that actively developed
> anymore. The last release was made in January 2021. Its popularity is
> rapidly declining in Europe and the United States, also due to Hadoop
> becoming less popular.
> - Related to the previous topic, other software like Snowflake,
> Trino/Presto, Databricks are becoming more and more popular. If we add full
> support for the Hive query syntax, then why not add support for Snowflake
> and the others?
> - We are supporting Hive versions that are no longer supported by the Hive
> community with known security vulnerabilities. This makes Flink also
> vulnerable to those types of vulnerabilities.
> - The current Hive implementation is done by using a lot of internals of
> Flink, making Flink hard to maintain, with lots of tech debt and making
> things overly complex.
>
> From my perspective, I think it would be better to not have Hive query
> syntax compatibility directly in Flink itself. Of course we should have a
> proper Hive connector and a proper Hive catalog to make connectivity with
> Hive (the versions that are still supported by the Hive community) itself
> possible. Alternatively, if Hive query syntax is so important, it should
> not rely on internals but be available as a dialect/pluggable option. That
> could also open up the possibility to add more syntax support for others in
> the future, but I really think we should just focus on Flink SQL itself.
> That's already hard enough to maintain and improve on.
>
> I'm looking forward to the thoughts of both Developers and Users, so I'm
> cross-posting to both mailing lists.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
>


Re: Re: [ANNOUNCE] New Apache Flink Committers: Feng Wang, Zhipeng Zhang

2022-02-18 Thread Jing Zhang
Congratulations!

Best,
Jing Zhang

Benchao Li  于2022年2月18日周五 20:15写道:

> Congratulations!
>
> Jing Ge  于2022年2月18日周五 17:03写道:
>
> > Congrats! Very well deserved!
> >
> > Best regards,
> > Jing
> >
> > On Fri, Feb 18, 2022 at 9:16 AM Yu Li  wrote:
> >
> > > Congratulations!
> > >
> > > Best Regards,
> > > Yu
> > >
> > >
> > > On Fri, 18 Feb 2022 at 14:19, godfrey he  wrote:
> > >
> > > > Congratulations!
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Lincoln Lee  于2022年2月18日周五 14:05写道:
> > > > >
> > > > > Congratulations Feng and Zhipeng!
> > > > >
> > > > > Best,
> > > > > Lincoln Lee
> > > > >
> > > > >
> > > > > Yuepeng Pan  于2022年2月18日周五 12:45写道:
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Congratulations!
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Yuepeng Pan
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > 在 2022-02-18 12:41:18,"Jingsong Li"  写道:
> > > > > > >Congratulations!
> > > > > > >
> > > > > > >Best,
> > > > > > >Jingsong
> > > > > > >
> > > > > > >On Fri, Feb 18, 2022 at 9:47 AM Zhipeng Zhang <
> > > > zhangzhipe...@gmail.com>
> > > > > > wrote:
> > > > > > >>
> > > > > > >> Thank you everyone for the warm welcome!
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Zhipeng
> > > > > > >>
> > > > > > >> Jinzhong Li  于2022年2月17日周四 19:58写道:
> > > > > > >>
> > > > > > >> > Congratulations!
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> >
> > > > > > >> > Jinzhong
> > > > > > >> >
> > > > > > >> > Robert Metzger  于2022年2月16日周三 21:32写道:
> > > > > > >> >
> > > > > > >> > > Hi everyone,
> > > > > > >> > >
> > > > > > >> > > On behalf of the PMC, I'm very happy to announce two new
> > Flink
> > > > > > >> > > committers: Feng Wang and Zhipeng Zhang!
> > > > > > >> > >
> > > > > > >> > > Feng is one of the most active Flink evangelists in China,
> > > with
> > > > > > plenty of
> > > > > > >> > > public talks, blog posts and other evangelization
> > activities.
> > > > The
> > > > > > PMC
> > > > > > >> > wants
> > > > > > >> > > to recognize and value these efforts by making Feng a
> > > committer!
> > > > > > >> > >
> > > > > > >> > > Zhipeng Zhang has made significant contributions to
> > flink-ml,
> > > > like
> > > > > > most
> > > > > > >> > of
> > > > > > >> > > the FLIPs for our ML efforts.
> > > > > > >> > >
> > > > > > >> > > Please join me in welcoming them as committers!
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Best,
> > > > > > >> > > Robert
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> best,
> > > > > > >> Zhipeng
> > > > > >
> > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>


Re: [ANNOUNCE] New Flink PMC members: Igal Shilman, Konstantin Knauf and Yun Gao

2022-02-18 Thread Jing Zhang
Congratulations!

Best,
Jing Zhang

Benchao Li  于2022年2月18日周五 20:22写道:

> Congratulations!
>
> Yu Li  于2022年2月18日周五 19:34写道:
>
> > Congratulations!
> >
> > Best Regards,
> > Yu
> >
> >
> > On Fri, 18 Feb 2022 at 14:19, godfrey he  wrote:
> >
> > > Congratulations!
> > >
> > > Best,
> > > Godfrey
> > >
> > > Lincoln Lee  于2022年2月18日周五 14:07写道:
> > > >
> > > > Congratulations!
> > > >
> > > > Best,
> > > > Lincoln Lee
> > > >
> > > >
> > > > Jingsong Li  于2022年2月18日周五 12:42写道:
> > > >
> > > > > Congratulations!
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Thu, Feb 17, 2022 at 8:08 PM Jinzhong Li <
> > lijinzhong2...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Congratulations!
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Jinzhong
> > > > > >
> > > > > > On Wed, Feb 16, 2022 at 9:23 PM Robert Metzger <
> > rmetz...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I would like to formally announce a few new Flink PMC members
> on
> > > the
> > > > > dev@
> > > > > > > list. The PMC has not done a good job of always announcing new
> > PMC
> > > > > members
> > > > > > > (and committers) recently. I'll try to keep an eye on this in
> the
> > > > > future to
> > > > > > > improve the situation.
> > > > > > >
> > > > > > > Nevertheless, I'm very happy to announce some very active
> > community
> > > > > members
> > > > > > > as new PMC members:
> > > > > > >
> > > > > > > - Igal Shilman, added to the PMC in October 2021
> > > > > > > - Konstantin Knauf, added to the PMC in January 2022
> > > > > > > - Yun Gao, added to the PMC in February 2022
> > > > > > >
> > > > > > > Please join me in welcoming them to the Flink PMC!
> > > > > > >
> > > > > > > Best,
> > > > > > > Robert
> > > > > > >
> > > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>


Re: [VOTE] Deprecate Per-Job Mode in Flink 1.15

2022-01-28 Thread Jing Zhang
+1 (binding)

Thanks Konstantin for driving this.

Best,
Jing Zhang

Chenya Zhang  于2022年1月29日周六 07:04写道:

> +1 (non-binding)
>
> On Fri, Jan 28, 2022 at 12:46 PM Thomas Weise  wrote:
>
> > +1 (binding)
> >
> > On Fri, Jan 28, 2022 at 9:27 AM David Morávek  wrote:
> >
> > > +1 (non-binding)
> > >
> > > D.
> > >
> > > On Fri 28. 1. 2022 at 17:53, Till Rohrmann 
> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Jan 28, 2022 at 4:57 PM Gabor Somogyi <
> > gabor.g.somo...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > We intend to run tests when FLINK-24897
> > > > > <https://issues.apache.org/jira/browse/FLINK-24897> is fixed.
> > > > > In case of further issues we're going to create further JIRAs.
> > > > >
> > > > > BR,
> > > > > G
> > > > >
> > > > >
> > > > > On Fri, Jan 28, 2022 at 4:30 PM Konstantin Knauf <
> kna...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Based on the discussion in [1], I would like to start a vote on
> > > > > deprecating
> > > > > > per-job mode in Flink 1.15. Consequently, we would target to drop
> > it
> > > in
> > > > > > Flink 1.16 or Flink 1.17 latest.
> > > > > >
> > > > > > The only limitation that would block dropping Per-Job mode
> > mentioned
> > > in
> > > > > [1]
> > > > > > is tracked in https://issues.apache.org/jira/browse/FLINK-24897.
> > In
> > > > > > general, the implementation of application mode in YARN should be
> > on
> > > > par
> > > > > > with the standalone and Kubernetes before we drop per-job mode.
> > > > > >
> > > > > > The vote will last for at least 72 hours, and will be accepted
> by a
> > > > > > consensus of active committers.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Konstantin
> > > > > >
> > > > > > [1]
> > https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Konstantin Knauf
> > > > > >
> > > > > > https://twitter.com/snntrable
> > > > > >
> > > > > > https://github.com/knaufk
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Releasing Flink 1.13.6

2022-01-25 Thread Jing Zhang
+1 for releasing Flink 1.13.6
Thanks Martijn and Konstantin for driving this.

Best,
Jing Zhang

David Morávek  于2022年1月25日周二 19:13写道:

> Thanks for driving this Martijn, +1 for the release
>
> Also big thanks to Konstantin for volunteering
>
> Best,
> D.
>
> On Mon, Jan 24, 2022 at 3:24 PM Till Rohrmann 
> wrote:
>
> > +1 for the 1.13.6 release and thanks for volunteering Konstantin.
> >
> > Cheers,
> > Till
> >
> > On Mon, Jan 24, 2022 at 2:57 PM Konstantin Knauf 
> > wrote:
> >
> > > Thanks for starting the discussion and +1 to releasing.
> > >
> > > I am happy to manage the release aka learn how to do this.
> > >
> > > Cheers,
> > >
> > > Konstantin
> > >
> > > On Mon, Jan 24, 2022 at 2:52 PM Martijn Visser 
> > > wrote:
> > >
> > > > I would like to start a discussion on releasing Flink 1.13.6. Flink
> > > 1.13.5
> > > > was the latest release on the 16th of December, which was the
> emergency
> > > > release for the Log4j CVE [1]. Flink 1.13.4 was cancelled, leaving
> > Flink
> > > > 1.13.3 as the last real bugfix release. This one was released on the
> > 19th
> > > > of October last year.
> > > >
> > > > Since then, there have been 61 fixed tickets, excluding the test
> > > > stabilities [3]. This includes a blocker and a couple of critical
> > issues.
> > > >
> > > > Is there a PMC member who would like to manage the release? I'm more
> > than
> > > > happy to help with monitoring the status of the tickets.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn Visser
> > > > https://twitter.com/MartijnVisser82
> > > >
> > > > [1]
> https://flink.apache.org/news/2021/12/16/log4j-patch-releases.html
> > > > [2] https://flink.apache.org/news/2021/10/19/release-1.13.3.html
> > > > [3] JQL filter: project = FLINK AND resolution = Fixed AND
> fixVersion =
> > > > 1.13.6 AND labels != test-stability ORDER BY priority DESC, created
> > DESC
> > > >
> > >
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> > >
> >
>


Re: PR Review Request

2022-01-25 Thread Jing Zhang
Please ignore me.
I originally wanted to send it to Calcite's dev mailing list, but I sent it
to the wrong mailing list.
I'm terribly sorry.

Jing Zhang  于2022年1月26日周三 14:55写道:

> Hi community,
> My apologies for interrupting.
> Could anyone help review the PR
> https://github.com/apache/calcite/pull/2606?
> Thanks a lot.
>
> CALCITE-4865 is the first sub-task of CALCITE-4864. This Jira aims to
> extend the existing table function support for Polymorphic Table
> Functions, which were introduced as part of ANSI SQL 2016.
>
> The brief change logs of the PR are:
>   - Update `Parser.jj` to support partition by clause and order by clause
> for input table with set semantics of PTF
>   - Introduce `TableCharacteristics` which contains three characteristics
> of input table of table function
>   - Update `SqlTableFunction` to add a method `tableCharacteristics`; the
> method returns the table characteristics for the ordinal-th argument of
> this table function. The default return value is Optional.empty, which
> means the ordinal-th argument is not a table.
>   - Introduce `SqlSetSemanticsTable` which represents input table with set
> semantics of Table Function, its `SqlKind` is `SET_SEMANTICS_TABLE`
>   - Update `SqlValidatorImpl` to validate that only a set-semantics table
> of a table function may have partition by and order by clauses
>   - Update `SqlToRelConverter#substituteSubQuery` to parse subQuery which
> represents set semantics table.
>
> PR: https://github.com/apache/calcite/pull/2606
> JIRA: https://issues.apache.org/jira/browse/CALCITE-4865
> Parent JIRA: https://issues.apache.org/jira/browse/CALCITE-4864
>
> Best,
> Jing Zhang
>


PR Review Request

2022-01-25 Thread Jing Zhang
Hi community,
My apologies for interrupting.
Could anyone help review the PR
https://github.com/apache/calcite/pull/2606?
Thanks a lot.

CALCITE-4865 is the first sub-task of CALCITE-4864. This Jira aims to
extend the existing table function support for Polymorphic Table
Functions, which were introduced as part of ANSI SQL 2016.

The brief change logs of the PR are:
  - Update `Parser.jj` to support partition by clause and order by clause
for input table with set semantics of PTF
  - Introduce `TableCharacteristics` which contains three characteristics
of input table of table function
  - Update `SqlTableFunction` to add a method `tableCharacteristics`; the
method returns the table characteristics for the ordinal-th argument of
this table function. The default return value is Optional.empty, which means
the ordinal-th argument is not a table.
  - Introduce `SqlSetSemanticsTable` which represents input table with set
semantics of Table Function, its `SqlKind` is `SET_SEMANTICS_TABLE`
  - Update `SqlValidatorImpl` to validate that only a set-semantics table of
a table function may have partition by and order by clauses
  - Update `SqlToRelConverter#substituteSubQuery` to parse subQuery which
represents set semantics table.
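The `tableCharacteristics` contract above can be sketched roughly as follows — a plain-Python model for illustration only; the real API is Java in Calcite, and the class shape and field names here are assumptions, not the actual signature:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TableCharacteristics:
    # "SET": the function's result depends on a whole partition of the
    # input table; "ROW": it depends on one input row at a time.
    semantics: str

class SessionizeTvf:
    """Toy table function whose 0th argument is a table with set semantics."""

    def table_characteristics(self, ordinal: int) -> Optional[TableCharacteristics]:
        # Only the 0th argument is a table; for any other ordinal the
        # method answers "not a table" (Optional.empty in the Java API).
        if ordinal == 0:
            return TableCharacteristics(semantics="SET")
        return None

tvf = SessionizeTvf()
assert tvf.table_characteristics(0).semantics == "SET"  # table argument
assert tvf.table_characteristics(1) is None             # scalar argument
```

The validator rule in the change log then reads naturally: PARTITION BY / ORDER BY on an argument is legal only when `table_characteristics(ordinal)` reports set semantics for it.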

PR: https://github.com/apache/calcite/pull/2606
JIRA: https://issues.apache.org/jira/browse/CALCITE-4865
Parent JIRA: https://issues.apache.org/jira/browse/CALCITE-4864

Best,
Jing Zhang


[RESULT][VOTE] FLIP-204: Introduce Hash Lookup Join

2022-01-24 Thread Jing Zhang
Hi community,

The voting time for FLIP-204: Introduce Hash Lookup Join[1][2][3] has
passed. I'm closing the vote now.

FLIP-204 has been accepted. There were 7 approving votes, 4 of which are binding:

- Jark Wu (binding)
- Jingsong Li (binding)
- Wenlong (non-binding)
- Yuan (non-binding)
- Godfrey (binding)
- Jing Zhang (binding)
- Zhang Mang (non-binding)

There are no disapproving votes.

Thanks everyone for feedback!

Best,
Jing Zhang

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
[2] https://lists.apache.org/thread/swsg22lshh0ts41w4h9z4b0cgtfyfgxd
[3] https://lists.apache.org/thread/pmwbvrr46co6j0dcv18okk6p4p6pq8ln


Re: Re: [VOTE] FLIP-204: Introduce Hash Lookup Join

2022-01-24 Thread Jing Zhang
+1 (binding)


Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Jing Zhang
Hi Jingsong,
Thanks for the feedback.

> Is there a conclusion about naming here? (Maybe I missed something?)
Use USE_HASH or some other names. Slightly confusing in the FLIP.

'SHUFFLE_HASH' is the final hint name; 'USE_HASH' was rejected. I've updated
the FLIP.

> And the problem of what to write inside the hint, as mentioned by Lincoln.

I agree with Lincoln to include only the 'build' side table name.
Besides, lookup join only supports the dimension table as the build table;
it cannot use the left input as the build table because a lookup join is
always triggered by the left side.

> I think maybe we can list the grammars of other distributed systems,
like Hive Spark(Databricks) Snowflake?

I added the grammars of other distributed systems (Oracle, Spark, Impala,
SQL Server) to the FLIP.

[1] Oracle USE_Hash hint
<https://docs.oracle.com/cd/B12037_01/server.101/b10752/hintsref.htm#5683>
SELECT /*+ USE_HASH(l h) */ *
  FROM orders h, order_items l
  WHERE l.order_id = h.order_id
AND l.order_id > 3500;


[2] Spark SHUFFLE_HASH hint
<https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-qry-select-hints.html>
SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;


[3] IMPALA SHUFFLE hint
<https://impala.apache.org/docs/build/html/topics/impala_hints.html>
SELECT straight_join weather.wind_velocity, geospatial.altitude
  FROM weather JOIN /* +SHUFFLE */ geospatial
  ON weather.lat = geospatial.lat AND weather.long = geospatial.long;


[4] SQL Server Hash Keyword
<https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15>
SELECT p.Name, pr.ProductReviewID FROM Production.Product AS p LEFT OUTER
HASH JOIN Production.ProductReview AS pr ON p.ProductID = pr.ProductID ORDER
 BY ProductReviewID DESC;


Hive does not have a similar grammar because shuffle join is the default
join behavior of Hive; it only has map join hint grammars.

I didn't find a similar query hint in Snowflake yet.


> About `SHUFFLE_HASH(left_table, right_table)`, one case can be shared:

SELECT * FROM left_t
  JOIN right_1 ON ...
  JOIN right_2 ON ...
  JOIN right_3 ON ...

What if we want to use shuffle_hash for all three joins?

SELECT /*+ SHUFFLE_HASH('left_t', 'right_1', 'right_2', 'right_3') */ ?

It does not work, because the left input of the second join is not
'left_t' anymore. It is the output of the first join.

Good point.
As mentioned before, the SHUFFLE_HASH hint now only requires specifying the
build table name.
So in the above case,
SELECT /*+ SHUFFLE_HASH('right_1', 'right_2', 'right_3') */
  * FROM left_t
  JOIN right_1 ON ...
  JOIN right_2 ON ...
  JOIN right_3 ON ...
It means requiring a shuffle for the lookup joins whose dimension table is
named 'right_1', 'right_2' or 'right_3'.

WDYT?
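For intuition on why a hash shuffle can raise the cache hit ratio of a lookup join — the motivation for SHUFFLE_HASH in this thread — here is a toy sketch in plain Python; this is not Flink code, and all names are made up for illustration:

```python
from collections import defaultdict

def route_by_hash(keys, parallelism):
    """Partition each key occurrence to a task via hash(key) % parallelism."""
    tasks = defaultdict(list)
    for k in keys:
        tasks[hash(k) % parallelism].append(k)
    return tasks

# 100 lookups over 8 distinct customer ids, spread over 4 parallel tasks.
keys = [f"cust_{i % 8}" for i in range(100)]
routed = route_by_hash(keys, parallelism=4)

# Every occurrence of a key lands on the same task, so the distinct keys
# are split disjointly across the per-task lookup caches: summing the
# distinct keys seen per task gives exactly 8. With arbitrary routing,
# each task's cache may instead have to hold (and repeatedly miss on)
# every one of the 8 keys.
assert sum(len(set(ks)) for ks in routed.values()) == 8
```

Each per-task cache then only warms up on its own share of the key space, which is the trade-off discussed earlier: a better hit ratio at the cost of an extra shuffle.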

Best,
Jing Zhang

Jingsong Li  于2022年1月20日周四 14:33写道:

> Hi Jing,
>
> Sorry for the late reply!
>
> Is there a conclusion about naming here? (Maybe I missed something?)
> Use USE_HASH or some other names. Slightly confusing in the FLIP.
>
> And the problem of what to write inside the hint, as mentioned by lincoln.
>
> I think maybe we can list the grammars of other distributed systems,
> like Hive Spark(Databricks) Snowflake?
>
> Best,
> Jingsong
>
> On Thu, Jan 20, 2022 at 1:56 PM Lincoln Lee 
> wrote:
> >
> > Hi, Jing,
> >Sorry for the late reply!  The previous discussion for the hint syntax
> > left a minor difference there: whether to use both sides of join table
> > names or just one 'build' side table name only. I would prefer the later
> > one.
> >  Users only need to pass the `build` side table(usually the smaller one)
> > into `SHUFFLE_HASH(build_table)` join hint, more concisely than
> > `SHUFFLE_HASH(left_table, right_table)`, WDYT?
> >
> > Best,
> > Lincoln Lee
> >
> >
> > Jing Zhang  于2022年1月15日周六 17:22写道:
> >
> > > Hi all,
> > > Thanks for all the feedback so far.
> > > If there is no more suggestions, I would like to drive a vote in
> Tuesday
> > > next week (18 Jan).
> > >
> > > Best,
> > > Jing Zhang
> > >
> > > Jing Zhang  于2022年1月5日周三 11:33写道:
> > >
> > > > Hi Francesco,
> > > > Thanks a lot for the feedback.
> > > >
> > > > > does it makes sense for a lookup join to use hash distribution
> whenever
> > > > is possible by default?
> > > > I prefer to enable the hash lookup join only find the hint in the
> query
> > > > for the following reason:
> > > > 1. Plan compatibility
> > > > Add a hash shuffle by default would leads to the change of plan
> after
> > > > users upgrade the flink version.
> > > > Besides, lookup join

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-20 Thread Jing Zhang
Hi Lincoln,

Thanks for the feedback.
> The previous discussion for the hint syntax
left a minor difference there: whether to use both sides of join table
names or just one 'build' side table name only. I would prefer the later
one.
 Users only need to pass the `build` side table(usually the smaller one)
into `SHUFFLE_HASH(build_table)` join hint, more concisely than
`SHUFFLE_HASH(left_table, right_table)`, WDYT?

Makes sense.
Besides, lookup join only supports the dimension table as the build table;
it cannot use the left input as the build table because a lookup join is
always triggered by the left side.
WDYT?

Best,
Jing Zhang



Jingsong Li  于2022年1月20日周四 15:09写道:

> Hi Jing,
>
> About `SHUFFLE_HASH(left_table, right_table)`, one case can be shared:
>
> SELECT * FROM left_t
>   JOIN right_1 ON ...
>   JOIN right_2 ON ...
>   JOIN right_3 ON ...
>
> What if we want to use shuffle_hash for all three joins?
>
> SELECT /*+ SHUFFLE_HASH('left_t', 'right_1', 'right_2', 'right_3') */ ?
>
> It does not work, because the left input of the second join is not
> 'left_t' anymore. It is the output of the first join.
>
> Best,
> Jingsong
>
> On Thu, Jan 20, 2022 at 2:33 PM Jingsong Li 
> wrote:
> >
> > Hi Jing,
> >
> > Sorry for the late reply!
> >
> > Is there a conclusion about naming here? (Maybe I missed something?)
> > Use USE_HASH or some other names. Slightly confusing in the FLIP.
> >
> > And the problem of what to write inside the hint, as mentioned by
> lincoln.
> >
> > I think maybe we can list the grammars of other distributed systems,
> > like Hive Spark(Databricks) Snowflake?
> >
> > Best,
> > Jingsong
> >
> > On Thu, Jan 20, 2022 at 1:56 PM Lincoln Lee 
> wrote:
> > >
> > > Hi, Jing,
> > >Sorry for the late reply!  The previous discussion for the hint
> syntax
> > > left a minor difference there: whether to use both sides of join table
> > > names or just one 'build' side table name only. I would prefer the
> later
> > > one.
> > >  Users only need to pass the `build` side table(usually the smaller
> one)
> > > into `SHUFFLE_HASH(build_table)` join hint, more concisely than
> > > `SHUFFLE_HASH(left_table, right_table)`, WDYT?
> > >
> > > Best,
> > > Lincoln Lee
> > >
> > >
> > > Jing Zhang  于2022年1月15日周六 17:22写道:
> > >
> > > > Hi all,
> > > > Thanks for all the feedback so far.
> > > > If there is no more suggestions, I would like to drive a vote in
> Tuesday
> > > > next week (18 Jan).
> > > >
> > > > Best,
> > > > Jing Zhang
> > > >
> > > > Jing Zhang  于2022年1月5日周三 11:33写道:
> > > >
> > > > > Hi Francesco,
> > > > > Thanks a lot for the feedback.
> > > > >
> > > > > > does it makes sense for a lookup join to use hash distribution
> whenever
> > > > > is possible by default?
> > > > > I prefer to enable the hash lookup join only find the hint in the
> query
> > > > > for the following reason:
> > > > > 1. Plan compatibility
> > > > > Add a hash shuffle by default would leads to the change of
> plan after
> > > > > users upgrade the flink version.
> > > > > Besides, lookup join is commonly used feature in flink SQL.
> > > > > 2. Not all flink jobs could benefit from this improvement.
> > > > > It is a trade off for the lookup join with dimension
> connectors which
> > > > > has cache inside.
> > > > > We hope the raise the cache hit ratio by Hash Lookup Join,
> however it
> > > > > would leads to an extra shuffle at the same time.
> > > > > It is not always a positive optimization, especially for the
> > > > > connectors which does not have cache inside.
> > > > >
> > > > > > Shouldn't the hint take the table alias as the "table name"?
> What if
> > > > > you do two lookup joins in cascade within the same query with the
> same
> > > > > table (once
> > > > > on a key, then on another one), where you use two different
> aliases for
> > > > > the table?
> > > > > In theory, it's better to support both table names and alias names.
> > > > > But in calcite, the alias name of subquery or table would not be
> lost in
> > > > > the sql conversion phase and sql optimization phase.
> > > > > So her

[VOTE] FLIP-204: Introduce Hash Lookup Join

2022-01-19 Thread Jing Zhang
Hi community,

I'd like to start a vote on FLIP-204: Introduce Hash Lookup Join [1] which
has been discussed in the thread [2].

The vote will be open for at least 72 hours unless there is an objection or
not enough votes.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
[2] https://lists.apache.org/thread/swsg22lshh0ts41w4h9z4b0cgtfyfgxd

Best,
Jing Zhang


Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-15 Thread Jing Zhang
Hi all,
Thanks for all the feedback so far.
If there are no more suggestions, I would like to start a vote on Tuesday
next week (18 Jan).

Best,
Jing Zhang

Jing Zhang  于2022年1月5日周三 11:33写道:

> Hi Francesco,
> Thanks a lot for the feedback.
>
> > does it makes sense for a lookup join to use hash distribution whenever
> is possible by default?
> I prefer to enable the hash lookup join only when the hint is found in the
> query, for the following reasons:
> 1. Plan compatibility
> Adding a hash shuffle by default would lead to plan changes after
> users upgrade the Flink version.
> Besides, lookup join is a commonly used feature in Flink SQL.
> 2. Not all flink jobs could benefit from this improvement.
> It is a trade off for the lookup join with dimension connectors which
> has cache inside.
> We hope to raise the cache hit ratio with hash lookup join; however, it
> would lead to an extra shuffle at the same time.
> It is not always a positive optimization, especially for the
> connectors which do not have a cache inside.
>
> > Shouldn't the hint take the table alias as the "table name"?  What if
> you do two lookup joins in cascade within the same query with the same
> table (once
> on a key, then on another one), where you use two different aliases for
> the table?
> In theory, it's better to support both table names and alias names.
> But in Calcite, the alias name of a subquery or table would be lost in
> the SQL conversion phase and SQL optimization phase.
> So here we only support table names.
>
> Best,
> Jing Zhang
>
>
> Francesco Guardiani  于2022年1月3日周一 18:38写道:
>
>> Hi Jing,
>>
>> Thanks for the FLIP. I'm not very knowledgeable about the topic, but going
>> through both the FLIP and the discussion here, I wonder, does it makes
>> sense for a lookup join to use hash distribution whenever is possible by
>> default?
>>
>> The point you're explaining here:
>>
>> > Many Lookup table sources introduce cache in order
>> to reduce the RPC call, such as JDBC, CSV, HBase connectors.
>> For those connectors, we could raise cache hit ratio by routing the same
>> lookup keys to the same task instance
>>
>> Seems something we can infer automatically, rather than manually asking
>> the
>> user to add this hint to the query. Note that I'm not talking against the
>> hint syntax, which might still make sense to be introduced, but I feel
>> like
>> this optimization makes sense in the general case when using the
>> connectors
>> you have quoted. Perhaps there is some downside I'm not aware of?
>>
>> Talking about the hint themselves, taking this example as reference:
>>
>> SELECT /*+ SHUFFLE_HASH('Orders', 'Customers') */ o.order_id, o.total,
>> c.country, c.zip
>> FROM Orders AS o
>> JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
>> ON o.customer_id = c.id;
>>
>> Shouldn't the hint take the table alias as the "table name"? What If you
>> do
>> two lookup joins in cascade within the same query with the same table
>> (once
>> on a key, then on another one), where you use two different aliases for
>> the
>> table?
>>
>>
>> On Fri, Dec 31, 2021 at 9:56 AM Jing Zhang  wrote:
>>
>> > Hi Lincoln,
>> > Thanks for the feedback.
>> >
>> > > 1. For the hint name, +1 for WenLong's proposal.
>> >
>> > I've added add 'SHUFFLE_HASH' to other alternatives in FLIP. Let's
>> waiting
>> > for more voices here.
>> >
>> > > Regarding the `SKEW` hint, agree with you that it can be used widely,
>> and
>> > I
>> > prefer to treat it as a metadata hint, a new category differs from a
>> join
>> > hint.
>> > For your example:
>> > ```
>> > SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */
>> o.order_id,
>> > o.total, c.country, c.zip
>> > FROM Orders AS o
>> > JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
>> > ON o.customer_id = c.id;
>> > ```
>> > I would prefer another form:
>> > ```
>> > -- provide the skew info to let the engine choose the optimal plan
>> > SELECT /*+ SKEW('Orders') */ o.order_id, ...
>> >
>> > -- or introduce a new hint for the join case, e.g.,
>> > SELECT /*+ REPLICATED_SHUFFLE_HASH('Orders') */ o.order_id, ...
>> > ```
>> >
>> > Maybe there is misunderstanding here.
>> > I just use a syntax sugar here.
>> >
>> > SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */
>&

[jira] [Created] (FLINK-25645) UnsupportedOperationException would thrown out when hash shuffle by a field with array type

2022-01-13 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25645:
--

 Summary: UnsupportedOperationException would thrown out when hash 
shuffle by a field with array type
 Key: FLINK-25645
 URL: https://issues.apache.org/jira/browse/FLINK-25645
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang
 Attachments: image-2022-01-13-19-12-40-756.png, 
image-2022-01-13-19-15-28-395.png

Currently, an array type is not supported as a hash shuffle key because CodeGen 
does not support it yet.
 !image-2022-01-13-19-15-28-395.png! 

An UnsupportedOperationException is thrown when hash shuffling by a field of 
array type:
 !image-2022-01-13-19-12-40-756.png! 
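
A minimal query shape that hits this limitation might look like the following 
(table and column names are hypothetical, not taken from the reported job):

```sql
-- Hypothetical repro: grouping by an ARRAY column forces a hash shuffle
-- keyed on `arr`, which the code generator cannot handle yet.
CREATE TABLE t (
  arr ARRAY<INT>,
  v   INT
) WITH ('connector' = 'datagen');

SELECT arr, SUM(v) FROM t GROUP BY arr;
```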



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25641) Unexpected aggregate plan after load hive module

2022-01-13 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25641:
--

 Summary: Unexpected aggregate plan after load hive module
 Key: FLINK-25641
 URL: https://issues.apache.org/jira/browse/FLINK-25641
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang
 Attachments: image-2022-01-13-15-52-27-783.png, 
image-2022-01-13-15-55-40-958.png

When we use Flink batch SQL to run Hive SQL queries, we load the hive module to 
use Hive built-in functions.
However, some query plans are unexpected after loading the hive module.
For the following sql,

{code:sql}
load module hive;
use modules hive,core;
set table.sql-dialect=hive;

select
  account_id,
  sum(impression)
from test_db.test_table where dt = '2022-01-10' and hi = '0100' group by 
account_id
{code}
The plan is:

 !image-2022-01-13-15-55-40-958.png! 

After removing 'load module hive; use modules hive, core;', the plan is:

 !image-2022-01-13-15-52-27-783.png! 

After loading hive modules, hash aggregate is not chosen in the final plan 
because the aggregate buffer is not fixed-length; its type is as follows.
{code:java}
LEGACY('RAW', 
'ANY')
{code}






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25605) Batch get statistics of multiple partitions instead of get one by one

2022-01-11 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25605:
--

 Summary: Batch get statistics of multiple partitions instead of 
get one by one
 Key: FLINK-25605
 URL: https://issues.apache.org/jira/browse/FLINK-25605
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang
 Attachments: image-2022-01-11-15-59-55-894.png, 
image-2022-01-11-16-00-28-002.png

Currently, `PushPartitionIntoTableSourceScanRule` fetches statistics of 
matched partitions one by one.
 !image-2022-01-11-15-59-55-894.png! 
If there are many matched partitions, it takes a long time to fetch all the 
statistics.
We could improve this by fetching statistics for multiple partitions in one 
batch.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25604) Remove useless aggregate function

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25604:
--

 Summary: Remove useless aggregate function 
 Key: FLINK-25604
 URL: https://issues.apache.org/jira/browse/FLINK-25604
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang


We expect useless aggregate calls to be removed after projection push-down.
But sometimes the plan is unexpected. For example,

{code:sql}
SELECT 
  d
FROM (
  SELECT
d,
c,
row_number() OVER (PARTITION BY d ORDER BY e desc)  review_rank
  FROM (
SELECT e, d, max(f) AS c FROM Table5 GROUP BY e, d)
  )
WHERE review_rank = 1
{code}

The plan is 

{code:java}
Calc(select=[d], where=[=(w0$o0, 1:BIGINT)])
+- OverAggregate(partitionBy=[d], orderBy=[e DESC], window#0=[ROW_NUMBER(*) AS 
w0$o0 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW], select=[e, d, c, 
w0$o0])
   +- Sort(orderBy=[d ASC, e DESC])
  +- Exchange(distribution=[hash[d]])
 +- HashAggregate(isMerge=[true], groupBy=[e, d], select=[e, d, 
Final_MAX(max$0) AS c])
+- Exchange(distribution=[hash[e, d]])
   +- LocalHashAggregate(groupBy=[e, d], select=[e, d, 
Partial_MAX(f) AS max$0])
  +- Calc(select=[e, d, f])
 +- BoundedStreamScan(table=[[default_catalog, 
default_database, Table5]], fields=[d, e, f, g, h])
{code}

In the above SQL, max(f) (aliased as c) could be removed because c is projected 
out before the sink.
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25596) Specify hash/sortmerge join in SQL hint

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25596:
--

 Summary: Specify hash/sortmerge join in SQL hint
 Key: FLINK-25596
 URL: https://issues.apache.org/jira/browse/FLINK-25596
 Project: Flink
  Issue Type: Sub-task
Reporter: Jing Zhang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25595) Specify hash/sort aggregate strategy in SQL hint

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25595:
--

 Summary: Specify hash/sort aggregate strategy in SQL hint
 Key: FLINK-25595
 URL: https://issues.apache.org/jira/browse/FLINK-25595
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25594) Take parquet metadata into consideration when source is parquet files

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25594:
--

 Summary: Take parquet metadata into consideration when source is 
parquet files
 Key: FLINK-25594
 URL: https://issues.apache.org/jira/browse/FLINK-25594
 Project: Flink
  Issue Type: Sub-task
Reporter: Jing Zhang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25593) A redundant scan could be skipped if it is an input of join and the other input is empty after partition prune

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25593:
--

 Summary: A redundant scan could be skipped if it is an input of 
join and the other input is empty after partition prune
 Key: FLINK-25593
 URL: https://issues.apache.org/jira/browse/FLINK-25593
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Jing Zhang


A redundant scan could be skipped if it is an input of a join and the other 
input is empty after partition pruning.
For example:
ltable has two partitions: pt=0 and pt=1; rtable has one partition: pt1=0.
The schema of ltable is (lkey string, value int).
The schema of rtable is (rkey string, value int).

{code:java}
SELECT * FROM ltable, rtable WHERE pt=2 and pt1=0 and `lkey`=rkey
{code}

The plan is as follows.

{code:java}
Calc(select=[lkey, value, CAST(2 AS INTEGER) AS pt, rkey, value1, CAST(0 AS 
INTEGER) AS pt1])
+- HashJoin(joinType=[InnerJoin], where=[=(lkey, rkey)], select=[lkey, value, 
rkey, value1], build=[right])
   :- Exchange(distribution=[hash[lkey]])
   :  +- TableSourceScan(table=[[hive, source_db, ltable, partitions=[], 
project=[lkey, value]]], fields=[lkey, value])
   +- Exchange(distribution=[hash[rkey]])
  +- TableSourceScan(table=[[hive, source_db, rtable, partitions=[{pt1=0}], 
project=[rkey, value1]]], fields=[rkey, value1])
{code}

There is no need to scan the right side because the left input of the join has 
0 partitions after partition pruning.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25592) Improvement of parser, optimizer and execution for Flink Batch SQL

2022-01-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25592:
--

 Summary: Improvement of parser, optimizer and execution for Flink 
Batch SQL
 Key: FLINK-25592
 URL: https://issues.apache.org/jira/browse/FLINK-25592
 Project: Flink
  Issue Type: Improvement
Reporter: Jing Zhang


This is a parent JIRA to track all improvements on Flink Batch SQL, including 
parser, optimizer and execution.
For example,
1. the same SQL query may be translated into different plans under the Hive 
dialect and the default dialect
2. specify hash/sort aggregate strategy and hash/sort-merge join strategy in 
SQL hints
3. take parquet metadata into consideration in optimization
4. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Re: [DISCUSS] Introduce Hash Lookup Join

2022-01-04 Thread Jing Zhang
Hi Francesco,
Thanks a lot for the feedback.

> does it make sense for a lookup join to use hash distribution whenever
> possible by default?
I prefer to enable hash lookup join only when the hint is found in the query,
for the following reasons:
1. Plan compatibility
Adding a hash shuffle by default would lead to a change of plan after
users upgrade their Flink version. Besides, lookup join is a commonly used
feature in Flink SQL, so many jobs would be affected.
2. Not all Flink jobs could benefit from this improvement.
It is a trade-off for lookup joins with dimension connectors which
have a cache inside.
We hope to raise the cache hit ratio with Hash Lookup Join, but it
introduces an extra shuffle at the same time.
It is not always a positive optimization, especially for connectors
which do not have a cache inside.

> Shouldn't the hint take the table alias as the "table name"?  What if you
do two lookup joins in cascade within the same query with the same table
(once
on a key, then on another one), where you use two different aliases for the
table?
In theory, it would be better to support both table names and alias names.
But in Calcite, the alias name of a subquery or table would be lost in the
SQL conversion and optimization phases.
So here we only support table names.
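
To sketch the ambiguity Francesco raised (the second join key and alias names
below are illustrative, not from the FLIP): with only table names supported, a
hint on a table joined twice under different aliases cannot target just one of
the lookups.

```sql
SELECT /*+ USE_HASH('Customers') */ o.order_id, c1.country, c2.country
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c1
  ON o.customer_id = c1.id
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c2
  ON o.referrer_id = c2.id;  -- hypothetical second lookup key
-- The hint applies to both lookups on 'Customers'; the aliases c1/c2
-- cannot be used to distinguish them.
```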

Best,
Jing Zhang


Francesco Guardiani wrote on Mon, Jan 3, 2022 at 18:38:

> Hi Jing,
>
> Thanks for the FLIP. I'm not very knowledgeable about the topic, but going
> through both the FLIP and the discussion here, I wonder: does it make
> sense for a lookup join to use hash distribution whenever possible by
> default?
>
> The point you're explaining here:
>
> > Many Lookup table sources introduce cache in order
> to reduce the RPC call, such as JDBC, CSV, HBase connectors.
> For those connectors, we could raise cache hit ratio by routing the same
> lookup keys to the same task instance
>
> Seems something we can infer automatically, rather than manually asking the
> user to add this hint to the query. Note that I'm not talking against the
> hint syntax, which might still make sense to be introduced, but I feel like
> this optimization makes sense in the general case when using the connectors
> you have quoted. Perhaps there is some downside I'm not aware of?
>
> Talking about the hint themselves, taking this example as reference:
>
> SELECT /*+ SHUFFLE_HASH('Orders', 'Customers') */ o.order_id, o.total,
> c.country, c.zip
> FROM Orders AS o
> JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
> ON o.customer_id = c.id;
>
> Shouldn't the hint take the table alias as the "table name"? What if you do
> two lookup joins in cascade within the same query with the same table (once
> on a key, then on another one), where you use two different aliases for the
> table?
>
>
> On Fri, Dec 31, 2021 at 9:56 AM Jing Zhang  wrote:
>
> > Hi Lincoln,
> > Thanks for the feedback.
> >
> > > 1. For the hint name, +1 for WenLong's proposal.
> >
> > I've added 'SHUFFLE_HASH' to the other alternatives in the FLIP. Let's
> > wait for more voices here.
> >
> > > Regarding the `SKEW` hint, agree with you that it can be used widely,
> and
> > I
> > prefer to treat it as a metadata hint, a new category differs from a join
> > hint.
> > For your example:
> > ```
> > SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
> > o.total, c.country, c.zip
> > FROM Orders AS o
> > JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
> > ON o.customer_id = c.id;
> > ```
> > I would prefer another form:
> > ```
> > -- provide the skew info to let the engine choose the optimal plan
> > SELECT /*+ SKEW('Orders') */ o.order_id, ...
> >
> > -- or introduce a new hint for the join case, e.g.,
> > SELECT /*+ REPLICATED_SHUFFLE_HASH('Orders') */ o.order_id, ...
> > ```
> >
> > Maybe there is a misunderstanding here.
> > I just used syntactic sugar here.
> >
> > SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
> > 
> >
> > is just syntactic sugar for
> >
> > SELECT /*+ USE_HASH('Orders', 'Customers') */ /*+SKEW('Orders') */
> > o.order_id,
> > 
> >
> > Although I list the 'USE_HASH' and 'SKEW' hints in one query hint clause,
> > it does not mean they must appear together as a whole.
> > Based on the Calcite syntax doc [1], you can list more than one hint in
> > a `/*+ hint [, hint ]* */` clause.
> >
> > Each hint has a different function.
> > The 'USE_HASH' hint suggests the optimizer use a hash partitioner for Lookup
> > Join for table 'Orders' and table 'Customers' while the 'SKEW' hint tells
> > the optimizer t

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-31 Thread Jing Zhang
Hi Lincoln,
Thanks for the feedback.

> 1. For the hint name, +1 for WenLong's proposal.

I've added 'SHUFFLE_HASH' to the other alternatives in the FLIP. Let's wait
for more voices here.

> Regarding the `SKEW` hint, agree with you that it can be used widely, and
I
prefer to treat it as a metadata hint, a new category differs from a join
hint.
For your example:
```
SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;
```
I would prefer another form:
```
-- provide the skew info to let the engine choose the optimal plan
SELECT /*+ SKEW('Orders') */ o.order_id, ...

-- or introduce a new hint for the join case, e.g.,
SELECT /*+ REPLICATED_SHUFFLE_HASH('Orders') */ o.order_id, ...
```

Maybe there is a misunderstanding here.
I just used syntactic sugar here.

SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,


is just syntactic sugar for

SELECT /*+ USE_HASH('Orders', 'Customers') */ /*+SKEW('Orders') */
o.order_id,


Although I list the 'USE_HASH' and 'SKEW' hints in one query hint clause, it
does not mean they must appear together as a whole.
Based on the Calcite syntax doc [1], you can list more than one hint in
a `/*+ hint [, hint ]* */` clause.

Each hint has a different function.
The 'USE_HASH' hint suggests that the optimizer use a hash partitioner for the
lookup join on tables 'Orders' and 'Customers', while the 'SKEW' hint tells
the optimizer the skew metadata about the table 'Orders'.

Best,
Jing Zhang

[1] https://calcite.apache.org/docs/reference.html#sql-hints




Jing Zhang wrote on Fri, Dec 31, 2021 at 16:39:

> Hi Martijn,
> Thanks for the feedback.
>
> Glad to hear that we reached a consensus on the first and second point.
>
> About whether to use `use_hash` as a term, I think your concern makes
> sense.
> Although hash lookup join is similar to hash join in Oracle in that both
> require hash distribution on the input, there are some differences
> between them.
> About this point, Lincoln and WenLong both prefer the term 'SHUFFLE_HASH',
> WDYT?
>
> Best,
> Jing Zhang
>
>
> Lincoln Lee wrote on Thu, Dec 30, 2021 at 11:21:
>
>> Hi Jing,
>> Thanks for your explanation!
>>
>> 1. For the hint name, +1 for WenLong's proposal. I think the `SHUFFLE`
>> keyword is important in a classic distributed computing system,
>> a hash-join usually means there's a shuffle stage(include shuffle
>> hash-join, broadcast hash-join). Users only need to pass the `build` side
>> table(usually the smaller one) into `SHUFFLE_HASH` join hint, more
>> concisely than `USE_HASH(left_table, right_table)`. Please correct me if
>> my
>> understanding is wrong.
>> Regarding the `SKEW` hint, agree with you that it can be used widely, and
>> I
>> prefer to treat it as a metadata hint, a new category differs from a join
>> hint.
>> For your example:
>> ```
>> SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
>> o.total, c.country, c.zip
>> FROM Orders AS o
>> JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
>> ON o.customer_id = c.id;
>> ```
>> I would prefer another form:
>> ```
>> -- provide the skew info to let the engine choose the optimal plan
>> SELECT /*+ SKEW('Orders') */ o.order_id, ...
>>
>> -- or introduce a new hint for the join case, e.g.,
>> SELECT /*+ REPLICATED_SHUFFLE_HASH('Orders') */ o.order_id, ...
>> ```
>>
>> 2. Agree with Martin adding the feature to 1.16, we need time to complete
>> the change in calcite and also the upgrading work.
>>
>> 3. I misunderstood the 'Other Alternatives' part as the 'Rejected' ones in
>> the FLIP doc. And my point is avoiding the hacky way with our best effort.
>> The potential issues for calcite's hint propagation, e.g., join hints
>> correctly propagate into proper join scope include subquery or views which
>> may have various sql operators, so we should check all kinds of operators
>> for the correct propagation. Hope this may help. And also cc @Shuo Cheng
>> may
>> offer more help.
>>
>>
>> Best,
>> Lincoln Lee
>>
>>
>> Martijn Visser wrote on Wed, Dec 29, 2021 at 22:21:
>>
>> > Hi Jing,
>> >
>> > Thanks for explaining this in more detail and also to others
>> > participating.
>> >
>> > > I think using query hints in this case is more natural for users,
>> WDYT?
>> >
>> > Yes, I agree. As long as we properly explain in our documentation that
>> we
>> > support both Query Hints and Table Hints, what's the difference between
>> > them and how to use them, I

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-31 Thread Jing Zhang
Hi Martijn,
Thanks for the feedback.

Glad to hear that we reached a consensus on the first and second point.

About whether to use `use_hash` as a term, I think your concern makes sense.
Although hash lookup join is similar to hash join in Oracle in that both
require hash distribution on the input, there are some differences
between them.
About this point, Lincoln and WenLong both prefer the term 'SHUFFLE_HASH',
WDYT?

Best,
Jing Zhang


Lincoln Lee wrote on Thu, Dec 30, 2021 at 11:21:

> Hi Jing,
> Thanks for your explanation!
>
> 1. For the hint name, +1 for WenLong's proposal. I think the `SHUFFLE`
> keyword is important in a classic distributed computing system,
> a hash-join usually means there's a shuffle stage(include shuffle
> hash-join, broadcast hash-join). Users only need to pass the `build` side
> table(usually the smaller one) into `SHUFFLE_HASH` join hint, more
> concisely than `USE_HASH(left_table, right_table)`. Please correct me if my
> understanding is wrong.
> Regarding the `SKEW` hint, agree with you that it can be used widely, and I
> prefer to treat it as a metadata hint, a new category differs from a join
> hint.
> For your example:
> ```
> SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
> o.total, c.country, c.zip
> FROM Orders AS o
> JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
> ON o.customer_id = c.id;
> ```
> I would prefer another form:
> ```
> -- provide the skew info to let the engine choose the optimal plan
> SELECT /*+ SKEW('Orders') */ o.order_id, ...
>
> -- or introduce a new hint for the join case, e.g.,
> SELECT /*+ REPLICATED_SHUFFLE_HASH('Orders') */ o.order_id, ...
> ```
>
> 2. Agree with Martin adding the feature to 1.16, we need time to complete
> the change in calcite and also the upgrading work.
>
> 3. I misunderstood the 'Other Alternatives' part as the 'Rejected' ones in
> the FLIP doc. And my point is avoiding the hacky way with our best effort.
> The potential issues for calcite's hint propagation, e.g., join hints
> correctly propagate into proper join scope include subquery or views which
> may have various sql operators, so we should check all kinds of operators
> for the correct propagation. Hope this may help. And also cc @Shuo Cheng
> may
> offer more help.
>
>
> Best,
> Lincoln Lee
>
>
> Martijn Visser wrote on Wed, Dec 29, 2021 at 22:21:
>
> > Hi Jing,
> >
> > Thanks for explaining this in more detail and also to others
> > participating.
> >
> > > I think using query hints in this case is more natural for users, WDYT?
> >
> > Yes, I agree. As long as we properly explain in our documentation that we
> > support both Query Hints and Table Hints, what's the difference between
> > them and how to use them, I think our users can understand this
> perfectly.
> >
> > > I admit upgrading from Calcite 1.26 to 1.30 would be a big change.
> > However we could not always avoid upgrade for the following reason
> >
> > We have to upgrade Calcite. We actually considered putting that in the
> > Flink 1.15 scope but ultimately had to drop it, but I definitely think
> this
> > needs to be done for 1.16. It's not only because of new features that are
> > depending on Calcite upgrades, but also because newer versions have
> > resolved bugs that also hurt our users. That's why we also already have
> > tickets for upgrading to Calcite 1.27 [1] and 1.28 [2].
> >
> > With regards to using `use_hash` as a term, I think the most important
> part
> > is that if we re-use a term like Oracle is using, is that the behaviour
> and
> > outcome should be the same/comparable to the one from (in this case)
> > Oracle. If their behaviour and outcome are not the same or comparable, I
> > would probably introduce our own term to avoid that users get confused.
> >
> > Best regards,
> >
> > Martijn
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-20873
> > [2] https://issues.apache.org/jira/browse/FLINK-21239
> >
> > On Wed, 29 Dec 2021 at 14:18, Jing Zhang  wrote:
> >
> > > Hi Jian gang,
> > > Thanks for the feedback.
> > >
> > > > When it comes to hive, how do you load partial data instead of the
> > >whole data? Any change related with hive?
> > >
> > > The question is same as Yuan mentioned before.
> > > I prefer to drive another FLIP on this topic to further discussion
> > > individually because this point involves many extension on API.
> > > Here I would like to share the implementation in our internal version
> > > firstly, it maybe very different with the final soluti

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Jing Zhang
Hi Jian gang,
Thanks for the feedback.

> When it comes to hive, how do you load partial data instead of the
   whole data? Any change related with hive?

This question is the same as the one Yuan mentioned before.
I prefer to drive another FLIP on this topic for further discussion
because this point involves many extensions to the API.
Here I would first like to share the implementation in our internal version;
it may be very different from the final solution merged into the community.
The core idea is to push the partitioner information down to the lookup table
source.
The Hive connector also needs upgrades. When loading data into caches, each
task only stores records whose lookup keys are routed to that task.

> How to define the cache configuration? For example, the size and the ttl.

I'm afraid there is no unified caching configuration or cache implementation
across different connectors yet.
You could find the cache size and TTL config of JDBC in doc [1] and HBase in
doc [2].
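
For example, the JDBC connector's cache options from [1] are set in the table
DDL; a sketch (the connection options are placeholders):

```sql
CREATE TABLE Customers (
  id      INT,
  country STRING,
  zip     STRING
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydb',  -- placeholder
  'table-name' = 'customers',                  -- placeholder
  'lookup.cache.max-rows' = '10000',  -- max cached rows per task
  'lookup.cache.ttl' = '10min'        -- time-to-live of a cache entry
);
```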

>  Will this feature add another shuffle phase compared with the default
   behavior? In what situations will user choose this feature?

Yes, if the user specifies the hash hint in a query, the optimizer would
prefer Hash Lookup Join, which adds a hash shuffle.
If the lookup table source has a cache inside (for example HBase/JDBC) and
the benefit of the increased cache hit ratio outweighs the cost of the extra
shuffle, the user could use Hash Lookup Join.
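
For example, with the USE_HASH form discussed in this thread, such a query
could be written as:

```sql
SELECT /*+ USE_HASH('Orders', 'Customers') */ o.order_id, o.total,
       c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;
```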

>  For the keys, the default implementation will be ok. But I wonder
whether we can support more flexible strategies.

This question is the same as the one Yuan mentioned before.

I'm afraid there is no plan to support flexible strategies yet because the
feature involves many things, for example:
1. sql syntax
2. user defined partitioner API
3. RelDistribution type extension and Flink RelDistribution extension
4. FlinkExpandConversionRule
5. Exchange execNode extension
6. 
It needs careful design and more discussion. If this is a strong
requirement, we would drive another discussion on this point individually.
In this FLIP, I would first support hash shuffle. WDYT?

Best,
Jing Zhang

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/#connector-options
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/hbase/#connector-options

Jing Zhang wrote on Wed, Dec 29, 2021 at 20:37:

> Hi Wenlong,
> Thanks for the feedback.
> I've checked similar syntax in other systems; they are all different from
> each other. There seems to be no consensus.
> As mentioned in FLIP-204, oracle uses a query hint, the hint name is
> 'use_hash' [1].
> Spark also uses a query hint, its name is 'SHUFFLE_HASH' [2].
> SQL Server uses keyword 'HASH' instead of query hint [3].
> Note that the purposes of hash shuffle in [1][2][3] are a little different
> from that of FLIP-204; we only discuss syntax here.
>
> I've added this part to the FLIP, pending further discussion.
>
> Best,
> Jing Zhang
>
> [1]
> https://docs.oracle.com/cd/B12037_01/server.101/b10752/hintsref.htm#5683
> [2]
> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html
> [3]
> https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15
>
>
> wenlong.lwl wrote on Wed, Dec 29, 2021 at 17:18:
>
>> Hi, Jing, thanks for driving the discussion.
>>
>> Have you made some investigation on the syntax of join hint?
>> Why did you choose USE_HASH from Oracle instead of Spark's style,
>> SHUFFLE_HASH? They are quite different.
>> People in the big data world may be more familiar with Spark/Hive; if we
>> need to choose one, I personally prefer the style of Spark.
>>
>>
>> Best,
>> Wenlong
>>
>> On Wed, 29 Dec 2021 at 16:48, zst...@163.com  wrote:
>>
>> >
>> >
>> >
>> > Hi Jing,
>> > Thanks for your detailed reply.
>> > 1) In the last suggestion, hashing by primary key is not used for raising
>> > the cache hit ratio, but for handling skew of the left source. Now that
>> > you have the 'skew' hint and other discussion about it, I'm looking
>> > forward to it.
>> > 2) I mean to support a user-defined partitioner function. We have a case
>> > of joining a data lake source with a special way of partitioning, and
>> > have implemented it inelegantly in our internal version. As you said, it
>> > needs more design.
>> > 3) I think the so-called 'HashPartitionedCache' is useful; otherwise
>> > loading all data, e.g. for a Hive lookup table source, is almost
>> > infeasible in big data.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Best regards,
>> > Yuan
>> >
>> >
>> >
>> >
>> >
>

Re: Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-29 Thread Jing Zhang
 Hi Wenlong,
Thanks for the feedback.
I've checked similar syntax in other systems; they are all different from
each other. There seems to be no consensus.
As mentioned in FLIP-204, oracle uses a query hint, the hint name is
'use_hash' [1].
Spark also uses a query hint, its name is 'SHUFFLE_HASH' [2].
SQL Server uses keyword 'HASH' instead of query hint [3].
Note that the purposes of hash shuffle in [1][2][3] are a little different
from that of FLIP-204; we only discuss syntax here.

I've added this part to the FLIP, pending further discussion.

Best,
Jing Zhang

[1] https://docs.oracle.com/cd/B12037_01/server.101/b10752/hintsref.htm#5683
[2] https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html
[3]
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15


wenlong.lwl wrote on Wed, Dec 29, 2021 at 17:18:

> Hi, Jing, thanks for driving the discussion.
>
> Have you made some investigation on the syntax of join hint?
> Why did you choose USE_HASH from Oracle instead of Spark's style,
> SHUFFLE_HASH? They are quite different.
> People in the big data world may be more familiar with Spark/Hive; if we
> need to choose one, I personally prefer the style of Spark.
>
>
> Best,
> Wenlong
>
> On Wed, 29 Dec 2021 at 16:48, zst...@163.com  wrote:
>
> >
> >
> >
> > Hi Jing,
> > Thanks for your detailed reply.
> > 1) In the last suggestion, hashing by primary key is not used for raising
> > the cache hit ratio, but for handling skew of the left source. Now that
> > you have the 'skew' hint and other discussion about it, I'm looking
> > forward to it.
> > 2) I mean to support a user-defined partitioner function. We have a case
> > of joining a data lake source with a special way of partitioning, and
> > have implemented it inelegantly in our internal version. As you said, it
> > needs more design.
> > 3) I think the so-called 'HashPartitionedCache' is useful; otherwise
> > loading all data, e.g. for a Hive lookup table source, is almost
> > infeasible in big data.
> >
> >
> >
> >
> >
> >
> >
> > Best regards,
> > Yuan
> >
> >
> >
> >
> >
> >
> >
> >
> > On 2021-12-29 14:52:11, "Jing Zhang" wrote:
> > >Hi, Lincoln
> > >Thanks a lot for the feedback.
> > >
> > >>  Regarding the hint name ‘USE_HASH’, could we consider more
> candidates?
> > >Things are a little different from RDBMS in the distributed world, and
> we
> > >also aim to solve the data skew problem, so all these incoming hints
> names
> > >should be considered together.
> > >
> > >About skew problem, I would discuss this in next FLIP individually. I
> > would
> > >like to share hint proposal for skew here.
> > >I want to introduce 'skew' hint which is a query hint, similar with skew
> > >hint in spark [1] and MaxCompute[2].
> > >The 'skew' hint could only contain the name of the table with skew.
> > >Besides, skew hint could accept table name and column names.
> > >In addition, skew hint could accept table name, column names and skew
> > >values.
> > >For example:
> > >
> > >SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */
> o.order_id,
> > >o.total, c.country, c.zip
> > >FROM Orders AS o
> > >JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
> > >ON o.customer_id = c.id;
> > >
> > >The 'skew' hint is not only used for look up join here, but also could
> be
> > >used for other types of join later, for example, batch hash join or
> > >streaming regular join.
> > >Go back to better name problem for hash look up join. Since the 'skew'
> > hint
> > >is a separate hint, so 'use_hash' is still an alternative.
> > >WDYT?
> > >I don't have a good idea about the better hint name yet. I would like to
> > >heard more suggestions about hint names.
> > >
> > >>  As you mentioned in the flip, this solution depends on future changes
> > to
> > >calcite (and also upgrading calcite would be another possible big
> change:
> > >at least calicite-1.30 vs 1.26, are we preparing to accept this big
> > >change?).
> > >
> > >Indeed, solution 1 depends on calcite upgrade.
> > >I admit upgrade from Calcite 1.26 to 1.30 would be a big change. I still
> > >remember what we have suffered from last upgrade to Calcite 1.26.
> > >However we could not always avoid upgrade for the following reason:
> > >1. Other features also depends on the Calcite upgrade. For example,
> > Session

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi, Lincoln
Thanks a lot for the feedback.

>  Regarding the hint name ‘USE_HASH’, could we consider more candidates?
Things are a little different from RDBMS in the distributed world, and we
also aim to solve the data skew problem, so all these incoming hints names
should be considered together.

About the skew problem, I would discuss this in the next FLIP individually.
I would like to share the hint proposal for skew here.
I want to introduce a 'skew' hint, which is a query hint similar to the skew
hint in Spark [1] and MaxCompute [2].
The 'skew' hint could contain only the name of the skewed table.
Besides, the skew hint could accept a table name and column names.
In addition, it could accept a table name, column names and skew values.
For example:

SELECT /*+ USE_HASH('Orders', 'Customers'), SKEW('Orders') */ o.order_id,
o.total, c.country, c.zip
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;
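
The three accepted argument shapes could then look like this (the exact
argument syntax, column name and skew values below are hypothetical and not
finalized):

```sql
-- 1) table name only
SELECT /*+ SKEW('Orders') */ o.order_id, ...

-- 2) table name plus the skewed column(s)
SELECT /*+ SKEW('Orders', 'customer_id') */ o.order_id, ...

-- 3) table name, column(s) and the hot values of those columns
SELECT /*+ SKEW('Orders', 'customer_id', 'cust_1', 'cust_2') */ o.order_id, ...
```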

The 'skew' hint is not only used for lookup join here; it could also be used
for other types of join later, for example batch hash join or streaming
regular join.
Going back to the naming problem for hash lookup join: since the 'skew' hint
is a separate hint, 'use_hash' is still an alternative.
WDYT?
I don't have a better hint name in mind yet and would like to hear more
suggestions about hint names.

>  As you mentioned in the flip, this solution depends on future changes to
calcite (and also upgrading calcite would be another possible big change:
at least calcite-1.30 vs 1.26, are we preparing to accept this big
change?).

Indeed, solution 1 depends on a Calcite upgrade.
I admit an upgrade from Calcite 1.26 to 1.30 would be a big change. I still
remember what we suffered in the last upgrade to Calcite 1.26.
However, we cannot always avoid upgrades, for the following reasons:
1. Other features also depend on the Calcite upgrade, for example Session
Window and Count Window.
2. If we always avoid Calcite upgrades, the gap with the latest version
keeps growing. One day, when upgrading becomes unavoidable, the pain will
be even greater.

WDYT?

>  Is there another possible way to minimize the change in calcite?

Did you check the 'Other Alternatives' part of FLIP-204? It gives
another solution which does not depend on a Calcite upgrade and does not
need to worry about the hint being lost during propagation.
This is also what we have done in the internal version.
The core idea is propagating the 'use_hash' hint to the TableScan with
matched table names. However, it is a little hacky.

> As I know there're more limitations than `Correlate`.

As mentioned before, in our internal version I chose the 'Other
Alternatives' solution from FLIP-204.
Although I did a POC of solution 1 and listed all the changes I found in
the FLIP, there may still be something I missed.
I'm very glad you pointed out that there are more limitations beyond
`Correlate`; would you please give more details on this part?

Best,
Jing Zhang

[1] https://docs.databricks.com/delta/join-performance/skew-join.html
[2]
https://help.aliyun.com/apsara/enterprise/v_3_13_0_20201215/odps/enterprise-ascm-user-guide/hotspot-tilt.html?spm=a2c4g.14484438.10001.669

Jing Zhang  于2021年12月29日周三 14:40写道:

> Hi Yuan and Lincoln,
> thanks a lot for the attention. I will answer the emails one by one.
>
> To Yuan
> > How shall we deal with CDC data? If there is CDC data in the pipeline,
> IMHO, shuffle by join key will cause CDC data disorder. Will it be better
> to use primary key in this case?
>
> Good question.
> The problem exists not only for CDC data sources, but also when
> the input stream is not an insert-only stream (for example, the result of
> an unbounded aggregate or a regular join).
> I think hashing by primary key is not a good choice. It could not raise
> the cache hit ratio because the cache key is the lookup key instead of the
> primary key of the input.
>
> To avoid wrong results, hash lookup join requires that the input stream
> be an insert-only stream or that its upsert keys contain the lookup keys.
>
> I've added this limitation to FLIP, thanks a lot for reminding.
>
> > If the shuffle keys can be customized  when users have the knowledge
> about distribution of data?
>
> I'm not sure I understand your question.
>
> Do you mean to support user defined partitioner function on keys just like
> flink DataStream sql?
> If yes, I'm afraid there is no plan to support this feature yet because
> the feature involves many things, for example:
> 1. sql syntax
> 2. user defined partitioner API
> 3. RelDistribution type extension and Flink RelDistribution extension
> 4. FlinkExpandConversionRule
> 5. Exchange execNode extension
> 6. 
> It needs careful design and more discussion. If this is a strong
> requirement, we would drive another discussion on this point individually.
> In th

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi Yuan and Lincoln,
thanks a lot for the attention. I will answer the emails one by one.

To Yuan
> How shall we deal with CDC data? If there is CDC data in the pipeline,
IMHO, shuffle by join key will cause CDC data disorder. Will it be better
to use primary key in this case?

Good question.
The problem exists not only for CDC data sources, but also when
the input stream is not an insert-only stream (for example, the result of
an unbounded aggregate or a regular join).
I think hashing by primary key is not a good choice. It could not raise
the cache hit ratio because the cache key is the lookup key instead of the
primary key of the input.

To avoid wrong results, hash lookup join requires that the input stream
be an insert-only stream or that its upsert keys contain the lookup keys.

I've added this limitation to FLIP, thanks a lot for reminding.

> If the shuffle keys can be customized  when users have the knowledge
about distribution of data?

I'm not sure I understand your question.

Do you mean to support user defined partitioner function on keys just like
flink DataStream sql?
If yes, I'm afraid there is no plan to support this feature yet because the
feature involves many things, for example:
1. sql syntax
2. user defined partitioner API
3. RelDistribution type extension and Flink RelDistribution extension
4. FlinkExpandConversionRule
5. Exchange execNode extension
6. 
It needs careful design and more discussion. If this is a strong
requirement, we would drive another discussion on this point individually.
In this FLIP, I would first support hash shuffle. WDYT?

Or do you mean supporting hashing by other keys instead of the lookup key?
If yes, would you please give me a specific use case?
We need to fetch records from the external storage of the dimension table
by lookup key, so those dimension table sources use lookup keys as the
cache key.
We could only increase the cache hit ratio by shuffling on the lookup keys.
I need more use cases to understand this requirement.

> Some connectors such as hive, caches all data in LookupFunction. How to
decrease the valid cache data size if data can be shuffled?

Very good idea.
There are two types of cache.
For key-value storage, such as Redis/HBase, the lookup table source stores
the visited lookup keys and their records into the cache lazily.
For other storage without keys, such as Hive, each task eagerly loads all
data into the cache in the initialization phase.
After introducing the hash partitioner, key-value storages need no change;
for Hive, each task only needs to load part of the data instead of all of
it.

We have implemented this optimization in our internal version.
The core idea is to push the partitioner information down to the lookup
table source. When loading data into the cache, each task only stores those
records whose lookup keys are routed to the current task.
We called this 'HashPartitionedCache'.

I have added this point to the lookup join requirements list in the
motivation of the FLIP, but I would not address it in this FLIP right now.
If this is a strong requirement, we need to drive another discussion on
this topic individually, because it involves many extensions to the API.

Best,
Jing Zhang


Lincoln Lee  于2021年12月29日周三 10:01写道:

> Hi Jing,
> Thanks for bringing up this discussion!  Agree that this join hint
> should benefit both bounded and unbounded cases as Martijn mentioned.
> I also agree that implementing the query hint is the right way for a more
> general purpose, since the dynamic table options have a limited scope.
> Some points I'd like to share are:
> 1. Regarding the hint name ‘USE_HASH’, could we consider more candidates?
> Things are a little different from RDBMS in the distributed world, and we
> also aim to solve the data skew problem, so all these incoming hints names
> should be considered together.
> 2. As you mentioned in the flip, this solution depends on future changes to
> calcite (and also upgrading calcite would be another possible big change:
> at least calcite-1.30 vs 1.26, are we preparing to accept this big
> change?). Is there another possible way to minimize the change in calcite?
> As far as I know, there are more limitations than `Correlate`.
>
> Best,
> Lincoln Lee
>
>
> Jing Zhang  于2021年12月28日周二 23:04写道:
>
> > Hi Martijn,
> > Thanks a lot for your attention.
> > I'm sorry I didn't explain the motivation clearly. I would like to
> explain
> > it in detail, and then give response on your questions.
> > A lookup join is typically used to enrich a table with data that is
> queried
> > from an external system. Many Lookup table sources introduce cache in
> order
> > to reduce the RPC call, such as JDBC, CSV, HBase connectors.
> > For those connectors, we could raise cache hit ratio by routing the same
> > lookup keys to the same task instance. This is the purpose of
> >
> >
> https://cwiki.apach

Re: [DISCUSS] Introduce Hash Lookup Join

2021-12-28 Thread Jing Zhang
Hi Martijn,
Thanks a lot for your attention.
I'm sorry I didn't explain the motivation clearly. I would like to explain
it in detail, and then give response on your questions.
A lookup join is typically used to enrich a table with data that is queried
from an external system. Many lookup table sources, such as the JDBC, CSV
and HBase connectors, introduce a cache in order to reduce RPC calls.
For those connectors, we could raise cache hit ratio by routing the same
lookup keys to the same task instance. This is the purpose of
https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
.
Other cases might benefit from hash distribution, such as batch hash join
as you mentioned. It is a cool idea; however, it is not the purpose of this
FLIP. We could discuss it in FLINK-20670
<https://issues.apache.org/jira/browse/FLINK-20670>.

> - When I was reading about this topic [1] I was wondering if this feature
would be more beneficial for bounded use cases and not so much for
unbounded use cases. What do you think?

As mentioned before, the purpose of hash lookup join is to increase the
cache hit ratio, which is different from the Oracle hash join. However, we
could use a similar hint syntax.

> - If I look at the current documentation for SQL Hints in Flink [2], I
notice that all of the hints there are located at the end of the SQL
statement. In the FLIP, the use_hash is defined directly after the 'SELECT'
keyword. Can we somehow make this consistent for the user? Or should the
user be able to specify hints anywhere in its SQL statement?

Calcite supports hints in two locations [3]:
Query Hint: right after the SELECT keyword;
Table Hint: right after the referenced table name.
Flink already supports dynamic table options based on Calcite's hint
framework, which is mentioned in doc [2].
Besides, query hints are also important: they give the optimizer a hint to
choose a better plan. Almost all popular databases and big-data engines
support SQL query hints, such as Oracle, Hive, Spark and so on.
I think using query hints in this case is more natural for users, WDYT?
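
To make the two hint locations concrete, a quick sketch (the OPTIONS hint
is Flink's existing dynamic table options mechanism, though the option key
shown here is only illustrative; the USE_HASH query hint is the one
proposed in FLIP-204):

-- query hint, right after the SELECT keyword (proposed):
SELECT /*+ USE_HASH('Orders', 'Customers') */ o.order_id, c.country
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;

-- table hint, right after the referenced table name (existing):
SELECT * FROM Customers /*+ OPTIONS('lookup.cache.max-rows' = '10000') */;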

I have updated the motivation part in the FLIP.
Thanks for the feedback!

[1] https://logicalread.com/oracle-11g-hash-joins-mc02/#.V5Wm4_mnoUI
[2]
https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/hints/
[3] https://calcite.apache.org/docs/reference.html#sql-hints

Best,
Jing Zhang

Martijn Visser  于2021年12月28日周二 22:02写道:

> Hi Jing,
>
> Thanks a lot for the explanation and the FLIP. I definitely learned
> something when reading more about `use_hash`. My interpretation would be
> that the primary benefit of a hash lookup join would be improved
> performance by allowing the user to explicitly optimise the planner.
>
> I have a couple of questions:
>
> - When I was reading about this topic [1] I was wondering if this feature
> would be more beneficial for bounded use cases and not so much for
> unbounded use cases. What do you think?
> - If I look at the current documentation for SQL Hints in Flink [2], I
> notice that all of the hints there are located at the end of the SQL
> statement. In the FLIP, the use_hash is defined directly after the 'SELECT'
> keyword. Can we somehow make this consistent for the user? Or should the
> user be able to specify hints anywhere in its SQL statement?
>
> Best regards,
>
> Martijn
>
> [1] https://logicalread.com/oracle-11g-hash-joins-mc02/#.V5Wm4_mnoUI
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/hints/
>
>
> On Tue, 28 Dec 2021 at 08:17, Jing Zhang  wrote:
>
> > Hi everyone,
> > Look up join
> > <
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
> > >[1]
> > is
> > a commonly used feature in Flink SQL. We have received many optimization
> > requirements on look up join. For example:
> > 1. Enforces left side of lookup join do a hash partitioner to raise cache
> > hit ratio
> > 2. Solves the data skew problem after introduces hash lookup join
> > 3. Enables mini-batch optimization to reduce RPC call
> >
> > Next we will solve these problems one by one. First, we will focus on
> > point 1, and continue to discuss points 2 and 3 later.
> >
> > There are many similar requirements from user mail list and JIRA about
> hash
> > Lookup Join, for example:
> > 1. FLINK-23687 <https://issues.apache.org/jira/browse/FLINK-23687> -
> > Introduce partitioned lookup join to enforce input of LookupJoin to hash
> > shuffle by lookup keys
> > 2. FLINK-25396 <https://issues.apache.org/jira/browse/FLINK-25396> -
> > lookupjoin source table for pre-partitioning
> > 3. FLINK-25262 <https://issues.apache.or

[DISCUSS] Introduce Hash Lookup Join

2021-12-27 Thread Jing Zhang
Hi everyone,
Look up join
<https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join>[1]
is a
commonly used feature in Flink SQL. We have received many optimization
requirements on look up join. For example:
1. Enforce the left side of the lookup join to do a hash partition to
raise the cache hit ratio
2. Solve the data skew problem after introducing hash lookup join
3. Enable mini-batch optimization to reduce RPC calls

Next we will solve these problems one by one. First, we will focus on
point 1, and continue to discuss points 2 and 3 later.

There are many similar requirements from the user mailing list and JIRA
about hash lookup join, for example:
1. FLINK-23687 <https://issues.apache.org/jira/browse/FLINK-23687> -
Introduce partitioned lookup join to enforce input of LookupJoin to hash
shuffle by lookup keys
2. FLINK-25396 <https://issues.apache.org/jira/browse/FLINK-25396> -
lookupjoin source table for pre-partitioning
3. FLINK-25262 <https://issues.apache.org/jira/browse/FLINK-25262> -
Support to send data to lookup table for KeyGroupStreamPartitioner way for
SQL.

In this FLIP, I would like to start a discussion about hash lookup join.
The core idea is introducing a 'USE_HASH' hint in the query. This syntax is
directly user-oriented and therefore requires careful design.
There are two ways to propagate this hint to the LookupJoin in the
optimizer. We need further discussion to make a final decision. Anyway, the
difference between the two solutions is only in the internal implementation
and has no impact on the user.
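
As a quick sketch of the proposed usage (the table and column names here
are only illustrative; the exact syntax is what this discussion should
settle):

SELECT /*+ USE_HASH('Orders', 'Customers') */ o.order_id, c.country
FROM Orders AS o
JOIN Customers FOR SYSTEM_TIME AS OF o.proc_time AS c
ON o.customer_id = c.id;

-- the hint asks the optimizer to hash-partition the probe side ('Orders')
-- by the lookup key, so that identical keys land on the same task and the
-- lookup cache hit ratio increases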

For more detail on the proposal:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join


Looking forward to your feedback, thanks.

Best,
Jing Zhang

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join


Re: [VOTE] FLIP-190: Support Version Upgrades for Table API & SQL Programs

2021-12-14 Thread Jing Zhang
+1

Thanks for driving this, Timo.

Best,
Jing Zhang

wenlong.lwl  于2021年12月15日周三 13:31写道:

> +1 (non-binding). Thanks for driving this Timo
>
>
> Best,
> Wenlong
>
> On Wed, 15 Dec 2021 at 03:21, Martijn Visser 
> wrote:
>
> > +1 (non-binding). Thanks for driving this Timo
> >
> > Op di 14 dec. 2021 om 17:45 schreef Timo Walther 
> >
> > > Hi everyone,
> > >
> > > I'd like to start a vote on FLIP-190: Support Version Upgrades for
> Table
> > > API & SQL Programs [1] which has been discussed in this thread [2].
> > >
> > > The vote will be open for at least 72 hours unless there is an
> objection
> > > or not enough votes.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/KZBnCw
> > > [2] https://lists.apache.org/thread/n8v32j6o3d50mpblxydbz82q1q436ob4
> > >
> > > Cheers,
> > > Timo
> > >
> > --
> >
> > Martijn Visser | Product Manager
> >
> > mart...@ververica.com
> >
> > <https://www.ververica.com/>
> >
> >
> > Follow us @VervericaData
> >
> > --
> >
> > Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> > Conference
> >
> > Stream Processing | Event Driven | Real Time
> >
>


Re: [DISCUSS] Immediate dedicated Flink releases for log4j vulnerability

2021-12-13 Thread Jing Zhang
+1 for the quick release.

Till Rohrmann  于2021年12月13日周一 17:54写道:

> +1
>
> Cheers,
> Till
>
> On Mon, Dec 13, 2021 at 10:42 AM Jing Ge  wrote:
>
> > +1
> >
> > As I suggested to publish the blog post last week asap, users have been
> > keen to have such urgent releases. Many thanks for it.
> >
> >
> >
> > On Mon, Dec 13, 2021 at 8:29 AM Konstantin Knauf 
> > wrote:
> >
> > > +1
> > >
> > > I didn't think this was necessary when I published the blog post on
> > Friday,
> > > but this has made higher waves than I expected over the weekend.
> > >
> > >
> > >
> > > On Mon, Dec 13, 2021 at 8:23 AM Yuan Mei 
> wrote:
> > >
> > > > +1 for quick release.
> > > >
> > > > On Mon, Dec 13, 2021 at 2:55 PM Martijn Visser <
> mart...@ververica.com>
> > > > wrote:
> > > >
> > > > > +1 to address the issue like this
> > > > >
> > > > > On Mon, 13 Dec 2021 at 07:46, Jingsong Li 
> > > > wrote:
> > > > >
> > > > > > +1 for fixing it in these versions and doing quick releases.
> Looks
> > > good
> > > > > to
> > > > > > me.
> > > > > >
> > > > > > Best,
> > > > > > Jingsong
> > > > > >
> > > > > > On Mon, Dec 13, 2021 at 2:18 PM Becket Qin  >
> > > > wrote:
> > > > > > >
> > > > > > > +1. The solution sounds good to me. There have been a lot of
> > > > inquiries
> > > > > > > about how to react to this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jiangjie (Becket) Qin
> > > > > > >
> > > > > > > On Mon, Dec 13, 2021 at 12:40 PM Prasanna kumar <
> > > > > > > prasannakumarram...@gmail.com> wrote:
> > > > > > >
> > > > > > > > 1+ for making Updates for 1.12.5 .
> > > > > > > > We are looking for fix in 1.12 version.
> > > > > > > > Please notify once the fix is done.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 13, 2021 at 9:45 AM Leonard Xu <
> xbjt...@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 for the quick release and the special vote period 24h.
> > > > > > > > >
> > > > > > > > > > 2021年12月13日 上午11:49,Dian Fu  写道:
> > > > > > > > > >
> > > > > > > > > > +1 for the proposal and creating a quick release.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Dian
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 13, 2021 at 11:15 AM Kyle Bendickson <
> > > > > k...@tabular.io>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> +1 to doing a release for this widely publicized
> > > > vulnerability.
> > > > > > > > > >>
> > > > > > > > > >> In my experience, users will often update to the latest
> > > minor
> > > > > > patch
> > > > > > > > > version
> > > > > > > > > >> without much fuss. Plus, users have also likely heard
> > about
> > > > this
> > > > > > and
> > > > > > > > > will
> > > > > > > > > >> appreciate a simple fix (updating their version where
> > > > possible).
> > > > > > > > > >>
> > > > > > > > > >> The work-around will need to still be noted for users
> who
> > > > can’t
> > > > > > > > upgrade
> > > > > > > > > for
> > > > > > > > > >> whatever reason (EMR hasn’t caught up, etc).
> > > > > > > > > >>
> > > > > > > > > >> I also agree with your assessment to apply a patch on
> each
> > > of
> > > > > > those
> > > > > > > > > >> previous versions with only the log4j commit, so that
> they
> > > > don’t
> > > > > > need
> > > > > > > > > to be
> > > > > > > > > >> as rigorously tested.
> > > > > > > > > >>
> > > > > > > > > >> Best,
> > > > > > > > > >> Kyle (GitHub @kbendick)
> > > > > > > > > >>
> > > > > > > > > >> On Sun, Dec 12, 2021 at 2:23 PM Stephan Ewen <
> > > > se...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hi all!
> > > > > > > > > >>>
> > > > > > > > > >>> Without doubt, you heard about the log4j vulnerability
> > [1].
> > > > > > > > > >>>
> > > > > > > > > >>> There is an advisory blog post on how to mitigate this
> in
> > > > > Apache
> > > > > > > > Flink
> > > > > > > > > >> [2],
> > > > > > > > > >>> which involves setting a config option and restarting
> the
> > > > > > processes.
> > > > > > > > > That
> > > > > > > > > >>> is fortunately a relatively simple fix.
> > > > > > > > > >>>
> > > > > > > > > >>> Despite this workaround, I think we should do an
> > immediate
> > > > > > release
> > > > > > > > with
> > > > > > > > > >> the
> > > > > > > > > >>> updated dependency. Meaning not waiting for the next
> bug
> > > fix
> > > > > > releases
> > > > > > > > > >>> coming in a few weeks, but releasing asap.
> > > > > > > > > >>> The mood I perceive in the industry is pretty much
> > panicky
> > > > over
> > > > > > this,
> > > > > > > > > >> and I
> > > > > > > > > >>> expect we will see many requests for a patched release
> > and
> > > > many
> > > > > > > > > >> discussions
> > > > > > > > > >>> why the workaround alone would not be enough due to
> > certain
> > > > > > > > guidelines.
> > > > > > > > > >>>
> > > > > > > > > >>> I suggest that we preempt those discussions and create
> > > > releases
> > > > > > the
> > > > > > > > > >>> 

Re: [DISCUSS] FLIP-190: Support Version Upgrades for Table API & SQL Programs

2021-12-13 Thread Jing Zhang
Hi Timo,

+1 for the improvement too.
Please count me in when assigning subtasks.

Best,
Jing Zhang

Timo Walther  于2021年12月13日周一 17:00写道:

> Hi everyone,
>
> *last call for feedback* on this FLIP. Otherwise I would start a VOTE by
> tomorrow.
>
> @Wenlong: Thanks for offering your help. Once the FLIP has been
> accepted. I will create a list of subtasks that we can split among
> contributors. Many can be implemented in parallel.
>
> Regards,
> Timo
>
>
> On 13.12.21 09:20, wenlong.lwl wrote:
> > Hi, Timo, +1 for the improvement too. Thanks for the great job.
> >
> > Looking forward to the next progress of the FLIP, I could help on the
> > development of some of the specific improvements.
> >
> > Best,
> > Wenlong
> >
> > On Mon, 13 Dec 2021 at 14:43, godfrey he  wrote:
> >
> >> Hi Timo,
> >>
> >> +1 for the improvement.
> >>
> >> Best,
> >> Godfrey
> >>
> >> Timo Walther  于2021年12月10日周五 20:37写道:
> >>>
> >>> Hi Wenlong,
> >>>
> >>> yes it will. Sorry for the confusion. This is a logical consequence of
> >>> the assumption:
> >>>
> >>> The JSON plan contains no implementation details (i.e. no classes) and
> >>> is fully declarative.
> >>>
> >>> I will add a remark.
> >>>
> >>> Thanks,
> >>> Timo
> >>>
> >>>
> >>> On 10.12.21 11:43, wenlong.lwl wrote:
> >>>> hi, Timo, thanks for the explanation. I totally agree with what you
> >> said.
> >>>> My actual question is: Will the version of an exec node be serialised
> >> in
> >>>> the Json Plan? In my understanding, it is not in the former design. If
> >> it
> >>>> is yes, my question is solved already.
> >>>>
> >>>>
> >>>> Best,
> >>>> Wenlong
> >>>>
> >>>>
> >>>> On Fri, 10 Dec 2021 at 18:15, Timo Walther 
> wrote:
> >>>>
> >>>>> Hi Wenlong,
> >>>>>
> >>>>> also thought about adding a `flinkVersion` field per ExecNode. But
> >> this
> >>>>> is not necessary, because the `version` of the ExecNode has the same
> >>>>> purpose.
> >>>>>
> >>>>> The plan version just encodes that:
> >>>>> "plan has been updated in Flink 1.17" / "plan is entirely valid for
> >>>>> Flink 1.17"
> >>>>>
> >>>>> The ExecNode version maps to `minStateVersion` to verify state
> >>>>> compatibility.
> >>>>>
> >>>>> So even if the plan version is 1.17, some ExecNodes use state layout
> >> of
> >>>>> 1.15.
> >>>>>
> >>>>> It is totally fine to only update the ExecNode to version 2 and not 3
> >> in
> >>>>> your example.
> >>>>>
> >>>>> Regards,
> >>>>> Timo
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 10.12.21 06:02, wenlong.lwl wrote:
> >>>>>> Hi, Timo, thanks for updating the doc.
> >>>>>>
> >>>>>> I have a comment on plan migration:
> >>>>>> I think we may need to add a version field for every exec node when
> >>>>>> serialising. In earlier discussions, I think we have a conclusion
> >> that
> >>>>>> treating the version of plan as the version of node, but in this
> >> case it
> >>>>>> would be broken.
> >>>>>> Take the following example in FLIP into consideration, there is a
> bad
> >>>>> case:
> >>>>>> when in 1.17, we introduced an incompatible version 3 and dropped
> >> version
> >>>>>> 1, we can only update the version to 2, so the version should be per
> >> exec
> >>>>>> node.
> >>>>>>
> >>>>>> ExecNode version *1* is not supported anymore. Even though the state
> >> is
> >>>>>> actually compatible. The plan restore will fail with a helpful
> >> exception
> >>>>>> that forces users to perform plan migration.
> >>>>>>
> >>>>>> COMPILE PLAN '/mydir/plan_new.json' FROM '/mydir/plan_old.json';
> >>

[jira] [Created] (FLINK-25258) Update log4j2 version to avoid

2021-12-10 Thread Jing Zhang (Jira)
Jing Zhang created FLINK-25258:
--

 Summary: Update log4j2 version to avoid 
 Key: FLINK-25258
 URL: https://issues.apache.org/jira/browse/FLINK-25258
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.14.0, 1.13.0, 1.12.0, 1.11.0
Reporter: Jing Zhang
 Fix For: 1.15.0


Apache Log4j2 versions 2.0 <= x <= 2.14.1 have an RCE zero-day.

https://www.cyberkendra.com/2021/12/worst-log4j-rce-zeroday-dropped-on.html

https://www.lunasec.io/docs/blog/log4j-zero-day/

Flink has switched to Log4j 2 since 1.11 version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-190: Support Version Upgrades for Table API & SQL Programs

2021-12-08 Thread Jing Zhang
Hi Timo,
Thanks a lot for driving this discussion.
I believe it could solve many of the problems we suffer from when upgrading.

I only have a small complaint about the following point.

> For simplification of the design, we assume that upgrades use a step size
of a single minor version. We don't guarantee skipping minor versions (e.g.
1.11 to
1.14).

In our internal production environment, we catch up with the community's
latest stable release roughly once a year, because upgrading to a new
version is a time-consuming process.
So we might have skipped 1~3 minor versions by the time we upgrade to the
latest version. This probably happens at other companies too.
Could we guarantee that FLIP-190 works as long as we skip fewer minor
versions than a specified threshold?
Then we could know which version is suitable for us when preparing an
upgrade.

Best,
Jing Zhang

godfrey he  于2021年12月8日周三 22:16写道:

> Hi Timo,
>
> Thanks for the explanation, it's much clearer now.
>
> One thing I want to confirm about `supportedPlanFormat `
> and `supportedSavepointFormat `:
> `supportedPlanFormat ` supports multiple versions,
> while `supportedSavepointFormat ` supports only one version ?
> A JSON plan can be deserialized by multiple versions
> because default values will be set for new fields.
> In theory, a Savepoint can be restored by more than one version
> of the operators even if a state layout is changed,
> such as deleting a whole state and starting job with
> `allowNonRestoredState`=true.
> I think this is a corner case, and it's hard to understand compared
> to `supportedPlanFormat ` supporting multiple versions.
> So, for most cases, when the state layout is changed, the savepoint is
> incompatible,
> and `supportedSavepointFormat` and version need to be changed.
>
> I think we need a detailed explanation of the annotation change story in
> the Javadoc of the `ExecNodeMetadata` class for all developers
> (esp. those unfamiliar with this part).
>
> Best,
> Godfrey
>
> Timo Walther  于2021年12月8日周三 下午4:57写道:
> >
> > Hi Wenlong,
> >
> > thanks for the feedback. Great that we reached consensus here. I will
> > update the entire document with my previous example shortly.
> >
> >  > if we don't update the version when plan format changes, we can't
> > find that the plan can't not be deserialized in 1.15
> >
> > This should not be a problem as the entire plan file has a version as
> > well. We should not allow reading a 1.16 plan in 1.15. We can throw a
> > helpful exception early.
> >
> > Reading a 1.15 plan in 1.16 is possible until we drop the old
> > `supportedPlanFormat` from one of used ExecNodes. Afterwards all
> > `supportedPlanFormat` of ExecNodes must be equal or higher then the plan
> > version.
> >
> > Regards,
> > Timo
> >
> > On 08.12.21 03:07, wenlong.lwl wrote:
> > > Hi, Timo,  +1 for multi metadata.
> > >
> > > The compatible change I mean in the last email is the slight state
> change
> > > example you gave, so we have got  consensus on this actually, IMO.
> > >
> > > Another question based on the example you gave:
> > > In the example "JSON node gets an additional property in 1.16", if we
> don't
> > > update the version when plan format changes, we can't find that the
> plan
> > > can't be deserialized in 1.15, although the savepoint state is
> > > compatible.
> > > The error message may be not so friendly if we just throw
> deserialization
> > > failure.
> > >
> > > On Tue, 7 Dec 2021 at 16:49, Timo Walther  wrote:
> > >
> > >> Hi Wenlong,
> > >>
> > >>   > First,  we add a newStateLayout because of some improvement in
> state, in
> > >>   > order to keep compatibility we may still keep the old state for
> the
> > >> first
> > >>   > version. We need to update the version, so that we can generate a
> new
> > >>   > version plan for the new job and keep the exec node compatible
> with
> > >> the old
> > >>   > version plan.
> > >>
> > >> The problem that I see here for contributors is that the actual update
> > >> of a version is more complicated than just updating an integer value.
> It
> > >> means copying a lot of ExecNode code for a change that happens locally
> > >> in an operator. Let's assume multiple ExecNodes use a similar
> operator.
> > >> Why do we need to update all ExecNode versions, if the operator itself
> > >> can deal with the incompatibility. The ExecNode version is meant for
> > >> topology ch

Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk

2021-12-02 Thread Jing Zhang
Congratulations, Ingo!

刘建刚  于2021年12月3日周五 11:52写道:

> Congratulations!
>
> Best,
> Liu Jiangang
>
> Till Rohrmann  于2021年12月2日周四 下午11:24写道:
>
> > Hi everyone,
> >
> > On behalf of the PMC, I'm very happy to announce Ingo Bürk as a new Flink
> > committer.
> >
> > Ingo has started contributing to Flink since the beginning of this year.
> He
> > worked mostly on SQL components. He has authored many PRs and helped
> review
> > a lot of other PRs in this area. He actively reported issues and helped
> our
> > users on the MLs. His most notable contributions were Support SQL 2016
> JSON
> > functions in Flink SQL (FLIP-90), Register sources/sinks in Table API
> > (FLIP-129) and various other contributions in the SQL area. Moreover, he
> is
> > one of the few people in our community who actually understands Flink's
> > frontend.
> >
> > Please join me in congratulating Ingo for becoming a Flink committer!
> >
> > Cheers,
> > Till
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl

2021-12-02 Thread Jing Zhang
Congratulations, Matthias!

刘建刚  于2021年12月3日周五 11:51写道:

> Congratulations!
>
> Best,
> Liu Jiangang
>
> Till Rohrmann  于2021年12月2日周四 下午11:28写道:
>
> > Hi everyone,
> >
> > On behalf of the PMC, I'm very happy to announce Matthias Pohl as a new
> > Flink committer.
> >
> > Matthias has worked on Flink since August last year. He helped review a
> ton
> > of PRs. He worked on a variety of things but most notably the tracking
> and
> > reporting of concurrent exceptions, fixing HA bugs and deprecating and
> > removing our Mesos support. He actively reports issues helping Flink to
> > improve and he is actively engaged in Flink's MLs.
> >
> > Please join me in congratulating Matthias for becoming a Flink committer!
> >
> > Cheers,
> > Till
> >
>


Re: [ANNOUNCE] Open source of remote shuffle project for Flink batch data processing

2021-11-30 Thread Jing Zhang
Amazing!
Remote shuffle service is an important improvement to the batch data
processing experience on Flink.
It is also a strong requirement for our internal batch business; we will
try it soon and give you feedback.

Best,
Jing Zhang

Martijn Visser  于2021年12月1日周三 上午3:25写道:

> Hi Yingjie,
>
> This is great, thanks for sharing. Will you also add it to
> https://flink-packages.org/ ?
>
> Best regards,
>
> Martijn
>
> On Tue, 30 Nov 2021 at 17:31, Till Rohrmann  wrote:
>
> > Great news, Yingjie. Thanks a lot for sharing this information with the
> > community and kudos to all the contributors of the external shuffle
> service
> > :-)
> >
> > Cheers,
> > Till
> >
> > On Tue, Nov 30, 2021 at 2:32 PM Yingjie Cao 
> > wrote:
> >
> > > Hi dev & users,
> > >
> > > We are happy to announce the open source of remote shuffle project [1]
> > for
> > > Flink. The project is originated in Alibaba and the main motivation is
> to
> > > improve batch data processing for both performance & stability and
> > further
> > > embrace cloud native. For more features about the project, please refer
> > to
> > > [1].
> > >
> > > Before going open source, the project has been used widely in
> production
> > > and it behaves well on both stability and performance. We hope you
> enjoy
> > > it. Collaborations and feedbacks are highly appreciated.
> > >
> > > Best,
> > > Yingjie on behalf of all contributors
> > >
> > > [1] https://github.com/flink-extended/flink-remote-shuffle
> > >
> >
>


Re: [VOTE][FLIP-195] Improve the name and structure of vertex and operator name for job

2021-11-23 Thread Jing Zhang
+1 (non-binding)

Best,
Jing Zhang

Martijn Visser  于2021年11月23日周二 下午7:42写道:

> +1 (non-binding)
>
> On Tue, 23 Nov 2021 at 12:13, Aitozi  wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > Aitozi
> >
> > wenlong.lwl  于2021年11月23日周二 下午4:00写道:
> >
> > > Hi everyone,
> > >
> > > Based on the discussion[1], we seem to have consensus, so I would like
> to
> > > start a vote on FLIP-195 [2].
> > > Thanks for all of your feedback.
> > >
> > > The vote will last for at least 72 hours (Nov 26th 16:00 GMT) unless
> > > there is an objection or insufficient votes.
> > >
> > > [1] https://lists.apache.org/thread/kvdxr8db0l5s6wk7hwlt0go5fms99b8t
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-195%3A+Improve+the+name+and+structure+of+vertex+and+operator+name+for+job
> > >
> > > Best,
> > > Wenlong Lyu
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Yingjie Cao

2021-11-17 Thread Jing Zhang
Congratulations!

Best,
Jing Zhang

Dian Fu  于2021年11月17日周三 下午8:07写道:

> Congratulations!
>
> Regards,
> Dian
>
> On Wed, Nov 17, 2021 at 5:14 PM Yang Wang  wrote:
>
> > Congratulations, Yiingjie!
> >
> > Best,
> > Yang
> >
> > Zakelly Lan  于2021年11月17日周三 下午4:26写道:
> >
> > > Congratulations, Yiingjie  :)
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Wed, Nov 17, 2021 at 2:58 PM Lincoln Lee 
> > > wrote:
> > >
> > > > Congratulations!
> > > >
> > > > Best,
> > > > Lincoln Lee
> > > >
> > > >
> > > > Zhilong Hong  于2021年11月17日周三 下午2:23写道:
> > > >
> > > > > Congratulations, Yiingjie!
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Zhilong
> > > > >
> > > > > On Wed, Nov 17, 2021 at 2:13 PM Yuepeng Pan 
> wrote:
> > > > >
> > > > > > Congratulations !
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Yuepeng Pan.
> > > > > >
> > > > > >
> > > > > > At 2021-11-17 12:55:29, "Guowei Ma" 
> wrote:
> > > > > > >Hi everyone,
> > > > > > >
> > > > > > >On behalf of the PMC, I'm very happy to announce Yingjie Cao as
> a
> > > new
> > > > > > Flink
> > > > > > >committer.
> > > > > > >
> > > > > > >Yingjie has submitted 88 PRs since he joined the Flink community
> > for
> > > > > more
> > > > > > >than 2 years. In general, his main contributions are
> concentrated
> > in
> > > > > > >Flink's Shuffle. Yingjie has done a lot of work in promoting the
> > > > > > >performance and stability of TM Blocking Shuffle[1][2];In order
> to
> > > > allow
> > > > > > >Flink to use External/Remote Shuffle Service in batch production
> > > > > > scenarios,
> > > > > > >he also improves the Flink Shuffle architecture in 1.14 [3]. At
> > the
> > > > same
> > > > > > >time, he has also done some related work in Streaming Shuffle,
> > such
> > > as
> > > > > > >Buffer Management[4] , Non-Blocking Network Output and some
> > > important
> > > > > bug
> > > > > > >fixes.
> > > > > > >
> > > > > > >Please join me in congratulating Yingjie for becoming a Flink
> > > > committer!
> > > > > > >
> > > > > > >[1]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-148%3A+Introduce+Sort-Based+Blocking+Shuffle+to+Flink
> > > > > > >[2] https://flink.apache.org/2021/10/26/sort-shuffle-part1.html
> > > > > > >[3]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-184%3A+Refine+ShuffleMaster+lifecycle+management+for+pluggable+shuffle+service+framework
> > > > > > >[4] https://issues.apache.org/jira/browse/FLINK-16428
> > > > > > >
> > > > > > >Best,
> > > > > > >Guowei
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Jing Zhang

2021-11-16 Thread Jing Zhang
Thanks to everyone. It's my honor to work in community with you all.

Best,
Jing Zhang

Zakelly Lan  于2021年11月17日周三 上午12:06写道:

> Congratulations,  Jing!
>
> Best,
> Zakelly
>
> On Tue, Nov 16, 2021 at 11:03 PM Yang Wang  wrote:
>
>> Congratulations,  Jing!
>>
>> Best,
>> Yang
>>
>> Benchao Li  于2021年11月16日周二 下午9:31写道:
>>
>> > Congratulations Jing~
>> >
>> > OpenInx  于2021年11月16日周二 下午1:58写道:
>> >
>> > > Congrats Jing!
>> > >
>> > > On Tue, Nov 16, 2021 at 11:59 AM Terry Wang 
>> wrote:
>> > >
>> > > > Congratulations,  Jing!
>> > > > Well deserved!
>> > > >
>> > > > Best,
>> > > > Terry Wang
>> > > >
>> > > >
>> > > >
>> > > > > 2021年11月16日 上午11:27,Zhilong Hong  写道:
>> > > > >
>> > > > > Congratulations, Jing!
>> > > > >
>> > > > > Best regards,
>> > > > > Zhilong Hong
>> > > > >
>> > > > > On Mon, Nov 15, 2021 at 9:41 PM Martijn Visser <
>> > mart...@ververica.com>
>> > > > > wrote:
>> > > > >
>> > > > >> Congratulations Jing!
>> > > > >>
>> > > > >> On Mon, 15 Nov 2021 at 14:39, Timo Walther 
>> > > wrote:
>> > > > >>
>> > > > >>> Hi everyone,
>> > > > >>>
>> > > > >>> On behalf of the PMC, I'm very happy to announce Jing Zhang as a
>> > new
>> > > > >>> Flink committer.
>> > > > >>>
>> > > > >>> Jing has been very active in the Flink community esp. in the
>> > > Table/SQL
>> > > > >>> area for quite some time: 81 PRs [1] in total and is also
>> active on
>> > > > >>> answering questions on the user mailing list. She is currently
>> > > > >>> contributing a lot around the new windowing table-valued
>> functions
>> > > [2].
>> > > > >>>
>> > > > >>> Please join me in congratulating Jing Zhang for becoming a Flink
>> > > > >> committer!
>> > > > >>>
>> > > > >>> Thanks,
>> > > > >>> Timo
>> > > > >>>
>> > > > >>> [1] https://github.com/apache/flink/pulls/beyond1920
>> > > > >>> [2] https://issues.apache.org/jira/browse/FLINK-23997
>> > > > >>>
>> > > > >>
>> > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> >
>> > Best,
>> > Benchao Li
>> >
>>
>


Re: [ANNOUNCE] New Apache Flink Committer - Fabian Paul

2021-11-15 Thread Jing Zhang
Congratulations Fabian!

Best,
Jing Zhang

Yuepeng Pan  于2021年11月16日周二 上午10:16写道:

> Congratulations Fabian!
>
> Best,
> Yuepeng Pan
>
> At 2021-11-15 21:17:13, "Arvid Heise"  wrote:
> >Hi everyone,
> >
> >On behalf of the PMC, I'm very happy to announce Fabian Paul as a new
> Flink
> >committer.
> >
> >Fabian Paul has been actively improving the connector ecosystem by
> >migrating Kafka and ElasticSearch to the Sink interface and is currently
> >driving FLIP-191 [1] to tackle the sink compaction issue. While he is
> >active on the project (authored 70 PRs and reviewed 60), it's also worth
> >highlighting that he has also been guiding external efforts, such as the
> >DeltaLake Flink connector or the Pinot sink in Bahir.
> >
> >Please join me in congratulating Fabian for becoming a Flink committer!
> >
> >Best,
> >
> >Arvid
> >
> >[1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
>


Re: [ANNOUNCE] New Apache Flink Committer - Yangze Guo

2021-11-11 Thread JING ZHANG
Congrats!

Best,
Jing Zhang

Lijie Wang  于2021年11月12日周五 下午1:25写道:

> Congrats Yangze!
>
> Best,
> Lijie
>
> Yuan Mei  于2021年11月12日周五 下午1:10写道:
>
> > Congrats Yangze!
> >
> > Best
> > Yuan
> >
> > On Fri, Nov 12, 2021 at 11:57 AM Leonard Xu  wrote:
> >
> > > Congrats & well deserved, Yangze!
> > >
> > > Best,
> > > Leonard
> > >
> > > > 在 2021年11月12日,11:51,godfrey he  写道:
> > > >
> > > > Congrats, Yangze!
> > > >
> > > > Best,
> > > > Godfrey
> > > >
> > > > Yuepeng Pan  于2021年11月12日周五 上午10:49写道:
> > > >>
> > > >> Congrats.
> > > >>
> > > >>
> > > >> Best,
> > > >> Yuepeng Pan.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> At 2021-11-12 10:10:43, "Xintong Song" 
> wrote:
> > > >>> Hi everyone,
> > > >>>
> > > >>> On behalf of the PMC, I'm very happy to announce Yangze Guo as a
> new
> > > Flink
> > > >>> committer.
> > > >>>
> > > >>> Yangze has been consistently contributing to this project for
> almost
> > 3
> > > >>> years. His contributions are mainly in the resource management and
> > > >>> deployment areas, represented by the fine-grained resource
> management
> > > and
> > > >>> external resource framework. In addition to feature works, he's
> also
> > > active
> > > >>> in miscellaneous contributions, including PR reviews, document
> > > enhancement,
> > > >>> mailing list services and meetup/FF talks.
> > > >>>
> > > >>> Please join me in congratulating Yangze Guo for becoming a Flink
> > > committer!
> > > >>>
> > > >>> Thank you~
> > > >>>
> > > >>> Xintong Song
> > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Leonard Xu

2021-11-11 Thread JING ZHANG
Congrats!

Best,
Jing Zhang

Lijie Wang  于2021年11月12日周五 下午1:26写道:

> Congrats Leonard!
>
> Best,
> Lijie
>
> Yuan Mei  于2021年11月12日周五 下午1:11写道:
>
> > Congrats Leonard!
> >
> > Best
> > Yuan
> >
> >
> > On Fri, Nov 12, 2021 at 12:21 PM Paul Lam  wrote:
> >
> > > Congrats! Well deserved!
> > >
> > > Best,
> > > Paul Lam
> > >
> > > > 2021年11月12日 12:16,Yangze Guo  写道:
> > > >
> > > > Congrats, Leonard!
> > > >
> > > > Jingsong Li  于 2021年11月12日周五 下午12:15写道:
> > > >
> > > >> Congrats & well deserved, Leonard!
> > > >>
> > > >> Thank you for your exciting contributions to Flink!
> > > >>
> > > >> Best,
> > > >> Jingsong
> > > >>
> > > >> On Fri, Nov 12, 2021 at 12:13 PM Jark Wu  wrote:
> > > >>>
> > > >>> Hi everyone,
> > > >>>
> > > >>> On behalf of the PMC, I'm very happy to announce Leonard Xu as a
> new
> > > >> Flink
> > > >>> committer.
> > > >>>
> > > >>> Leonard has been a very active contributor for more than two year,
> > > >> authored
> > > >>> 150+ PRs and reviewed many PRs which is quite outstanding.
> > > >>> Leonard mainly works on Flink SQL parts and drives several
> important
> > > >> FLIPs,
> > > >>> e.g. FLIP-132 (temporal table join) and FLIP-162 (correct time
> > > >> behaviors).
> > > >>> He is also the maintainer of flink-cdc-connectors[1] project which
> > > helps
> > > >> a
> > > >>> lot for users building a real-time data warehouse and data lake.
> > > >>>
> > > >>> Please join me in congratulating Leonard for becoming a Flink
> > > committer!
> > > >>>
> > > >>> Cheers,
> > > >>> Jark Wu
> > > >>>
> > > >>> [1]: https://github.com/ververica/flink-cdc-connectors
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Best, Jingsong Lee
> > > >>
> > >
> > >
> >
>


Re: [VOTE] FLIP-188 Introduce Built-in Dynamic Table Storage

2021-11-11 Thread JING ZHANG
+1 (non-binding)

A small suggestion:
A message queue is currently used to store the intermediate-layer data of
the streaming data warehouse, and we hope to use the built-in dynamic table
storage to store those intermediate layers instead.
However, the intermediate data of a streaming data warehouse is often
provided to all business teams in a company, and some teams do not use
Apache Flink as their compute engine yet. In order to continue serving
those teams, the data in the built-in dynamic table storage may need to be
copied back to the message queue.
If *the built-in storage could provide the same consumer API as the
commonly used message queues*, this data copying could be avoided, and the
built-in dynamic table storage could be adopted faster in streaming data
warehouse businesses.
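To make the suggestion concrete, here is a minimal sketch of what a
message-queue-style consumer API over the table storage's changelog might
look like. All names here (TableLogConsumer, InMemoryTableLog, poll,
position) are invented for illustration only and are not part of FLIP-188
or any Flink API; the point is simply that non-Flink consumers could read
the changelog with poll/position semantics familiar from message queues.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical consumer-facing view of a dynamic table's changelog.
interface TableLogConsumer {
    List<String> poll(int maxRecords); // read the next batch of changelog records
    long position();                   // current read offset, as in a MQ consumer
}

// Toy in-memory stand-in for the built-in table storage.
class InMemoryTableLog implements TableLogConsumer {
    private final List<String> changelog = new ArrayList<>();
    private int offset = 0;

    void append(String record) {
        changelog.add(record);
    }

    @Override
    public List<String> poll(int maxRecords) {
        int end = Math.min(offset + maxRecords, changelog.size());
        List<String> batch = new ArrayList<>(changelog.subList(offset, end));
        offset = end;
        return batch;
    }

    @Override
    public long position() {
        return offset;
    }
}

public class ConsumerApiSketch {
    public static void main(String[] args) {
        InMemoryTableLog log = new InMemoryTableLog();
        log.append("+I[order-1]");
        log.append("+U[order-1]");
        // A team that does not run Flink consumes the table like a queue:
        System.out.println(log.poll(10));   // [+I[order-1], +U[order-1]]
        System.out.println(log.position()); // 2
    }
}
```

With such an interface, the middle-layer tables could be read directly by
downstream teams without first copying the data back into a message queue.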

Best regards,
Jing Zhang

Yufei Zhang  于2021年11月11日周四 上午9:34写道:

> Hi,
>
> +1 (non-binding)
>
> Very interesting design. I saw a lot of discussion on the generic
> interface design, good to know it will address extensibility.
>
> Cheers,
> Yufei
>
>
> On 2021/11/10 02:51:55 Jingsong Li wrote:
> > Hi everyone,
> >
> > Thanks for all the feedback so far. Based on the discussion[1] we seem
> > to have consensus, so I would like to start a vote on FLIP-188 for
> > which the FLIP has now also been updated[2].
> >
> > The vote will last for at least 72 hours (Nov 13th 3:00 GMT) unless
> > there is an objection or insufficient votes.
> >
> > [1] https://lists.apache.org/thread/tqyn1cro5ohl3c3fkjb1zvxbo03sofn7
> > [2]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage
> >
> > Best,
> > Jingsong
> >
>


Re: [DISCUSS] Improve the name and structure of job vertex and operator name for job

2021-11-11 Thread JING ZHANG
Big +1.

This is a problem we frequently encounter on our production platform; we
look forward to this improvement.

Best,
Jing Zhang

Martijn Visser  于2021年11月11日周四 下午6:26写道:

> +1. Looks much better now
>
> On Thu, 11 Nov 2021 at 11:07, godfrey he  wrote:
>
> > Thanks for driving this, this improvement solves a long-complained
> > problem, +1
> >
> > Best,
> > Godfrey
> >
> > Jark Wu  于2021年11月11日周四 下午5:40写道:
> > >
> > > +1 for this. It looks much more clear and structured.
> > >
> > > Best,
> > > Jark
> > >
> > > On Thu, 11 Nov 2021 at 17:23, Chesnay Schepler 
> > wrote:
> > >
> > > > I'm generally in favor of it, and there are already tickets that
> > > > proposed a dedicated operator/vertex description:
> > > >
> > > > https://issues.apache.org/jira/browse/FLINK-20388
> > > > https://issues.apache.org/jira/browse/FLINK-21858
> > > >
> > > > On 11/11/2021 10:02, wenlong.lwl wrote:
> > > > > Hi, all, I would like to start a discussion about an improvement on
> > name
> > > > > and structure of job vertex name, mainly to improve experience of
> > > > debugging
> > > > > and analyzing sql job at runtime.
> > > > >
> > > > > the main proposed changes including:
> > > > > 1. separate description and name for operator, so that we can have
> > > > detailed
> > > > > info at description and shorter name, which could be more friendly
> > for
> > > > > external systems like logging/metrics without losing useful
> > information.
> > > > > 2. introduce a tree-mode vertex description which can make the
> > > > description
> > > > > more readable and easier to understand
> > > > > 3. clean up and improve description for sql operator
> > > > >
> > > > > here is an example with the changes for a sql job:
> > > > >
> > > > > vertex name:
> > > > > GlobalGroupAggregate[52] -> (Calc[53] -> NotNullEnforcer[54] ->
> Sink:
> > > > > tb_ads_dwi_pub_hbd_spm_dtr_002_003[54], Calc[55] ->
> > NotNullEnforcer[56]
> > > > ->
> > > > > Sink: tb_ads_dwi_pub_hbd_spm_dtr_002_004[56])
> > > > > vertex description:
> > > > > [52]:GlobalGroupAggregate(groupBy=[stat_date, spm_url_ab, client],
> > > > > select=[stat_date, spm_url_ab, client, COUNT(count1$0) AS
> > > > > clk_cnt_app_mtr_001, COUNT(distinct$0 count$1) AS
> clk_uv_app_mtr_001,
> > > > > COUNT(count1$2) AS clk_cnt_app_mtr_002, COUNT(distinct$0 count$3)
> AS
> > > > > clk_uv_app_mtr_002, COUNT(count1$4) AS clk_cnt_app_mtr_003,
> > > > > COUNT(distinct$0 count$5) AS clk_uv_app_mtr_003]) :-
> > > > > [53]:Calc(select=[CASE((client <> ''), CONCAT_WS('\u0004',
> > > > > CONCAT(SUBSTRING(MD5(CONCAT(spm_url_ab, '12345')), 1, 4), ':md5'),
> > > > > CONCAT(spm_url_ab, ':spmab'), '12345:app', CONCAT(client,
> ':client'),
> > > > > CONCAT('ddd:', stat_date)), null:VARCHAR(2147483647)) AS rowkey,
> > > > > clk_cnt_app_mtr_001 AS clk_cnt_app_dtr_001, clk_uv_app_mtr_001 AS
> > > > > clk_uv_app_dtr_001, clk_cnt_app_mtr_002 AS clk_cnt_app_dtr_002,
> > > > > clk_uv_app_mtr_002 AS clk_uv_app_dtr_002, clk_cnt_app_mtr_003 AS
> > > > > clk_cnt_app_dtr_003, clk_uv_app_mtr_003 AS clk_uv_app_dtr_003]) :
> +-
> > > > > [54]:NotNullEnforcer(fields=[rowkey]) : +-
> > > > >
> > > >
> >
> [54]:Sink(table=[default_catalog.default_database.tb_ads_dwi_pub_hbd_spm_dtr_002_003],
> > > > > fields=[rowkey, clk_cnt_app_dtr_001, clk_uv_app_dtr_001,
> > > > > clk_cnt_app_dtr_002, clk_uv_app_dtr_002, clk_cnt_app_dtr_003,
> > > > > clk_uv_app_dtr_003]) +- [55]:Calc(select=[CASE((client <> ''),
> > > > > CONCAT_WS('\u0004', CONCAT(SUBSTRING(MD5(CONCAT(spm_url_ab,
> > '12345')), 1,
> > > > > 4), ':md5'), CONCAT(spm_url_ab, ':spmab'), '12345:app',
> > CONCAT('ddd:',
> > > > > stat_date), CONCAT(client, ':client')), (client = ''),
> > > > CONCAT_WS('\u0004',
> > > > > CONCAT(SUBSTRING(MD5(CONCAT(spm_url_ab, '92459')), 1, 4), ':md5'),
> > > > > CONCAT(spm_url_ab, ':spmab'), '92459:app', CONCAT('ddd:',
> > stat_date)),
> > > > > null:VARCHAR(2147483647)) AS rowkey, clk_cnt_app_mtr_001 AS
> > > > > clk_cnt_app_dtr_001, clk_uv_app_mtr_001 AS clk_uv_app_dtr_001,
> > > > > clk_cnt_app_mtr_002 AS clk_cnt_app_dtr_002, clk_uv_app_mtr_002 AS
> > > > > clk_uv_app_dtr_002, clk_cnt_app_mtr_003 AS clk_cnt_app_dtr_003,
> > > > > clk_uv_app_mtr_003 AS clk_uv_app_dtr_003]) +-
> > > > > [56]:NotNullEnforcer(fields=[rowkey]) +-
> > > > >
> > > >
> >
> [56]:Sink(table=[default_catalog.default_database.tb_ads_dwi_pub_hbd_spm_dtr_002_004],
> > > > > fields=[rowkey, clk_cnt_app_dtr_001, clk_uv_app_dtr_001,
> > > > > clk_cnt_app_dtr_002, clk_uv_app_dtr_002, clk_cnt_app_dtr_003,
> > > > > clk_uv_app_dtr_003])
> > > > >
> > > > > For more detail on the proposal:
> > > > >
> > > >
> >
> https://docs.google.com/document/d/1VUVJeHY_We09GY53-K2lETP3HUNZG9wMKyecFWk_Wxk
> > > > > <
> > > >
> >
> https://docs.google.com/document/d/1VUVJeHY_We09GY53-K2lETP3HUNZG9wMKyecFWk_Wxk/edit#
> > > > >
> > > > >
> > > > > Looking forward to your feedback, thanks.
> > > > >
> > > > > Bests
> > > > >
> > > > > Wenlong Lyu
> > > > >
> > > >
> > > >
> >
>


Re: [VOTE] FLIP-189: SQL Client Usability Improvements

2021-11-08 Thread JING ZHANG
+1 (non-binding)

Best,
JING ZHANG

Leonard Xu  于2021年11月9日周二 上午10:56写道:

> +1 (non-binding)
>
> Thanks for the Improvements.
>
> Best,
> Leonard
> > 在 2021年11月9日,10:08,Yuepeng Pan  写道:
> >
> > +1 (non-binding)Best,
> > Roc
> >
> > At 2021-11-09 09:13:58, "Jark Wu"  wrote:
> >> +1 (binding)
> >>
> >> Best,
> >> Jark
> >>
> >> On Mon, 8 Nov 2021 at 23:57, Till Rohrmann 
> wrote:
> >>
> >>> +1 (binding)
> >>>
> >>> Cheers,
> >>> Till
> >>>
> >>> On Fri, Nov 5, 2021 at 9:24 PM Ingo Bürk  wrote:
> >>>
> >>>> +1 (non-binding)
> >>>>
> >>>> I think it has all been said before. :-)
> >>>>
> >>>> On Fri, Nov 5, 2021, 17:45 Konstantin Knauf 
> wrote:
> >>>>
> >>>>> +1 (binding)
> >>>>>
> >>>>> On Fri, Nov 5, 2021 at 4:03 PM Martijn Visser  >
> >>>>> wrote:
> >>>>>
> >>>>>> +1 (non-binding). Looking forward!
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Martijn
> >>>>>>
> >>>>>> On Fri, 5 Nov 2021 at 15:36, Timo Walther 
> >>> wrote:
> >>>>>>
> >>>>>>> +1 (binding) thanks for working on this.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>>
> >>>>>>> On 05.11.21 10:14, Sergey Nuyanzin wrote:
> >>>>>>>> Also there is a short demo showing some of the features mentioned
> >>>> in
> >>>>>> this
> >>>>>>>> FLIP.
> >>>>>>>> It is available at https://asciinema.org/a/446247?speed=3.0 (It
> >>>> was
> >>>>>> also
> >>>>>>>> mentioned in [DISCUSS] thread)
> >>>>>>>>
> >>>>>>>> On Wed, Nov 3, 2021 at 11:04 PM Sergey Nuyanzin <
> >>>> snuyan...@gmail.com
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi everyone,
> >>>>>>>>>
> >>>>>>>>> I would like to start a vote on FLIP-189: SQL Client Usability
> >>>>>>>>> Improvements [1].
> >>>>>>>>> The FLIP was discussed in this thread [2].
> >>>>>>>>> FLIP-189 targets usability improvements of SQL Client such as
> >>>>> parsing
> >>>>>>>>> improvement,
> >>>>>>>>> syntax highlighting, completion, prompts
> >>>>>>>>>
> >>>>>>>>> The vote will be open for at least 72 hours unless there is an
> >>>>>> objection
> >>>>>>>>> or not enough votes.
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-189%3A+SQL+Client+Usability+Improvements
> >>>>>>>>> [2]
> >>>>> https://lists.apache.org/thread/8d580jcqzpcbmfwqvhjso82hdd2x0461
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Sergey
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Konstantin Knauf
> >>>>>
> >>>>> https://twitter.com/snntrable
> >>>>>
> >>>>> https://github.com/knaufk
> >>>>>
> >>>>
> >>>
>
>


Re: I want to contribute to Apache Flink

2021-11-05 Thread JING ZHANG
Hi shiyuquan,
Welcome to the community.
You do not need to apply for JIRA permissions. You can leave a comment
under a JIRA issue to indicate that you want to take it, and a committer
will assign it to you.
Link [1] is a good guide for new contributors. Hope it helps.

[1] https://flink.apache.org/contributing/how-to-contribute.html

Best,
JING ZHANG

shiyuquan  于2021年11月5日周五 下午7:22写道:

> Hi Guys,
>
>
> I want to contribute to Apache Flink.
> Would you please give me the permission as a contributor?  My JIRA ID is
> shiyuquan.
>
>
> Thanks!
>
>


[jira] [Created] (FLINK-24772) Update user document for individual window table-valued function

2021-11-04 Thread JING ZHANG (Jira)
JING ZHANG created FLINK-24772:
--

 Summary: Update user document for individual window table-valued 
function
 Key: FLINK-24772
 URL: https://issues.apache.org/jira/browse/FLINK-24772
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.15.0
Reporter: JING ZHANG






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24760) Update user document for batch window tvf

2021-11-04 Thread JING ZHANG (Jira)
JING ZHANG created FLINK-24760:
--

 Summary: Update user document for batch window tvf
 Key: FLINK-24760
 URL: https://issues.apache.org/jira/browse/FLINK-24760
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.15.0
Reporter: JING ZHANG






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] FLIP-189: SQL Client Usability Improvements

2021-11-01 Thread JING ZHANG
Amazing improvements and impressive video.
Big +1.

Best,
JING ZHANG

Kurt Young  于2021年11月2日周二 上午9:37写道:

> Really cool improvements @Sergey. Can't wait to see it happen.
>
> Best,
> Kurt
>
>
> On Tue, Nov 2, 2021 at 1:56 AM Martijn Visser 
> wrote:
>
> > Hi Sergey,
> >
> > I guess you've just set a new standard ;-) I agree with Ingo, these
> > improvements look really good!
> >
> > Best regards,
> >
> > Martijn
> >
> > On Mon, 1 Nov 2021 at 18:23, Ingo Bürk  wrote:
> >
> > > Hi Sergey,
> > >
> > > I think those improvements look absolutely amazing. Thanks for the
> little
> > > video!
> > >
> > >
> > > Best
> > > Ingo
> > >
> > > On Mon, Nov 1, 2021, 17:15 Sergey Nuyanzin 
> wrote:
> > >
> > > > Thanks for the feedback Till.
> > > >
> > > > Martijn, I have created a short demo showing some of the features
> > > mentioned
> > > > in FLIP.
> > > > It is available at https://asciinema.org/a/446247?speed=3.0
> > > > Could you please tell if it is what you are expecting or not?
> > > >
> > > > On Fri, Oct 29, 2021 at 4:59 PM Till Rohrmann 
> > > > wrote:
> > > >
> > > > > Thanks for creating this FLIP Sergey. I think what you propose
> sounds
> > > > like
> > > > > very good improvements for the SQL client. This should make the
> > client
> > > a
> > > > > lot more ergonomic :-)
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Oct 29, 2021 at 11:26 AM Sergey Nuyanzin <
> > snuyan...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Martijn,
> > > > > >
> > > > > > Thank you for your suggestion with POC.
> > > > > > Yes I will do that and come back to this thread probably after
> the
> > > > > weekend
> > > > > >
> > > > > > On Thu, Oct 28, 2021 at 4:38 PM Martijn Visser <
> > > mart...@ververica.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Sergey,
> > > > > > >
> > > > > > > Thanks for taking the initiative to create a FLIP and propose
> > > > > > improvements
> > > > > > > on the SQL client. All usability improvements on the SQL client
> > are
> > > > > > highly
> > > > > > > appreciated, especially for new users of Flink. Multi-line
> > support
> > > is
> > > > > > > definitely one of those things I've run into myself.
> > > > > > >
> > > > > > > I do think it would be quite nice if there would be some kind
> of
> > > POC
> > > > > > which
> > > > > > > could show (some of) the proposed improvements. Is that
> something
> > > > that
> > > > > > > might be easily feasible?
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Martijn
> > > > > > >
> > > > > > > On Thu, 28 Oct 2021 at 11:02, Sergey Nuyanzin <
> > snuyan...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I want to start a discussion about FLIP-189: SQL Client
> > Usability
> > > > > > > > Improvements.
> > > > > > > >
> > > > > > > > The main changes in this FLIP:
> > > > > > > >
> > > > > > > > - Flink sql client parser improvements so
> > > > > > > >that sql client does not ask for ; inside a quoted string
> > or a
> > > > > > comment
> > > > > > > > - use prompt to show what sql client is waiting for
> > > > > > > > - introduce syntax highlighting
> > > > > > > > - improve completion
> > > > > > > >
> > > > > > > > For more detailed changes, please refer to FLIP-189[1].
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-189%3A+SQL+Client+Usability+Improvements
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Look forward to your feedback.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Sergey
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Sergey
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Sergey
> > > >
> > >
> >
>


[jira] [Created] (FLINK-24708) `ConvertToNotInOrInRule` has a bug which leads to wrong result

2021-10-29 Thread JING ZHANG (Jira)
JING ZHANG created FLINK-24708:
--

 Summary: `ConvertToNotInOrInRule` has a bug which leads to wrong 
result
 Key: FLINK-24708
 URL: https://issues.apache.org/jira/browse/FLINK-24708
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Reporter: JING ZHANG


A user reported this bug on the mailing list; I paste the content here.

We are in the process of upgrading from Flink 1.9.3 to 1.13.3. We have noticed
that statements with UPPER(field) or LOWER(field) in the WHERE clause, in
combination with an IN, do not always evaluate correctly.

 

The following test case highlights this problem.

 


import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class TestCase {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        TestData testData = new TestData();
        testData.setField1("bcd");
        DataStream<TestData> stream = env.fromElements(testData);
        stream.print();  // To prevent 'No operators' error

        final StreamTableEnvironment tableEnvironment =
                StreamTableEnvironment.create(env);
        tableEnvironment.createTemporaryView("testTable", stream,
                Schema.newBuilder().build());

        // Fails because abcd is larger than abc
        tableEnvironment.executeSql("select *, '1' as run from testTable WHERE lower(field1) IN ('abcd', 'abc', 'bcd', 'cde')").print();
        // Succeeds because lower was removed
        tableEnvironment.executeSql("select *, '2' as run from testTable WHERE field1 IN ('abcd', 'abc', 'bcd', 'cde')").print();
        // These 4 succeed because the smallest literal is before abcd
        tableEnvironment.executeSql("select *, '3' as run from testTable WHERE lower(field1) IN ('abc', 'abcd', 'bcd', 'cde')").print();
        tableEnvironment.executeSql("select *, '4' as run from testTable WHERE lower(field1) IN ('abc', 'bcd', 'abhi', 'cde')").print();
        tableEnvironment.executeSql("select *, '5' as run from testTable WHERE lower(field1) IN ('cde', 'abcd', 'abc', 'bcd')").print();
        tableEnvironment.executeSql("select *, '6' as run from testTable WHERE lower(field1) IN ('cde', 'abc', 'abcd', 'bcd')").print();
        // Fails because smallest is not first
        tableEnvironment.executeSql("select *, '7' as run from testTable WHERE lower(field1) IN ('cdef', 'abce', 'abcd', 'ab', 'bcd')").print();
        // Succeeds
        tableEnvironment.executeSql("select *, '8' as run from testTable WHERE lower(field1) IN ('ab', 'cdef', 'abce', 'abcdefgh', 'bcd')").print();

        env.execute("TestCase");
    }

    public static class TestData {
        private String field1;

        public String getField1() {
            return field1;
        }

        public void setField1(String field1) {
            this.field1 = field1;
        }
    }
}

 

The job produces the following output:

Empty set

+----+--------+-----+
| op | field1 | run |
+----+--------+-----+
| +I |    bcd |   2 |
+----+--------+-----+
1 row in set

+----+--------+-----+
| op | field1 | run |
+----+--------+-----+
| +I |    bcd |   3 |
+----+--------+-----+
1 row in set

+----+--------+-----+
| op | field1 | run |
+----+--------+-----+
| +I |    bcd |   4 |
+----+--------+-----+
1 row in set

+----+--------+-----+
| op | field1 | run |
+----+--------+-----+
| +I |    bcd |   5 |
+----+--------+-----+
1 row in set

++
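For reference, the equivalent predicate evaluated in plain Java, independent
of Flink's planner, shows that every one of the eight IN lists contains
lower('bcd'), so all eight queries should return the row (the class name
InPredicateCheck is invented for this sketch):

```java
import java.util.Arrays;
import java.util.List;

public class InPredicateCheck {
    public static void main(String[] args) {
        String field1 = "bcd";
        // The eight IN lists from the test case above.
        List<List<String>> inLists = Arrays.asList(
                Arrays.asList("abcd", "abc", "bcd", "cde"),             // run 1
                Arrays.asList("abcd", "abc", "bcd", "cde"),             // run 2 (no lower())
                Arrays.asList("abc", "abcd", "bcd", "cde"),             // run 3
                Arrays.asList("abc", "bcd", "abhi", "cde"),             // run 4
                Arrays.asList("cde", "abcd", "abc", "bcd"),             // run 5
                Arrays.asList("cde", "abc", "abcd", "bcd"),             // run 6
                Arrays.asList("cdef", "abce", "abcd", "ab", "bcd"),     // run 7
                Arrays.asList("ab", "cdef", "abce", "abcdefgh", "bcd")  // run 8
        );
        for (int i = 0; i < inLists.size(); i++) {
            boolean match = inLists.get(i).contains(field1.toLowerCase());
            System.out.println("run " + (i + 1) + ": " + match); // prints true for every run
        }
    }
}
```

Since the plain-Java evaluation matches in all eight cases, any run that
returns an empty set in Flink points at the planner rule rather than the data.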

Re: [NOTICE] CiBot improvements

2021-10-29 Thread JING ZHANG
Thanks a lot @Chesnay, I will try again.

Chesnay Schepler  于2021年10月29日周五 下午4:21写道:

> When you 'run azure', check the build after a minute or so and you
> should see that the failed job is running again.
>
>
>
> On 29/10/2021 10:19, Chesnay Schepler wrote:
> > It's working just fine. You need to rebase your branch because on your
> > branch flink-python is simply broken.
> >
> > On 29/10/2021 08:57, JING ZHANG wrote:
> >> Hi Chesnay,
> >> I meet with something weird.
> >> I submit a pull request https://github.com/apache/flink/pull/17571,
> >> it trigger a ci
> >> (
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=25470=results
> >> <
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=25470=results>)
>
> >> which fails at first time because of python module failure.
> >> I try to run 'run azure' for two times, nothing happened which is
> >> different from the behavior in the past.
> >> I ask some other guys, they said they also meet with this problem.
> >> Please help me check whether it is expected, thanks a lot.
> >> image.png
> >>
> >> Best,
> >> JING ZHANG
> >>
> >> Yuepeng Pan  于2021年10月12日周二 下午8:08写道:
> >>
> >> Thank you for your effort!
> >> Best,
> >> Roc
> >>
> >> At 2021-10-12 14:07:18, "Arvid Heise"  wrote:
> >> >Awesome!
> >> >
> >> >On Tue, Oct 12, 2021 at 3:11 AM Guowei Ma 
> >> wrote:
> >> >
> >> >> Thanks for your effort!
> >> >>
> >> >> Best,
> >> >> Guowei
> >> >>
> >> >>
> >> >> On Mon, Oct 11, 2021 at 9:26 PM Stephan Ewen 
> >> wrote:
> >> >>
> >> >> > Great initiative, thanks for doing this!
> >> >> >
> >> >> > On Mon, Oct 11, 2021 at 10:52 AM Till Rohrmann
> >> 
> >> >> > wrote:
> >> >> >
> >> >> > > Thanks a lot for this effort Chesnay! The improvements
> >> sound really
> >> >> good.
> >> >> > >
> >> >> > > Cheers,
> >> >> > > Till
> >> >> > >
> >> >> > > On Mon, Oct 11, 2021 at 8:46 AM David Morávek
> >>  wrote:
> >> >> > >
> >> >> > > > Nice! Thanks for the effort Chesnay, this is really a
> >> huge step
> >> >> > forward!
> >> >> > > >
> >> >> > > > Best,
> >> >> > > > D.
> >> >> > > >
> >> >> > > > On Mon, Oct 11, 2021 at 6:02 AM Xintong Song
> >> 
> >> >> > > > wrote:
> >> >> > > >
> >> >> > > > > Thanks for the effort, @Chesnay. This is super helpful.
> >> >> > > > >
> >> >> > > > > @Jing,
> >> >> > > > > Every push to the PR branch should automatically
> >> trigger an entire
> >> >> > new
> >> >> > > > > build. `@flinkbot run azure` should only be used when
> >> you want to
> >> >> > > re-run
> >> >> > > > > the failed stages without changing the PR. E.g., when
> >> running into
> >> >> > > known
> >> >> > > > > unstable cases that are unrelated to the PR.
> >> >> > > > >
> >> >> > > > > Thank you~
> >> >> > > > >
> >> >> > > > > Xintong Song
> >> >> > > > >
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Mon, Oct 11, 2021 at 11:45 AM JING ZHANG
> >> 
> >> >> > > > wrote:
> >> >> > > > >
> >> >> > > > > > Hi Chesnay,
> >> >> > > > > > Thanks for the effort. It is a very useful improvement.
> >> >> > > > > > I have a minor question. Please forgive me if the
>

Re: Re: [NOTICE] CiBot improvements

2021-10-29 Thread JING ZHANG
Hi Chesnay,
I ran into something weird.
I submitted a pull request https://github.com/apache/flink/pull/17571, which
triggered a CI build (
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=25470=results)
that failed the first time because of a Python module failure.
I tried '@flinkbot run azure' twice, but nothing happened, which is different
from the behavior in the past.
I asked some other people, and they said they also hit this problem.
Please help me check whether this is expected. Thanks a lot.
[image: image.png]

Best,
JING ZHANG

Yuepeng Pan  于2021年10月12日周二 下午8:08写道:

> Thank you for your effort!
> Best,
> Roc
>
> At 2021-10-12 14:07:18, "Arvid Heise"  wrote:
> >Awesome!
> >
> >On Tue, Oct 12, 2021 at 3:11 AM Guowei Ma  wrote:
> >
> >> Thanks for your effort!
> >>
> >> Best,
> >> Guowei
> >>
> >>
> >> On Mon, Oct 11, 2021 at 9:26 PM Stephan Ewen  wrote:
> >>
> >> > Great initiative, thanks for doing this!
> >> >
> >> > On Mon, Oct 11, 2021 at 10:52 AM Till Rohrmann 
> >> > wrote:
> >> >
> >> > > Thanks a lot for this effort Chesnay! The improvements sound really
> >> good.
> >> > >
> >> > > Cheers,
> >> > > Till
> >> > >
> >> > > On Mon, Oct 11, 2021 at 8:46 AM David Morávek 
> wrote:
> >> > >
> >> > > > Nice! Thanks for the effort Chesnay, this is really a huge step
> >> > forward!
> >> > > >
> >> > > > Best,
> >> > > > D.
> >> > > >
> >> > > > On Mon, Oct 11, 2021 at 6:02 AM Xintong Song <
> tonysong...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Thanks for the effort, @Chesnay. This is super helpful.
> >> > > > >
> >> > > > > @Jing,
> >> > > > > Every push to the PR branch should automatically trigger an
> entire
> >> > new
> >> > > > > build. `@flinkbot run azure` should only be used when you want
> to
> >> > > re-run
> >> > > > > the failed stages without changing the PR. E.g., when running
> into
> >> > > known
> >> > > > > unstable cases that are unrelated to the PR.
> >> > > > >
> >> > > > > Thank you~
> >> > > > >
> >> > > > > Xintong Song
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Mon, Oct 11, 2021 at 11:45 AM JING ZHANG <
> beyond1...@gmail.com>
> >> > > > wrote:
> >> > > > >
> >> > > > > > Hi Chesnay,
> >> > > > > > Thanks for the effort. It is a very useful improvement.
> >> > > > > > I have a minor question. Please forgive me if the question is
> too
> >> > > > naive.
> >> > > > > > Since '@flinkbot run azure' now behaves like "Rerun failed
> jobs",
> >> > is
> >> > > > > there
> >> > > > > > any way to trigger an entirely new build? Because some times
> I'm
> >> > not
> >> > > > sure
> >> > > > > > the passed cases in the last build could still success in the
> new
> >> > > build
> >> > > > > > because of introduced updates in new commit.
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > JING ZHANG
> >> > > > > >
> >> > > > > >
> >> > > > > > Yangze Guo  于2021年10月11日周一 上午10:31写道:
> >> > > > > >
> >> > > > > > > Thanks for that great job, Chesnay! "Rerun failed jobs" will
> >> > help a
> >> > > > > lot.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Yangze Guo
> >> > > > > > >
> >> > > > > > > On Sun, Oct 10, 2021 at 4:56 PM Chesnay Schepler <
> >> > > ches...@apache.org
> >> > > > >
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > I made a number of changes to the CiBot over the weekend.
> >> > > > > > > >
> >> > > > > > > > - '@flinkbot run azure' previously triggered an entirely
> new
> >> > > build
> >> > > > > > based
> >> > > > > > > > on the last completed one. It now instead retries the last
> >> > > > completed
> >> > > > > > > > build, only running the jobs that actually failed. It
> >> basically
> >> > > > > behaves
> >> > > > > > > > like the "Rerun failed jobs" button in the Azure UI.
> >> > > > > > > > - various optimizations to increase responsiveness
> (primarily
> >> > by
> >> > > > > doing
> >> > > > > > > > significantly less unnecessary work / requests to GH)
> >> > > > > > > > - removed TravisCI support (since we no longer support a
> >> > release
> >> > > > that
> >> > > > > > > > used Travis)
> >> > > > > > > >
> >> > > > > > > > Please ping me if you spot anything weird.
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>


[jira] [Created] (FLINK-24656) Add user document for Window Deduplication

2021-10-26 Thread JING ZHANG (Jira)
JING ZHANG created FLINK-24656:
--

 Summary: Add user document for Window Deduplication
 Key: FLINK-24656
 URL: https://issues.apache.org/jira/browse/FLINK-24656
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.15.0
Reporter: JING ZHANG
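
For context, Window Deduplication keeps only the first or last row per key
within each window. A minimal sketch of the pattern in Flink SQL (table and
column names here are hypothetical, not taken from the ticket):

```sql
-- Keep only the latest bid per item within each 10-minute tumbling window.
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY window_start, window_end, item
           ORDER BY event_time DESC) AS rownum
  FROM TABLE(
    TUMBLE(TABLE bids, DESCRIPTOR(event_time), INTERVAL '10' MINUTES))
)
WHERE rownum <= 1;
```

Ordering by `event_time ASC` instead would keep the earliest row per key in
each window.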

--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [ANNOUNCE] Apache Flink 1.13.3 released

2021-10-21 Thread JING ZHANG
Thanks to Chesnay, Martijn and every contributor for making this happen!


Thomas Weise  于2021年10月22日周五 上午12:15写道:

> Thanks for making the release happen!
>
> On Thu, Oct 21, 2021 at 5:54 AM Leonard Xu  wrote:
> >
> > Thanks to Chesnay & Martijn and everyone who made this release happen.
> >
> >
> > > 在 2021年10月21日,20:08,Martijn Visser  写道:
> > >
> > > Thank you Chesnay, Leonard and all contributors!
> > >
> > > On Thu, 21 Oct 2021 at 13:40, Jingsong Li  > wrote:
> > > Thanks, Chesnay & Martijn
> > >
> > > 1.13.3 really solves many problems.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Thu, Oct 21, 2021 at 6:46 PM Konstantin Knauf  > wrote:
> > > >
> > > > Thank you, Chesnay & Martijn, for managing this release!
> > > >
> > > > On Thu, Oct 21, 2021 at 10:29 AM Chesnay Schepler <
> ches...@apache.org >
> > > > wrote:
> > > >
> > > > > The Apache Flink community is very happy to announce the release of
> > > > > Apache Flink 1.13.3, which is the third bugfix release for the
> Apache
> > > > > Flink 1.13 series.
> > > > >
> > > > > Apache Flink® is an open-source stream processing framework for
> > > > > distributed, high-performing, always-available, and accurate data
> > > > > streaming applications.
> > > > >
> > > > > The release is available for download at:
> > > > > https://flink.apache.org/downloads.html <
> https://flink.apache.org/downloads.html>
> > > > >
> > > > > Please check out the release blog post for an overview of the
> > > > > improvements for this bugfix release:
> > > > > https://flink.apache.org/news/2021/10/19/release-1.13.3.html <
> https://flink.apache.org/news/2021/10/19/release-1.13.3.html>
> > > > >
> > > > > The full release notes are available in Jira:
> > > > >
> > > > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12350329
> <
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12350329
> >
> > > > >
> > > > > We would like to thank all contributors of the Apache Flink
> community
> > > > > who made this release possible!
> > > > >
> > > > > Regards,
> > > > > Chesnay
> > > > >
> > > > >
> > > >
> > > > --
> > > >
> > > > Konstantin Knauf
> > > >
> > > > https://twitter.com/snntrable 
> > > >
> > > > https://github.com/knaufk 
> > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> >
>


Re: Multiple View Creation support

2021-10-19 Thread JING ZHANG
Hi,
I'm not sure I understand your question. Are you looking for a way to
define multiple views in SQL? If so, you can define multiple views by
issuing a separate CREATE VIEW statement for each one; the CREATE VIEW
syntax is documented in [1].
Please let me know if I misunderstood your requirement.

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#create-view
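
For example, a minimal sketch (table and view names are hypothetical):

```sql
-- Each view is created with its own CREATE VIEW statement;
-- later views can build on earlier ones.
CREATE VIEW filtered_orders AS
  SELECT order_id, customer_id, amount
  FROM orders              -- hypothetical source table
  WHERE amount > 0;

CREATE VIEW customer_totals AS
  SELECT customer_id, SUM(amount) AS total_amount
  FROM filtered_orders     -- builds on the first view
  GROUP BY customer_id;
```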

Hugar, Mahesh  于2021年10月19日周二 下午4:02写道:

> Hi,
> I went through the Flink documents but could not find out how to create
> multiple views in Flink. Please help me with the implementation.
> Thanks in advance.
> Regards,
> Mahesh Kumar GH
>


Re: [VOTE] Release 1.13.3, release candidate #1

2021-10-18 Thread JING ZHANG
Thanks Chesnay for driving this.

+1 (non-binding)

- built from source code flink-1.13.3-src.tgz
<https://dist.apache.org/repos/dist/dev/flink/flink-1.13.3-rc1/flink-1.13.3-src.tgz>
succeeded
- started a standalone Flink cluster, ran the WordCount example; the WebUI
looks good, no suspicious output/logs.
- started cluster and run some e2e sql queries using SQL Client, query
result is expected
- repeat step 2 and step 3 with flink-1.13.3-bin-scala_2.11.tgz
<https://dist.apache.org/repos/dist/dev/flink/flink-1.13.3-rc1/flink-1.13.3-bin-scala_2.11.tgz>
- repeat step 2 and step 3 with flink-1.13.3-bin-scala_2.12.tgz
<https://dist.apache.org/repos/dist/dev/flink/flink-1.13.3-rc1/flink-1.13.3-bin-scala_2.12.tgz>

Best,
JING ZHANG
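
The checksum part of the verification steps above can be sketched as shell
commands. This is a minimal sketch: a dummy file stands in for the real
artifact, whereas for the RC you would download the source tarball and its
`.sha512` file from the staging area linked in the vote email and run the
same check.

```shell
# Sketch of release-artifact checksum verification.
# 'artifact.tgz' is a dummy stand-in for e.g. flink-1.13.3-src.tgz.
echo "release artifact contents" > artifact.tgz

# Generate the expected checksum (for a real RC this .sha512 file is
# published alongside the artifact in the staging area).
sha512sum artifact.tgz > artifact.tgz.sha512

# Verify: prints 'artifact.tgz: OK' on success, exits non-zero on mismatch.
sha512sum -c artifact.tgz.sha512
```

The same pattern applies to the binary convenience tarballs (e.g. the
scala_2.11 and scala_2.12 builds) mentioned in the thread.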

Matthias Pohl  于2021年10月15日周五 下午10:07写道:

> Thanks Chesnay for driving this.
>
> +1 (non-binding)
>
> - verified the checksums
> - build 1.13.3-rc1 from sources
> - went over the pom file diff to see whether we missed newly added
> dependency in the NOTICE file
> - went over the release blog post
> - checked that scala 2.11 and 2.12 artifacts are present in the Maven repo
> - Run example jobs without noticing any issues in the logs
> - Triggered e2e test run on VVP based on 1.13.3 RC1
>
> On Tue, Oct 12, 2021 at 7:22 PM Chesnay Schepler 
> wrote:
>
> > Hi everyone,
> > Please review and vote on the release candidate #1 for the version
> > 1.13.3, as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> > be deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint C2EED7B111D464BA [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-1.13.3-rc1" [5],
> > * website pull request listing the new release and adding announcement
> > blog post [6].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12350329
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.13.3-rc1/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1453
> > [5] https://github.com/apache/flink/tree/release-1.13.3-rc1
> > [6] https://github.com/apache/flink-web/pull/473
> >
>


  1   2   >