Re: [DISCUSS] Flink client api enhancement for downstream project

2019-08-19 Thread Zili Chen
Hi Aljoscha,

Thanks for your reply and participation. The Google Doc you linked to
requires permission; I think you could use a shareable link instead.

I agree that we have almost reached a consensus that a JobClient is necessary
to interact with a running job.

Let me check your open questions one by one.

1. Separate cluster creation and job submission for per-job mode.

As you mentioned, this is where opinions diverge. My document contains an
alternative [2] that proposes excluding per-job deployment from the client API
scope, and I now find it more reasonable to do that exclusion.

In per-job mode, a dedicated JobCluster is launched to execute the specific
job. This resembles a Flink application more than a submission of a Flink job.
The client then only takes care of job submission and assumes there is
an existing cluster. In this way we are able to consider per-job issues
individually, and JobClusterEntrypoint would be the utility class for per-job
deployment.

Nevertheless, a user program works in both session mode and per-job mode
without any need to change code. The JobClient in per-job mode is returned
from env.execute as usual. However, it would no longer be a wrapper of
RestClusterClient but a wrapper of PerJobClusterClient, which communicates
with the Dispatcher locally.

2. How to deal with plan preview.

With an env.compile function, users can get the JobGraph or FlinkPlan and thus
preview the plan programmatically. Typically it looks like:

if (previewConfigured) {
    FlinkPlan plan = env.compile();
    new JSONDumpGenerator(...).dump(plan);
} else {
    env.execute();
}

And `flink info` would no longer be valid.

3. How to deal with Jar Submission at the Web Frontend.

There is another thread discussing this topic [1]. Apart from removing
the functionality, there are two alternatives.

One is to introduce an interface with a method that returns the
JobGraph/FlinkPlan, and have Jar Submission only support main classes that
implement this interface. The JobGraph/FlinkPlan would then be extracted
simply by calling that method. In this way, it is even possible to consider
a separation of job creation and job submission.
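A minimal sketch of that first alternative. All names here (JobGraphProvider, the stub JobGraph, WordCountProgram) are illustrative assumptions, not part of any agreed Flink API:

```java
// Sketch only: JobGraphProvider, the stub JobGraph, and WordCountProgram
// are hypothetical names, not an agreed Flink API.
public class JarSubmissionSketch {

    static class JobGraph {
        final String name;
        JobGraph(String name) { this.name = name; }
    }

    /** Jar Submission would only accept main classes implementing this,
     *  so the WebFrontend never needs to run main() or hijack execute(). */
    interface JobGraphProvider {
        JobGraph getJobGraph(String[] args);
    }

    static class WordCountProgram implements JobGraphProvider {
        @Override
        public JobGraph getJobGraph(String[] args) {
            // Build the graph; submission happens elsewhere, which is what
            // makes separating job creation from job submission possible.
            return new JobGraph("word-count");
        }
    }

    public static void main(String[] args) {
        // WebMonitor side: extract the graph just by calling the method.
        JobGraph graph = new WordCountProgram().getJobGraph(args);
        System.out.println(graph.name);
    }
}
```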

The other is, as you mentioned, to let execute() do the actual execution.
We would not run the main method in the WebFrontend but spawn a process
on the WebMonitor side to execute it. For the return value, we could generate
the JobID on the WebMonitor side and pass it to the execution environment.

4. How to deal with detached mode.

I think detached mode is a temporary solution for non-blocking submission.
In my document, both submission and execution return a CompletableFuture, and
users control whether or not to wait for the result. From this point of view,
we don't need a detached option, but its functionality is still covered.
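A rough sketch of how a future-based API covers detached behavior. The JobClient and JobResult names are assumptions, stubbed here so the example runs standalone:

```java
import java.util.concurrent.CompletableFuture;

// JobClient/JobResult are stubbed assumptions for illustration; the point
// is only the shape: submission returns a future, and the caller decides
// whether to block on it.
public class DetachedModeSketch {

    interface JobResult {}

    interface JobClient {
        CompletableFuture<JobResult> getJobResult();
    }

    static CompletableFuture<JobClient> submitJob() {
        // A real implementation would complete this future asynchronously
        // once the cluster acknowledges the submission.
        return CompletableFuture.completedFuture(
                () -> CompletableFuture.completedFuture(new JobResult() {}));
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<JobClient> submission = submitJob();

        // "Detached" behavior: don't wait, just react when it completes.
        submission.thenAccept(client -> System.out.println("submitted"));

        // "Attached" behavior: explicitly block until the job finishes.
        JobResult result = submission.get().getJobResult().get();
        System.out.println("finished: " + (result != null));
    }
}
```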

5. How does per-job mode interact with interactive programming.

All of the YARN, Mesos, and Kubernetes scenarios now follow the pattern of
launching a JobCluster, and I don't think there would be any inconsistency
between the different resource management frameworks.

Best,
tison.

[1]
https://lists.apache.org/x/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
[2]
https://docs.google.com/document/d/1UWJE7eYWiMuZewBKS0YmdVO2LUTqXPd6-pbOCof9ddY/edit?disco=DZaGGfs

Aljoscha Krettek  于2019年8月16日周五 下午9:20写道:

> Hi,
>
> I read both Jeffs initial design document and the newer document by Tison.
> I also finally found the time to collect our thoughts on the issue, I had
> quite some discussions with Kostas and this is the result: [1].
>
> I think overall we agree that this part of the code is in dire need of
> some refactoring/improvements but I think there are still some open
> questions and some differences in opinion what those refactorings should
> look like.
>
> I think the API-side is quite clear, i.e. we need some JobClient API that
> allows interacting with a running Job. It could be worthwhile to spin that
> off into a separate FLIP because we can probably find consensus on that
> part more easily.
>
> For the rest, the main open questions from our doc are these:
>
>   - Do we want to separate cluster creation and job submission for per-job
> mode? In the past, there were conscious efforts to *not* separate job
> submission from cluster creation for per-job clusters for Mesos, YARN,
> Kubernetes (see StandaloneJobClusterEntryPoint). Tison suggests in his
> design document to decouple this in order to unify job submission.
>
>   - How to deal with plan preview, which needs to hijack execute() and let
> the outside code catch an exception?
>
>   - How to deal with Jar Submission at the Web Frontend, which needs to
> hijack execute() and let the outside code catch an exception?
> CliFrontend.run() “hijacks” ExecutionEnvironment.execute() to get a
> JobGraph and then execute that JobGraph manually. We could get around that
> by letting execute() do the actual execution. One caveat for this is that
> now the main() method doesn’t return (or is forced to return by throwing an
> exception from execute()) which means that for Jar 

Re: [VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-19 Thread Zili Chen
+1 (non-binding)

- build from source: OK(8u212)
- check local setup tutorial works as expected

Best,
tison.


Yu Li  于2019年8月20日周二 上午8:24写道:

> +1 (non-binding)
>
> - checked release notes: OK
> - checked sums and signatures: OK
> - repository appears to contain all expected artifacts
> - source release
>  - contains no binaries: OK
>  - contains no 1.9-SNAPSHOT references: OK
>  - build from source: OK (8u102)
> - binary release
>  - no examples appear to be missing
>  - started a cluster; WebUI reachable, example ran successfully
> - checked README.md file and found nothing unexpected
>
> Best Regards,
> Yu
>
>
> On Tue, 20 Aug 2019 at 01:16, Tzu-Li (Gordon) Tai 
> wrote:
>
> > Hi all,
> >
> > Release candidate #3 for Apache Flink 1.9.0 is now ready for your review.
> >
> > Please review and vote on release candidate #3 for version 1.9.0, as
> > follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint 1C1E2394D3194E1944613488F320986D35C33D6A [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag “release-1.9.0-rc3” [5].
> > * pull requests for the release note documentation [6] and announcement
> > blog post [7].
> >
> > As proposed in the RC2 vote thread [8], for RC3 we are only
> cherry-picking
> > minimal specific changes on top of RC2 to be able to reasonably carry
> over
> > previous testing efforts and effectively require a shorter voting time.
> >
> > The only extra commits in this RC, compared to RC2, are the following:
> > - c2d9aeac [FLINK-13231] [pubsub] Replace Max outstanding acknowledgement
> > ids limit with a FlinkConnectorRateLimiter
> > - d8941711 [FLINK-13699][table-api] Fix TableFactory doesn’t work with
> DDL
> > when containing TIMESTAMP/DATE/TIME types
> > - 04e95278 [FLINK-13752] Only references necessary variables when
> > bookkeeping result partitions on TM
> >
> > Due to the minimal set of changes, the vote for RC3 will be *open for
> only
> > 48 hours*.
> > Please cast your votes before *Aug. 21st (Wed.) 2019, 17:00 PM CET*.
> >
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > Thanks,
> > Gordon
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12344601
> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.9.0-rc3/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapacheflink-1236
> > [5]
> >
> >
> https://gitbox.apache.org/repos/asf?p=flink.git;a=tag;h=refs/tags/release-1.9.0-rc3
> > [6] https://github.com/apache/flink/pull/9438
> > [7] https://github.com/apache/flink-web/pull/244
> > [8]
> >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Apache-Flink-Release-1-9-0-release-candidate-2-tp31542p31933.html
> >
>


Re: [DISCUSS] FLIP-56: Dynamic Slot Allocation

2019-08-19 Thread Zili Chen
We suddenly skipped FLIP-55 lol.


Xintong Song  于2019年8月19日周一 下午10:23写道:

> Hi everyone,
>
> We would like to start a discussion thread on "FLIP-56: Dynamic Slot
> Allocation" [1]. This is originally part of the discussion thread for
> "FLIP-53: Fine Grained Resource Management" [2]. As Till suggested, we
> would like to split the original discussion into two topics, and start a
> separate new discussion thread as well as FLIP process for this one.
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation
>
> [2]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-53-Fine-Grained-Resource-Management-td31831.html
>


Re: [VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-19 Thread Yu Li
+1 (non-binding)

- checked release notes: OK
- checked sums and signatures: OK
- repository appears to contain all expected artifacts
- source release
 - contains no binaries: OK
 - contains no 1.9-SNAPSHOT references: OK
 - build from source: OK (8u102)
- binary release
 - no examples appear to be missing
 - started a cluster; WebUI reachable, example ran successfully
- checked README.md file and found nothing unexpected

Best Regards,
Yu


On Tue, 20 Aug 2019 at 01:16, Tzu-Li (Gordon) Tai 
wrote:

> Hi all,
>
> Release candidate #3 for Apache Flink 1.9.0 is now ready for your review.
>
> Please review and vote on release candidate #3 for version 1.9.0, as
> follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 1C1E2394D3194E1944613488F320986D35C33D6A [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag “release-1.9.0-rc3” [5].
> * pull requests for the release note documentation [6] and announcement
> blog post [7].
>
> As proposed in the RC2 vote thread [8], for RC3 we are only cherry-picking
> minimal specific changes on top of RC2 to be able to reasonably carry over
> previous testing efforts and effectively require a shorter voting time.
>
> The only extra commits in this RC, compared to RC2, are the following:
> - c2d9aeac [FLINK-13231] [pubsub] Replace Max outstanding acknowledgement
> ids limit with a FlinkConnectorRateLimiter
> - d8941711 [FLINK-13699][table-api] Fix TableFactory doesn’t work with DDL
> when containing TIMESTAMP/DATE/TIME types
> - 04e95278 [FLINK-13752] Only references necessary variables when
> bookkeeping result partitions on TM
>
> Due to the minimal set of changes, the vote for RC3 will be *open for only
> 48 hours*.
> Please cast your votes before *Aug. 21st (Wed.) 2019, 17:00 PM CET*.
>
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Gordon
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12344601
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.9.0-rc3/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1236
> [5]
>
> https://gitbox.apache.org/repos/asf?p=flink.git;a=tag;h=refs/tags/release-1.9.0-rc3
> [6] https://github.com/apache/flink/pull/9438
> [7] https://github.com/apache/flink-web/pull/244
> [8]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Apache-Flink-Release-1-9-0-release-candidate-2-tp31542p31933.html
>


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Tzu-Li (Gordon) Tai
Vote thread for RC3 has been started:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Apache-Flink-1-9-0-release-candidate-3-td31988.html

On Mon, Aug 19, 2019 at 6:32 PM Tzu-Li (Gordon) Tai 
wrote:

> Thanks for the comments and fast fixes.
>
> @Becket Qin  I've quickly looked at the changes to
> the PubSub connector. Given that it is an API-breaking change and is quite
> local as a configuration change, I've decided to include that change in RC3.
> @Jark @Timo Walther  I'll be adding FLINK-13699 as
> well.
>
> Quick update regarding the LICENSE issue with flink-runtime-web: I've
> double-checked this and the licenses for the new bundled JavaScript
> dependencies are actually already correctly present under the root
> licenses-binary/ directory, so we actually don't need additional changes
> for this.
>
> I've started to create RC3 now, will post the vote as soon as it is ready.
>
> Cheers,
> Gordon
>
> On Mon, Aug 19, 2019 at 3:28 PM Stephan Ewen  wrote:
>
>> Looking at FLINK-13699, it seems to be very local to Table API and HBase
>> connector.
>> We can cherry-pick that without re-running distributed tests.
>>
>>
>> On Mon, Aug 19, 2019 at 1:46 PM Till Rohrmann 
>> wrote:
>>
>> > I've merged the fix for FLINK-13752. Hence we are good to go to create
>> the
>> > new RC.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Mon, Aug 19, 2019 at 1:30 PM Timo Walther 
>> wrote:
>> >
>> > > I support Jark's fix for FLINK-13699 because it would be disappointing
>> > > if both DDL and connectors are ready to handle DATE/TIME/TIMESTAMP
>> but a
>> > > little component in the middle of the stack is preventing an otherwise
>> > > usable feature. The changes are minor.
>> > >
>> > > Thanks,
>> > > Timo
>> > >
>> > >
>> > > Am 19.08.19 um 13:24 schrieb Jark Wu:
>> > > > Hi Gordon,
>> > > >
>> > > > I agree that we should pick the minimal set of changes to shorten
>> the
>> > > > release testing time.
>> > > > However, I would like to include FLINK-13699 into RC3. FLINK-13699
>> is a
>> > > > critical DDL issue, and is a small change to flink table (won't
>> affect
>> > > the
>> > > > runtime feature and stability).
>> > > > I will do some tests around sql and blink planner if the RC3 include
>> > this
>> > > > fix.
>> > > >
>> > > > But if the community against to include it, I'm also fine with
>> having
>> > it
>> > > in
>> > > > the next minor release.
>> > > >
>> > > > Thanks,
>> > > > Jark
>> > > >
>> > > > On Mon, 19 Aug 2019 at 16:16, Stephan Ewen 
>> wrote:
>> > > >
>> > > >> +1 for Gordon's approach.
>> > > >>
>> > > >> If we do that, we can probably skip re-testing everything and
>> mainly
>> > > need
>> > > >> to verify the release artifacts (signatures, build from source,
>> etc.).
>> > > >>
>> > > >> If we open the RC up for changes, I fear a lot of small issues will
>> > > rush in
>> > > >> and destabilize the candidate again, meaning we have to do another
>> > > larger
>> > > >> testing effort.
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Mon, Aug 19, 2019 at 9:48 AM Becket Qin 
>> > > wrote:
>> > > >>
>> > > >>> Hi Gordon,
>> > > >>>
>> > > >>> I remember we mentioned earlier that if there is an additional
>> RC, we
>> > > can
>> > > >>> piggyback the GCP PubSub API change (
>> > > >>> https://issues.apache.org/jira/browse/FLINK-13231). It is a small
>> > > patch
>> > > >> to
>> > > >>> avoid future API change. So should be able to merge it very
>> shortly.
>> > > >> Would
>> > > >>> it be possible to include that into RC3 as well?
>> > > >>>
>> > > >>> Thanks,
>> > > >>>
>> > > >>> Jiangjie (Becket) Qin
>> > > >>>
>> > > >>> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai <
>> > > tzuli...@apache.org
>> > > >>>
>> > > >>> wrote:
>> > > >>>
>> > >  Hi,
>> > > 
>> > >  https://issues.apache.org/jira/browse/FLINK-13752 turns out to
>> be
>> > an
>> > >  actual
>> > >  blocker, so we would have to close this RC now in favor of a new
>> > one.
>> > > 
>> > >  Since we are already quite past the planned release time for
>> 1.9.0,
>> > I
>> > > >>> would
>> > >  like to limit the new changes included in RC3 to only the
>> following:
>> > >  - https://issues.apache.org/jira/browse/FLINK-13752
>> > >  - Fix license and notice file issues that Kurt had found with
>> > >  flink-runtime-web and flink-state-processing-api
>> > > 
>> > >  This means that I will not be creating RC3 with the release-1.9
>> > branch
>> > > >> as
>> > >  is, but essentially only cherry-picking the above mentioned
>> changes
>> > on
>> > > >>> top
>> > >  of RC2.
>> > >  The minimal set of changes on top of RC2 should allow us to carry
>> > most
>> > > >> if
>> > >  not all of the already existing votes without another round of
>> > > >> extensive
>> > >  testing, and allow us to have a shortened voting time.
>> > > 
>> > >  I understand that there are other issues mentioned in this thread
>> > that
>> > > >>> are
>> > >  already spotted and 

[VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-19 Thread Tzu-Li (Gordon) Tai
Hi all,

Release candidate #3 for Apache Flink 1.9.0 is now ready for your review.

Please review and vote on release candidate #3 for version 1.9.0, as
follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint 1C1E2394D3194E1944613488F320986D35C33D6A [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag “release-1.9.0-rc3” [5].
* pull requests for the release note documentation [6] and announcement
blog post [7].

As proposed in the RC2 vote thread [8], for RC3 we are only cherry-picking
minimal specific changes on top of RC2 to be able to reasonably carry over
previous testing efforts and effectively require a shorter voting time.

The only extra commits in this RC, compared to RC2, are the following:
- c2d9aeac [FLINK-13231] [pubsub] Replace Max outstanding acknowledgement
ids limit with a FlinkConnectorRateLimiter
- d8941711 [FLINK-13699][table-api] Fix TableFactory doesn’t work with DDL
when containing TIMESTAMP/DATE/TIME types
- 04e95278 [FLINK-13752] Only references necessary variables when
bookkeeping result partitions on TM

Due to the minimal set of changes, the vote for RC3 will be *open for only
48 hours*.
Please cast your votes before *Aug. 21st (Wed.) 2019, 17:00 PM CET*.

It is adopted by majority approval, with at least 3 PMC affirmative votes.

Thanks,
Gordon

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12344601
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.9.0-rc3/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1236
[5]
https://gitbox.apache.org/repos/asf?p=flink.git;a=tag;h=refs/tags/release-1.9.0-rc3
[6] https://github.com/apache/flink/pull/9438
[7] https://github.com/apache/flink-web/pull/244
[8]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-Apache-Flink-Release-1-9-0-release-candidate-2-tp31542p31933.html


[jira] [Created] (FLINK-13791) Speed up sidenav by using group_by

2019-08-19 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-13791:
---

 Summary: Speed up sidenav by using group_by
 Key: FLINK-13791
 URL: https://issues.apache.org/jira/browse/FLINK-13791
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: Nico Kruber
Assignee: Nico Kruber


{{_includes/sidenav.html}} parses through {{pages_by_language}} over and over 
again trying to find children when building the (recursive) side navigation. We 
could do this once with a {{group_by}} instead.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Cwiki edit access

2019-08-19 Thread Thomas Weise
Thanks!


On Mon, Aug 19, 2019 at 1:19 AM Till Rohrmann  wrote:

> Hi Thomas,
>
> I've given you access. You should be able to access it now with your Apache
> account. Please let me know if something is not working.
>
> Cheers,
> Till
>
> On Mon, Aug 19, 2019 at 6:15 AM Thomas Weise  wrote:
>
> > Hi,
> >
> > I would like to be able to edit pages in the Confluence Flink space. Can
> > someone give me access please?
> >
> > Thanks
> >
>


[jira] [Created] (FLINK-13790) Support -e option with a sql script file as input

2019-08-19 Thread Bowen Li (Jira)
Bowen Li created FLINK-13790:


 Summary: Support -e option with a sql script file as input
 Key: FLINK-13790
 URL: https://issues.apache.org/jira/browse/FLINK-13790
 Project: Flink
  Issue Type: Sub-task
  Components: Command Line Client
Reporter: Bowen Li
Assignee: Zhenghua Gao
 Fix For: 1.10.0


We expect users to run SQL directly on the command line. Something like:
sql-client embedded -f "query in string", which will execute the given file
without entering interactive mode.

This is related to FLINK-12828.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [VOTE] Flink Project Bylaws

2019-08-19 Thread Henry Saputra
One of the perks of being a committer is being able to commit code without
asking another committer. Having said that, I think we rely on the
maturity of committers to know when to ask for reviews and when to
commit directly.

For example, if someone just changes typos in comments or does a simple rename
of internal variables, I think we can trust the committer to safely commit
the changes. When the changes alter existing flows of the code or introduce
new ones, that's when reviews are needed and strongly encouraged. I think
a balance is needed here.

PMCs have the ability and right to revert changes in source repo as
necessary.

- Henry

On Sun, Aug 18, 2019 at 9:23 PM Thomas Weise  wrote:

> +0 (binding)
>
> I don't think committers should be allowed to approve their own changes. I
> would prefer if non-committer contributors can approve committer PRs as
> that would encourage more participation in code review and ability to
> contribute.
>
>
> On Fri, Aug 16, 2019 at 9:02 PM Shaoxuan Wang  wrote:
>
> > +1 (binding)
> >
> > On Fri, Aug 16, 2019 at 7:48 PM Chesnay Schepler 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Although I think it would be a good idea to always cc
> > > priv...@flink.apache.org when modifying bylaws, if anything to speed
> up
> > > the voting process.
> > >
> > > On 16/08/2019 11:26, Ufuk Celebi wrote:
> > > > +1 (binding)
> > > >
> > > > – Ufuk
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 4:50 AM Biao Liu  wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> Thanks for pushing this!
> > > >>
> > > >> Thanks,
> > > >> Biao /'bɪ.aʊ/
> > > >>
> > > >>
> > > >>
> > > >> On Wed, 14 Aug 2019 at 09:37, Jark Wu  wrote:
> > > >>
> > > >>> +1 (non-binding)
> > > >>>
> > > >>> Best,
> > > >>> Jark
> > > >>>
> > > >>> On Wed, 14 Aug 2019 at 09:22, Kurt Young  wrote:
> > > >>>
> > >  +1 (binding)
> > > 
> > >  Best,
> > >  Kurt
> > > 
> > > 
> > >  On Wed, Aug 14, 2019 at 1:34 AM Yun Tang 
> wrote:
> > > 
> > > > +1 (non-binding)
> > > >
> > > > But I have a minor question about "code change" action, for those
> > > > "[hotfix]" github pull requests [1], the dev mailing list would
> not
> > > >> be
> > > > notified currently. I think we should change the description of
> > this
> > >  action.
> > > >
> > > > [1]
> > > >
> > > >>
> > >
> >
> https://flink.apache.org/contributing/contribute-code.html#code-contribution-process
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: JingsongLee 
> > > > Sent: Tuesday, August 13, 2019 23:56
> > > > To: dev 
> > > > Subject: Re: [VOTE] Flink Project Bylaws
> > > >
> > > > +1 (non-binding)
> > > > Thanks Becket.
> > > > I've learned a lot from current bylaws.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > >
> > > >
> --
> > > > From:Yu Li 
> > > > Send Time:2019年8月13日(星期二) 17:48
> > > > To:dev 
> > > > Subject:Re: [VOTE] Flink Project Bylaws
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Thanks for the efforts Becket!
> > > >
> > > > Best Regards,
> > > > Yu
> > > >
> > > >
> > > > On Tue, 13 Aug 2019 at 16:09, Xintong Song <
> tonysong...@gmail.com>
> > >  wrote:
> > > >> +1 (non-binding)
> > > >>
> > > >> Thank you~
> > > >>
> > > >> Xintong Song
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Aug 13, 2019 at 1:48 PM Robert Metzger <
> > > >> rmetz...@apache.org>
> > > >> wrote:
> > > >>
> > > >>> +1 (binding)
> > > >>>
> > > >>> On Tue, Aug 13, 2019 at 1:47 PM Becket Qin <
> becket@gmail.com
> > > > wrote:
> > >  Thanks everyone for voting.
> > > 
> > >  For those who have already voted, just want to bring this up
> to
> > >  your
> > >  attention that there is a minor clarification to the bylaws
> > > >> wiki
> > >  this
> > >  morning. The change is in bold format below:
> > > 
> > >  one +1 from a committer followed by a Lazy approval (not
> > > >> counting
> > >  the
> > > >>> vote
> > > > of the contributor), moving to lazy majority if a -1 is
> > > >>> received.
> > > 
> > >  Note that this implies that committers can +1 their own
> commits
> > > >>> and
> > > >> merge
> > > > right away. *However, the committe**rs should use their best
> > > >> judgement
> > > >>> to
> > > > respect the components expertise and ongoing development
> > > >> plan.*
> > > 
> > >  This addition does not really change anything the bylaws meant
> > > >> to
> > > > set.
> > > >> It
> > >  is simply a clarification. If anyone who have casted the vote
> > > > objects,
> > >  please feel free to withdraw the vote.
> > > 

Re: flink release-1.8.0 Flink-avro unit tests failed

2019-08-19 Thread Ethan Li
It’s probably an encoding problem. The environment I ran the unit tests on
uses ANSI_X3.4-1968.

It looks like we have to use "en_US.UTF-8".
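A small self-contained illustration (not the actual test code) of why a non-UTF-8 default charset turns those bytes into question marks:

```java
import java.nio.charset.StandardCharsets;

// Illustration only: encoding a non-ASCII character with an ASCII-only
// charset degrades it to '?', which matches the expected-vs-received
// difference in the Avro test output below.
public class CharsetDemo {

    static String asAscii(String s) {
        // Round-trip through US-ASCII; unmappable characters become '?'.
        return new String(s.getBytes(StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String decimalBytes = "\u00d0"; // sample non-ASCII content
        System.out.println(asAscii(decimalBytes)); // the 0xD0 byte prints as '?'
    }
}
```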


> On Aug 19, 2019, at 1:44 PM, Ethan Li  wrote:
> 
> Hello,
> 
> Not sure if anyone encountered this issue before.  I tried to run “mvn clean 
> install” on flink release-1.8, but some unit tests in the flink-avro module 
> failed:
> 
> 
> [ERROR] Tests run: 12, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 4.81 
> s <<< FAILURE! - in 
> org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest
> [ERROR] testSimpleAvroRead[Execution mode = 
> CLUSTER](org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest)  
> Time elapsed: 0.438 s  <<< FAILURE!
> java.lang.AssertionError: 
> Different elements in arrays: expected 2 elements and received 2
> files: [/tmp/junit5386344396421857812/junit6023978980792200274.tmp/4, 
> /tmp/junit5386344396421857812/junit6023978980792200274.tmp/2, 
> /tmp/junit5386344396421857812/junit6023978980792200274.tmp/1, 
> /tmp/junit5386344396421857812/junit6023978980792200274.tmp/3]
>  expected: [{"name": "Alyssa", "favorite_number": 256, "favorite_color": 
> null, "type_long_test": null, "type_double_test": 123.45, "type_null_test": 
> null, "type_bool_test": true, "type_array_string": ["ELEMENT 1", "ELEMENT 
> 2"], "type_array_boolean": [true, false], "type_nullable_array": null, 
> "type_enum": "GREEN", "type_map": {"KEY 2": 17554, "KEY 1": 8546456}, 
> "type_fixed": null, "type_union": null, "type_nested": {"num": 239, "street": 
> "Baker Street", "city": "London", "state": "London", "zip": "NW1 6XE"}, 
> "type_bytes": {"bytes": 
> "\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
> 2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
> "type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
> 123456, "type_decimal_bytes": {"bytes": "\u0007?"}, "type_decimal_fixed": [7, 
> -48]}, {"name": "Charlie", "favorite_number": null, "favorite_color": "blue", 
> "type_long_test": 1337, "type_double_test": 1.337, "type_null_test": null, 
> "type_bool_test": false, "type_array_string": [], "type_array_boolean": [], 
> "type_nullable_array": null, "type_enum": "RED", "type_map": {}, 
> "type_fixed": null, "type_union": null, "type_nested": {"num": 239, "street": 
> "Baker Street", "city": "London", "state": "London", "zip": "NW1 6XE"}, 
> "type_bytes": {"bytes": 
> "\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
> 2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
> "type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
> 123456, "type_decimal_bytes": {"bytes": "\u0007?"}, "type_decimal_fixed": [7, 
> -48]}]
>  received: [{"name": "Alyssa", "favorite_number": 256, "favorite_color": 
> null, "type_long_test": null, "type_double_test": 123.45, "type_null_test": 
> null, "type_bool_test": true, "type_array_string": ["ELEMENT 1", "ELEMENT 
> 2"], "type_array_boolean": [true, false], "type_nullable_array": null, 
> "type_enum": "GREEN", "type_map": {"KEY 2": 17554, "KEY 1": 8546456}, 
> "type_fixed": null, "type_union": null, "type_nested": {"num": 239, "street": 
> "Baker Street", "city": "London", "state": "London", "zip": "NW1 6XE"}, 
> "type_bytes": {"bytes": 
> "\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
> 2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
> "type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
> 123456, "type_decimal_bytes": {"bytes": "\u0007??"}, "type_decimal_fixed": 
> [7, -48]}, {"name": "Charlie", "favorite_number": null, "favorite_color": 
> "blue", "type_long_test": 1337, "type_double_test": 1.337, "type_null_test": 
> null, "type_bool_test": false, "type_array_string": [], "type_array_boolean": 
> [], "type_nullable_array": null, "type_enum": "RED", "type_map": {}, 
> "type_fixed": null, "type_union": null, "type_nested": {"num": 239, "street": 
> "Baker Street", "city": "London", "state": "London", "zip": "NW1 6XE"}, 
> "type_bytes": {"bytes": 
> "\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
> 2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
> "type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
> 123456, "type_decimal_bytes": {"bytes": "\u0007??"}, "type_decimal_fixed": 
> [7, -48]}]
>   at 
> org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest.after(AvroTypeExtractionTest.java:76)
> 
> 
> 
> Comparing “expected” with “received”, there is really some question mark 
> difference.
> 
> For example, in “expected’, it’s
> 
> "type_decimal_bytes": {"bytes": "\u0007?”}
> 
> While in “received”, it’s 
> 
> "type_decimal_bytes": {"bytes": "\u0007??"}
> 
> 
> I would really appreciate it if anyone could shed some light on this. Thanks
> 
> Ethan



[jira] [Created] (FLINK-13789) Transactional Id Generation fails due to user code impacting formatting string

2019-08-19 Thread Hao Dang (Jira)
Hao Dang created FLINK-13789:


 Summary: Transactional Id Generation fails due to user code 
impacting formatting string
 Key: FLINK-13789
 URL: https://issues.apache.org/jira/browse/FLINK-13789
 Project: Flink
  Issue Type: Bug
Reporter: Hao Dang


In
[TransactionalIdsGenerator.java|https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/internal/TransactionalIdsGenerator.java#L94],
the prefix contains the taskName of the particular task, which could ultimately
contain user code. In some cases where user code contains conversion specifiers
like %, the string formatting can fail.

For example, in Flink SQL a user could write a LIKE statement with a %
wildcard; the % wildcard will end up in the prefix and be mistreated during
formatting, causing the task to fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


flink release-1.8.0 Flink-avro unit tests failed

2019-08-19 Thread Ethan Li
Hello,

Not sure if anyone encountered this issue before.  I tried to run “mvn clean 
install” on flink release-1.8, but some unit tests in the flink-avro module failed:


[ERROR] Tests run: 12, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 4.81 s 
<<< FAILURE! - in org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest
[ERROR] testSimpleAvroRead[Execution mode = 
CLUSTER](org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest)  Time 
elapsed: 0.438 s  <<< FAILURE!
java.lang.AssertionError: 
Different elements in arrays: expected 2 elements and received 2
files: [/tmp/junit5386344396421857812/junit6023978980792200274.tmp/4, 
/tmp/junit5386344396421857812/junit6023978980792200274.tmp/2, 
/tmp/junit5386344396421857812/junit6023978980792200274.tmp/1, 
/tmp/junit5386344396421857812/junit6023978980792200274.tmp/3]
 expected: [{"name": "Alyssa", "favorite_number": 256, "favorite_color": null, 
"type_long_test": null, "type_double_test": 123.45, "type_null_test": null, 
"type_bool_test": true, "type_array_string": ["ELEMENT 1", "ELEMENT 2"], 
"type_array_boolean": [true, false], "type_nullable_array": null, "type_enum": 
"GREEN", "type_map": {"KEY 2": 17554, "KEY 1": 8546456}, "type_fixed": null, 
"type_union": null, "type_nested": {"num": 239, "street": "Baker Street", 
"city": "London", "state": "London", "zip": "NW1 6XE"}, "type_bytes": {"bytes": 
"\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
"type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
123456, "type_decimal_bytes": {"bytes": "\u0007?"}, "type_decimal_fixed": [7, 
-48]}, {"name": "Charlie", "favorite_number": null, "favorite_color": "blue", 
"type_long_test": 1337, "type_double_test": 1.337, "type_null_test": null, 
"type_bool_test": false, "type_array_string": [], "type_array_boolean": [], 
"type_nullable_array": null, "type_enum": "RED", "type_map": {}, "type_fixed": 
null, "type_union": null, "type_nested": {"num": 239, "street": "Baker Street", 
"city": "London", "state": "London", "zip": "NW1 6XE"}, "type_bytes": {"bytes": 
"\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
"type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
123456, "type_decimal_bytes": {"bytes": "\u0007?"}, "type_decimal_fixed": [7, 
-48]}]
 received: [{"name": "Alyssa", "favorite_number": 256, "favorite_color": null, 
"type_long_test": null, "type_double_test": 123.45, "type_null_test": null, 
"type_bool_test": true, "type_array_string": ["ELEMENT 1", "ELEMENT 2"], 
"type_array_boolean": [true, false], "type_nullable_array": null, "type_enum": 
"GREEN", "type_map": {"KEY 2": 17554, "KEY 1": 8546456}, "type_fixed": null, 
"type_union": null, "type_nested": {"num": 239, "street": "Baker Street", 
"city": "London", "state": "London", "zip": "NW1 6XE"}, "type_bytes": {"bytes": 
"\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
"type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
123456, "type_decimal_bytes": {"bytes": "\u0007??"}, "type_decimal_fixed": [7, 
-48]}, {"name": "Charlie", "favorite_number": null, "favorite_color": "blue", 
"type_long_test": 1337, "type_double_test": 1.337, "type_null_test": null, 
"type_bool_test": false, "type_array_string": [], "type_array_boolean": [], 
"type_nullable_array": null, "type_enum": "RED", "type_map": {}, "type_fixed": 
null, "type_union": null, "type_nested": {"num": 239, "street": "Baker Street", 
"city": "London", "state": "London", "zip": "NW1 6XE"}, "type_bytes": {"bytes": 
"\u\u\u\u\u\u\u\u\u\u"}, "type_date": 
2014-03-01, "type_time_millis": 12:12:12.000, "type_time_micros": 123456, 
"type_timestamp_millis": 2014-03-01T12:12:12.321Z, "type_timestamp_micros": 
123456, "type_decimal_bytes": {"bytes": "\u0007??"}, "type_decimal_fixed": [7, 
-48]}]
at 
org.apache.flink.formats.avro.typeutils.AvroTypeExtractionTest.after(AvroTypeExtractionTest.java:76)



Comparing “expected” with “received”, there is a question-mark 
difference.

For example, in “expected”, it’s

"type_decimal_bytes": {"bytes": "\u0007?"}

While in “received”, it’s 

"type_decimal_bytes": {"bytes": "\u0007??"}


I would really appreciate it if anyone could shed some light on this. Thanks

Ethan
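One hedged hypothesis, not established by anything in this thread: the expected/received strings may be produced by rendering the raw decimal bytes as text, and a byte sequence that is invalid in the decoding charset is rendered as a replacement character, which many consoles display as '?'. A small sketch for the fixed value [7, -48] that appears in the log:

```java
import java.nio.charset.StandardCharsets;

public class DecimalBytesRendering {
    public static void main(String[] args) {
        // The fixed decimal value from the test log: [7, -48] == {0x07, 0xD0}.
        byte[] decimal = {0x07, (byte) 0xD0};

        // 0xD0 begins a two-byte UTF-8 sequence; on its own it is malformed,
        // so decoding yields U+FFFD, which many consoles render as '?'.
        String utf8 = new String(decimal, StandardCharsets.UTF_8);
        System.out.println("utf8 has replacement char: " + utf8.contains("\uFFFD"));

        // A single-byte charset maps every byte to exactly one char with no
        // replacement, so the same bytes render differently there.
        String latin1 = new String(decimal, StandardCharsets.ISO_8859_1);
        System.out.println("latin1 has replacement char: " + latin1.contains("\uFFFD"));
    }
}
```

If this hypothesis holds, the number of '?' characters would depend on how the bytes are decoded, which could explain an expected/received mismatch across environments.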

[DISCUSS] Upgrade kinesis connector to Apache 2.0 License and include it in official release

2019-08-19 Thread Bowen Li
Hi all,

A while back we discussed upgrading flink-connector-kinesis module to
Apache 2.0 license so that its jar can be published as part of official
Flink releases. Given we have a large user base using Flink with
kinesis/dynamodb streams, it'll free users from building and maintaining
the module themselves, and improve user and developer experience. A ticket
was created [1] but has been idle, mainly because new releases of some AWS
libraries were not yet available at the time.

As of today I see that all flink-connector-kinesis's aws dependencies have
been updated to Apache 2.0 license and are released. They include:

- aws-java-sdk-kinesis
- aws-java-sdk-sts
- amazon-kinesis-client
- amazon-kinesis-producer (Apache 2.0 from 0.13.1, released 18 days ago) [2]
- dynamodb-streams-kinesis-adapter (Apache 2.0 from 1.5.0, released 7 days
ago) [3]

Therefore, I'd suggest we kick off the initiative and aim for release 1.10
which is roughly 3 months away, leaving us plenty of time to finish.
According to @Dyana's comment in the ticket [1], it seems the work involves
larger changes than simply upgrading lib versions, so we can further break
the JIRA down into sub-tasks to limit the scope of each change for easier
code review.

@Dyana, would you still be interested in taking ownership and driving the
effort forward?

Thanks,
Bowen

[1] https://issues.apache.org/jira/browse/FLINK-12847
[2] https://github.com/awslabs/amazon-kinesis-producer/releases
[3] https://github.com/awslabs/dynamodb-streams-kinesis-adapter/releases


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Stephan Ewen
@Xintong: Concerning "wait for memory users before task dispose and memory
release": I agree, that's how it should be. Let's try it out.

@Xintong @Jingsong: Concerning " JVM does not wait for GC when allocating
direct memory buffer": There seems to be pretty elaborate logic to free
buffers when allocating new ones. See
http://hg.openjdk.java.net/jdk8u/jdk8u-dev/jdk/file/tip/src/share/classes/java/nio/Bits.java#l643

@Till: Maybe. If we assume that the JVM default works (like going with
option 2 and not setting "-XX:MaxDirectMemorySize" at all), then I think it
should be okay to set "-XX:MaxDirectMemorySize" to
"off_heap_managed_memory + direct_memory" even if we use RocksDB. That is a
big if, though, I honestly have no idea :D Would be good to understand
this, though, because this would affect option (2) and option (1.2).
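The JDK code linked above (Bits#reserveMemory) does attempt cleanup before failing a direct allocation. A toy model of that retry logic — all names and the limit are illustrative, this is not the actual JDK implementation:

```java
public class ReserveMemorySketch {

    // Toy model of java.nio.Bits#reserveMemory (JDK 8): before a direct
    // allocation fails, the JDK tries to free dead direct buffers (reference
    // processing / System.gc()) and retries before throwing OOM.
    static final long MAX_DIRECT = 16 * 1024 * 1024; // stand-in for -XX:MaxDirectMemorySize
    static long reserved = 0;

    static boolean tryReserve(long size) {
        if (reserved + size <= MAX_DIRECT) {
            reserved += size;
            return true;
        }
        return false;
    }

    static void reserve(long size) {
        if (tryReserve(size)) {
            return;
        }
        // The real code triggers reference processing, then System.gc(),
        // then sleeps and retries with backoff before giving up.
        System.gc();
        if (!tryReserve(size)) {
            throw new OutOfMemoryError("Direct buffer memory");
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 16; i++) {
            reserve(1024 * 1024); // fills the 16 MB budget
        }
        try {
            reserve(1); // nothing is freed by GC in this toy model, so this fails
        } catch (OutOfMemoryError e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This illustrates both sides of the discussion: the JVM does try GC before a direct-memory OOM, but if buffers are still referenced (or GC is too slow relative to allocation), the retry does not help.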

On Mon, Aug 19, 2019 at 4:44 PM Xintong Song  wrote:

> Thanks for the inputs, Jingsong.
>
> Let me try to summarize your points. Please correct me if I'm wrong.
>
>- Memory consumers should always avoid returning memory segments to
>memory manager while there are still un-cleaned structures / threads
> that
>may use the memory. Otherwise, it would cause serious problems by having
>multiple consumers trying to use the same memory segment.
>- JVM does not wait for GC when allocating direct memory buffer.
>    Therefore, even if we set a proper max direct memory size limit, we may
>    still encounter a direct memory OOM if GC cleans memory more slowly than
>    direct memory is allocated.
>
> Am I understanding this correctly?
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 4:21 PM JingsongLee  .invalid>
> wrote:
>
> > Hi stephan:
> >
> > About option 2:
> >
> > If additional threads are not cleanly shut down before we exit the task:
> > in the current case of memory reuse, the task has already freed the memory
> > it uses. If this memory is then used by other tasks while asynchronous
> > threads of the exited task may still be writing to it, there will be
> > concurrency problems, and even errors in user computing results.
> >
> > So I think this is a serious and intolerable bug; no matter what the
> > option is, it should be avoided.
> >
> > About relying on GC to clean direct memory:
> > I don't think it is a good idea; I've encountered many situations where
> > GC came too late, causing DirectMemory OOM. The release and allocation of
> > DirectMemory depend on the type of user job, which is often beyond our
> > control.
> >
> > Best,
> > Jingsong Lee
> >
> >
> > --
> > From:Stephan Ewen 
> > Send Time:2019年8月19日(星期一) 15:56
> > To:dev 
> > Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> > TaskExecutors
> >
> > My main concern with option 2 (manually release memory) is that segfaults
> > in the JVM send off all sorts of alarms on user ends. So we need to
> > guarantee that this never happens.
> >
> > The trickyness is in tasks that uses data structures / algorithms with
> > additional threads, like hash table spill/read and sorting threads. We
> need
> > to ensure that these cleanly shut down before we can exit the task.
> > I am not sure that we have that guaranteed already, that's why option 1.1
> > seemed simpler to me.
> >
> > On Mon, Aug 19, 2019 at 3:42 PM Xintong Song 
> > wrote:
> >
> > > Thanks for the comments, Stephan. Summarized in this way really makes
> > > things easier to understand.
> > >
> > > I'm in favor of option 2, at least for the moment. I think it is not
> that
> > > difficult to keep it segfault safe for the memory manager, as long as we
> > > always de-allocate the memory segment when it is released from the memory
> > > consumers. The only exception is if a memory consumer continues using the
> > > buffer of a memory segment after releasing it, in which case we do want
> > > the job to fail so we detect the memory leak early.
> > >
> > > For option 1.2, I don't think this is a good idea. Not only because the
> > > assumption (regular GC is enough to clean direct buffers) may not
> always
> > be
> > > true, but also it makes it harder to find problems in cases of memory
> > > overuse. E.g., the user configured some direct memory for the user
> libraries.
> > > If the library actually uses more direct memory than configured, which
> > > cannot be cleaned by GC because they are still in use, may lead to
> > overuse
> > > of the total container memory. In that case, if it didn't touch the JVM
> > > default max direct memory limit, we cannot get a direct memory OOM and
> it
> > > will become super hard to understand which part of the configuration
> need
> > > to be updated.
> > >
> > > For option 1.1, it has the similar problem as 1.2, if the exceeded
> direct
> > > memory does not reach the max direct memory limit specified by the
> > > dedicated parameter. I think it is slightly better than 1.2, only
> because
> > > we can tune the parameter.
> > >
> > > Thank you~
> > >
> > > 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Tzu-Li (Gordon) Tai
Thanks for the comments and fast fixes.

@Becket Qin  I've quickly looked at the changes to
the PubSub connector. Given that it is an API-breaking change and is quite
local as a configuration change, I've decided to include that change in RC3.
@Jark @Timo Walther  I'll be adding FLINK-13699 as well.

Quick update regarding the LICENSE issue with flink-runtime-web: I've
double-checked this and the licenses for the new bundled Javascript
dependencies are actually already correctly present under the root
licenses-binary/ directory, so we actually don't need additional changes
for this.

I've started to create RC3 now, will post the vote as soon as it is ready.

Cheers,
Gordon

On Mon, Aug 19, 2019 at 3:28 PM Stephan Ewen  wrote:

> Looking at FLINK-13699, it seems to be very local to Table API and HBase
> connector.
> We can cherry-pick that without re-running distributed tests.
>
>
> On Mon, Aug 19, 2019 at 1:46 PM Till Rohrmann 
> wrote:
>
> > I've merged the fix for FLINK-13752. Hence we are good to go to create
> the
> > new RC.
> >
> > Cheers,
> > Till
> >
> > On Mon, Aug 19, 2019 at 1:30 PM Timo Walther  wrote:
> >
> > > I support Jark's fix for FLINK-13699 because it would be disappointing
> > > if both DDL and connectors are ready to handle DATE/TIME/TIMESTAMP but
> a
> > > little component in the middle of the stack is preventing an otherwise
> > > usable feature. The changes are minor.
> > >
> > > Thanks,
> > > Timo
> > >
> > >
> > > Am 19.08.19 um 13:24 schrieb Jark Wu:
> > > > Hi Gordon,
> > > >
> > > > I agree that we should pick the minimal set of changes to shorten the
> > > > release testing time.
> > > > However, I would like to include FLINK-13699 into RC3. FLINK-13699
> is a
> > > > critical DDL issue, and is a small change to flink table (won't
> affect
> > > the
> > > > runtime feature and stability).
> > > > I will do some tests around sql and blink planner if the RC3 include
> > this
> > > > fix.
> > > >
> > > > But if the community against to include it, I'm also fine with having
> > it
> > > in
> > > > the next minor release.
> > > >
> > > > Thanks,
> > > > Jark
> > > >
> > > > On Mon, 19 Aug 2019 at 16:16, Stephan Ewen  wrote:
> > > >
> > > >> +1 for Gordon's approach.
> > > >>
> > > >> If we do that, we can probably skip re-testing everything and mainly
> > > need
> > > >> to verify the release artifacts (signatures, build from source,
> etc.).
> > > >>
> > > >> If we open the RC up for changes, I fear a lot of small issues will
> > > rush in
> > > >> and destabilize the candidate again, meaning we have to do another
> > > larger
> > > >> testing effort.
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Aug 19, 2019 at 9:48 AM Becket Qin 
> > > wrote:
> > > >>
> > > >>> Hi Gordon,
> > > >>>
> > > >>> I remember we mentioned earlier that if there is an additional RC,
> we
> > > can
> > > >>> piggyback the GCP PubSub API change (
> > > >>> https://issues.apache.org/jira/browse/FLINK-13231). It is a small
> > > patch
> > > >> to
> > > >>> avoid future API change. So should be able to merge it very
> shortly.
> > > >> Would
> > > >>> it be possible to include that into RC3 as well?
> > > >>>
> > > >>> Thanks,
> > > >>>
> > > >>> Jiangjie (Becket) Qin
> > > >>>
> > > >>> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai <
> > > tzuli...@apache.org
> > > >>>
> > > >>> wrote:
> > > >>>
> > >  Hi,
> > > 
> > >  https://issues.apache.org/jira/browse/FLINK-13752 turns out to be
> > an
> > >  actual
> > >  blocker, so we would have to close this RC now in favor of a new
> > one.
> > > 
> > >  Since we are already quite past the planned release time for
> 1.9.0,
> > I
> > > >>> would
> > >  like to limit the new changes included in RC3 to only the
> following:
> > >  - https://issues.apache.org/jira/browse/FLINK-13752
> > >  - Fix license and notice file issues that Kurt had found with
> > >  flink-runtime-web and flink-state-processing-api
> > > 
> > >  This means that I will not be creating RC3 with the release-1.9
> > branch
> > > >> as
> > >  is, but essentially only cherry-picking the above mentioned
> changes
> > on
> > > >>> top
> > >  of RC2.
> > >  The minimal set of changes on top of RC2 should allow us to carry
> > most
> > > >> if
> > >  not all of the already existing votes without another round of
> > > >> extensive
> > >  testing, and allow us to have a shortened voting time.
> > > 
> > >  I understand that there are other issues mentioned in this thread
> > that
> > > >>> are
> > >  already spotted and merged to release-1.9, especially for the
> Blink
> > > >>> planner
> > >  and DDL, but I suggest not to include them in RC3.
> > >  I think it would be better to collect all the remaining issues for
> > > >> those
> > >  over a period of time, and include them as 1.9.1 which can ideally
> > > also
> > >  happen a few weeks soon after 1.9.0.
> > > 
> > >  What do you think? If 

Re: [DISCUSS][CODE STYLE] Breaking long function argument lists and chained method calls

2019-08-19 Thread Stephan Ewen
I personally prefer not to break lines with few parameters.
It just feels unnecessarily clumsy to parse the breaks if there are only
two or three arguments with short names.

So +1
  - for a hard line length limit
  - allowing arguments on the same line if below that limit
  - with consistent argument breaking when that length is exceeded
  - developers can break before that if they feel it helps with readability.

This should be similar to what we have, except for enforcing the line
length limit.

I think our Java guide originally suggested a 120-character line length; we
can reduce that to 100 if a majority argues for it, but that is a separate
discussion.
We use shorter lines in Scala (100 chars) because Scala code more quickly
becomes hard to read as lines grow long.
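A small illustration of the style under discussion — this is my reading of the proposal, not an adopted rule:

```java
public class WrappingStyleDemo {

    public static void main(String[] args) {
        // Short argument lists stay on one line (below the length limit):
        String a = join("flink", "dev");

        // When the limit is exceeded, chop down: one argument per line.
        String b = join(
            "a-rather-long-first-argument",
            "a-rather-long-second-argument");

        // Chained method calls: break before the dot, one call per line.
        String c = new StringBuilder()
            .append(a)
            .append('/')
            .append(b)
            .toString();
        System.out.println(c);
    }

    static String join(String left, String right) {
        return left + "-" + right;
    }
}
```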


On Mon, Aug 19, 2019 at 10:45 AM Andrey Zagrebin 
wrote:

> Hi Everybody,
>
> Thanks for your feedback guys and sorry for not getting back to the
> discussion for some time.
>
> @SHI Xiaogang
> About breaking lines for thrown exceptions:
> Indeed that would prevent growing the throw clause indefinitely.
> I am a bit concerned about putting the right parenthesis and/or throw
> clause on the next line
> because in general we do not do it, and there are many variations of how
> and what to put on the next line, so it needs explicit memorising.
> Also, we do not have many checked exceptions and usually avoid them.
> Although I am not a big fan of many function arguments either but this
> seems to be a bigger problem in the code base.
> I would be ok to not enforce anything for exceptions atm.
>
> @Chesnay Schepler 
> Thanks for mentioning automatic checks.
> Indeed, pointing out this kind of style issue during PR reviews is very
> tedious,
> and we cannot really enforce it without automated tools.
> I would still consider the outcome of this discussion as a soft
> recommendation atm (which we also have for some other things in the code
> style draft).
> We need more investigation about how to enforce things. I am not so
> knowledgable about code style/IDE checks.
> From the first glance I also do not see a simple way. If somebody has more
> insight please share your experience.
>
> @Biao Liu 
> Line length limitation:
> I do not see anything for Java, only for Scala: 100 (also enforced by build
> AFAIK).
> From what I heard there has been already some discussion about the hard
> limit for the line length.
> Although quite some people are in favour of it (including me) and seems to
> be a nice limitation,
> there are some practical implication about it.
> Historically, Flink did not have any code style checks, so huge chunks of
> the code base would have to be reformatted, destroying the commit history.
> Another thing is value for the limit. Nowadays, we have wide screens and do
> not often even need to scroll.
> Nevertheless, we can kick off another discussion about the line length
> limit and enforcing it.
> Atm I see people adhering to a soft recommendation of a 120-character line
> length for Java because it is usually a bit more verbose compared to Scala.
>
> *Question 1*:
> I would be ok to always break the line if there is more than one chained
> call.
> There are a lot of places where there is only one short call; I would not
> break the line in this case.
> If it is too confusing, I would be ok to stick to the rule to break either
> all or none.
> Thanks for pointing this out explicitly: for chained method calls, the
> new line should start with the dot.
> I think it should be also part of the rule if forced.
>
> *Question 2:*
> Should the indent of a new line be 1 tab or 2 tabs? (I assume it matters
> only for function arguments.)
> This is a good point which again probably deserves a separate thread.
> We also had an internal discussion about it. I would be also in favour of
> resolving it into one way.
> Atm we indeed have 2 ways in our code base which are again soft
> recommendations.
> The problem is mostly with enforcing it automatically.
> The approach with extra indentation also needs IDE setup; otherwise it is
> annoying
> that after every function cut/paste, e.g. IDEA changes the format to one
> indentation automatically, and often people do not notice it.
>
> I suggest we still finish this discussion to a point of achieving a soft
> recommendation which we can also reconsider
> when there are more ideas about automatically enforcing these things.
>
> Best,
> Andrey
>
> On Sat, Aug 3, 2019 at 7:51 AM SHI Xiaogang 
> wrote:
>
> > Hi Chesnay,
> >
> > Thanks a lot for your reminder.
> >
> > For Intellij settings, the style i proposed can be configured as below
> > * Method declaration parameters: chop down if long
> > * align when multiple: YES
> > * new line after '(': YES
> > * place ')' on new line: YES
> > * Method call arguments: chop down if long
> > * align when multiple: YES
> > * take priority over call chain wrapping: YES
> > * new line after '(': YES
> > * place ')' on new line: YES
> > * Throws list: chop down if long
> > * 

[jira] [Created] (FLINK-13788) Document state migration constraints on keys

2019-08-19 Thread Seth Wiesman (Jira)
Seth Wiesman created FLINK-13788:


 Summary: Document state migration constraints on keys
 Key: FLINK-13788
 URL: https://issues.apache.org/jira/browse/FLINK-13788
 Project: Flink
  Issue Type: Improvement
Reporter: Seth Wiesman


[https://lists.apache.org/thread.html/79d74334baecbcbd765cead2f90df470d3ffe55b839f208e9695a6e2@%3Cuser.flink.apache.org%3E]

 

Key migrations are not allowed, in order to prevent:

1) Key clashes

2) Changes in key group assignment



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

2019-08-19 Thread Stephan Ewen
@Andrey Will you open a PR to add this to the code style?

On Mon, Aug 19, 2019 at 11:51 AM Andrey Zagrebin 
wrote:

> Hi All,
>
> It looks like this proposal has an approval and we can conclude this
> discussion.
> Additionally, I agree with Piotr that we should really require a proven good
> reason for setting the capacity, to avoid confusion, redundancy and the other
> already-mentioned issues while reading and maintaining the code.
> Ideally the need for setting the capacity should be either immediately clear
> (e.g. performance) or explained in comments if it is non-trivial.
> Admittedly, this can easily enter a grey zone, so I would not strictly demand
> performance-measurement proof, e.g. if the size is known and it is "per
> record" code.
> At the end of the day it is a decision of the code developer and reviewer.
>
> The conclusion is then:
> Set the initial capacity only if there is a good proven reason to do it.
> Otherwise do not clutter the code with it.
>
> Best,
> Andrey
>
> On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski  wrote:
>
> > Hi,
> >
> > > - a bit more code, increases maintenance burden.
> >
> > I think there is even more to that. It’s almost like a code duplication,
> > albeit expressed in very different way, with all of the drawbacks of
> > duplicated code: initial capacity can drift out of sync, causing
> confusion.
> > Also it’s not “a bit more code”, it might be non trivial
> > reasoning/calculation how to set the initial value. Whenever we change
> > something/refactor the code, "maintenance burden” will mostly come from
> > that.
> >
> > Also I think this just usually falls under a premature optimisation rule.
> >
> > Besides:
> >
> > > The conclusion is the following at the moment:
> > > Only set the initial capacity if you have a good idea about the
> expected
> > size.
> >
> > I would add a clause to set the initial capacity “only for good proven
> > reasons”. It’s not about whether we can set it, but whether it makes
> sense
> > to do so (to avoid the before mentioned "maintenance burden”).
> >
> > Piotrek
> >
> > > On 1 Aug 2019, at 14:41, Xintong Song  wrote:
> > >
> > > +1 on setting initial capacity only when have good expectation on the
> > > collection size.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin 
> > wrote:
> > >
> > >> Hi all,
> > >>
> > >> As you probably already noticed, Stephan has triggered a discussion
> > thread
> > >> about code style guide for Flink [1]. Recently we were discussing
> > >> internally some smaller concerns and I would like start separate
> threads
> > >> for them.
> > >>
> > >> This thread is about creating collections always with initial
> capacity.
> > As
> > >> you might have seen, some parts of our code base always initialise
> > >> collections with some non-default capacity. You can even activate a
> > check
> > >> in IntelliJ Idea that can monitor and highlight creation of collection
> > >> without initial capacity.
> > >>
> > >> Pros:
> > >> - performance gain if there is a good reasoning about initial capacity
> > >> - the capacity is always deterministic and does not depend on any
> > changes
> > >> of its default value in Java
> > >> - easy to follow: always initialise, has IDE support for detection
> > >>
> > >> Cons (for initialising w/o good reasoning):
> > >> - We are trying to outsmart JVM. When there is no good reasoning about
> > >> initial capacity, we can rely on JVM default value.
> > >> - It is even confusing e.g. for hash maps as the real size depends on
> > the
> > >> load factor.
> > >> - It would only add minor performance gain.
> > >> - a bit more code, increases maintenance burden.
> > >>
> > >> The conclusion is the following at the moment:
> > >> Only set the initial capacity if you have a good idea about the
> expected
> > >> size.
> > >>
> > >> Please, feel free to share you thoughts.
> > >>
> > >> Best,
> > >> Andrey
> > >>
> > >> [1]
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3ced91df4b-7cab-4547-a430-85bc710fd...@apache.org%3E
> > >>
> >
> >
>
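The load-factor subtlety mentioned in this thread (a HashMap resizes based on capacity × load factor, not the constructor argument alone) can be sketched as follows — plain Java, illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

public class InitialCapacityDemo {
    public static void main(String[] args) {
        int expectedEntries = 16;

        // Naive pre-sizing: new HashMap<>(16) resizes once it exceeds
        // 16 * 0.75 = 12 entries, so it does NOT avoid rehashing for 16 keys.
        Map<String, Integer> naive = new HashMap<>(expectedEntries);

        // Pre-sizing that accounts for the default 0.75 load factor:
        Map<String, Integer> sized =
            new HashMap<>((int) (expectedEntries / 0.75f) + 1);

        for (int i = 0; i < expectedEntries; i++) {
            naive.put("k" + i, i);
            sized.put("k" + i, i);
        }
        // Both maps end up with the same contents; the difference is only
        // the internal resizing work, which is why pre-sizing needs a proven
        // reason rather than being applied everywhere by default.
        System.out.println(naive.size() + " " + sized.size());
    }
}
```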


Re: [DISCUSS][CODE STYLE] Usage of Java Optional

2019-08-19 Thread Stephan Ewen
For the use of optional in private methods: It sounds fine to me, because
it means it is strictly class-internal (between methods and helper methods)
and does not leak beyond that.


On Mon, Aug 19, 2019 at 5:53 PM Andrey Zagrebin 
wrote:

> Hi all,
>
> Sorry for not getting back to this discussion for some time.
> It looks like in general we agree on the initially suggested points:
>
>- Always use Optional only to return nullable values in the API/public
>methods
>   - Only if you can prove that Optional usage would lead to a
>   performance degradation in critical code then return nullable value
>   directly and annotate it with @Nullable
>- Passing an Optional argument to a method can be allowed if it is
>within a private helper method and simplifies the code
>- Optional should not be used for class fields
>
> The first point can also be elaborated by explicitly forbidding
> Optional/Nullable parameters in public methods.
> In general we can always avoid Optional parameters by either overloading
> the method or using a pojo with a builder to pass a set of parameters.
>
> The third point does not prevent from using @Nullable/@Nonnull for class
> fields.
> If we agree to not use Optional for fields then not sure I see any use case
> for SerializableOptional (please comment on that if you have more details).
>
> @Jingsong Lee
> Using Optional in Maps.
> I can see this as a possible use case.
> I would leave this decision to the developer/reviewer to reason about it
> and keep the scope of this discussion to the variables/parameters as it
> seems to be the biggest point of friction atm.
>
> Finally, I see a split regarding the second point: <Passing an Optional
> argument to a method can be allowed if it is within a private helper method
> and simplifies the code>.
> There are people who have a strong opinion against allowing it.
> Let's vote then for whether to allow it or not.
> So far as I see we have the following votes (correct me if wrong and add
> more if you want):
> Piotr+1
> Biao+1
> Timo   -1
> Yu   -1
> Aljoscha -1
> Till  +1
> Igal+1
> Dawid-1
> Me +1
>
> So far: +5 / -4 (Optional arguments in private methods)
>
> Best,
> Andrey
>
>
> On Tue, Aug 6, 2019 at 8:53 AM Piotr Nowojski  wrote:
>
> > Hi Qi,
> >
> > > For example, SingleInputGate is already creating Optional for every
> > BufferOrEvent in getNextBufferOrEvent(). How much performance gain would
> we
> > get if it’s replaced by null check?
> >
> > When I was introducing it there, I have micro-benchmarked this change,
> and
> > there was no visible throughput change in a pure network only micro
> > benchmark (with whole Flink running around it any changes would be even
> > less visible).
> >
> > Piotrek
> >
> > > On 5 Aug 2019, at 15:20, Till Rohrmann  wrote:
> > >
> > > I'd be in favour of
> > >
> > > - Optional for method return values if not performance critical
> > > - Optional can be used for internal methods if it makes sense
> > > - No optional fields
> > >
> > > Cheers,
> > > Till
> > >
> > > On Mon, Aug 5, 2019 at 12:07 PM Aljoscha Krettek 
> > > wrote:
> > >
> > >> Hi,
> > >>
> > >> I’m also in favour of using Optional only for method return values. I
> > >> could be persuaded to allow them for parameters of internal methods
> but
> > I’m
> > >> sceptical about that.
> > >>
> > >> Aljoscha
> > >>
> > >>> On 2. Aug 2019, at 15:35, Yu Li  wrote:
> > >>>
> > >>> TL; DR: I second Timo that we should use Optional only as method
> return
> > >>> type for non-performance critical code.
> > >>>
> > >>> From the example given on our AvroFactory [1] I also noticed that
> > >> Jetbrains
> > >>> marks the OptionalUsedAsFieldOrParameterType inspection as a warning.
> > >> It's
> > >>> relatively easy to understand why it's not suggested to use
> (java.util)
> > >>> Optional as a field since it's not serializable. What made me feel
> > >> curious
> > >>> is that why we shouldn't use it as a parameter type, so I did some
> > >>> investigation and here is what I found:
> > >>>
> > >>> There's a JB blog talking about java8 top tips [2] where we could
> find
> > >> the
> > >>> advice around Optional, there I found another blog telling about the
> > >>> pragmatic approach of using Optional [3]. Reading further we could
> see
> > >> the
> > >>> reason why we shouldn't use Optional as parameter type, please allow
> me
> > >> to
> > >>> quote here:
> > >>>
> > >>> It is often the case that domain objects hang about in memory for a
> > fair
> > >>> while, as processing in the application occurs, making each optional
> > >>> instance rather long-lived (tied to the lifetime of the domain
> object).
> > >> By
> > >>> contrast, the Optionalinstance returned from the getter is likely to
> be
> > >>> very short-lived. The caller will call the getter, interpret the
> > result,
> > >>> and then move on. If you know anything about garbage collection
> you'll
> > >> know
> > >>> that the JVM 
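A sketch of the conventions this thread converges on — Optional for public return values, plain nullable fields instead of Optional fields, and Optional parameters at most in private helpers (the contested +5/-4 case). All names are illustrative:

```java
import java.util.Optional;

public class OptionalStyleDemo {

    // Public API method: an Optional return type makes absence explicit.
    static Optional<String> findSetting(String key) {
        return "host".equals(key) ? Optional.of("localhost") : Optional.empty();
    }

    // Class field: a plain nullable reference, never Optional
    // (in real code this would carry a @Nullable annotation).
    private static String cachedHost;

    // Private helper: an Optional parameter is the contested case that the
    // vote in this thread narrowly allows.
    private static String orDefault(Optional<String> value, String fallback) {
        return value.orElse(fallback);
    }

    public static void main(String[] args) {
        cachedHost = orDefault(findSetting("host"), "unknown");
        System.out.println(cachedHost);
        System.out.println(orDefault(findSetting("port"), "unknown"));
    }
}
```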

Re: [DISCUSS][CODE STYLE] Usage of Java Optional

2019-08-19 Thread Andrey Zagrebin
Hi all,

Sorry for not getting back to this discussion for some time.
It looks like in general we agree on the initially suggested points:

   - Always use Optional only to return nullable values in the API/public
   methods
  - Only if you can prove that Optional usage would lead to a
  performance degradation in critical code then return nullable value
  directly and annotate it with @Nullable
   - Passing an Optional argument to a method can be allowed if it is
   within a private helper method and simplifies the code
   - Optional should not be used for class fields

The first point can also be elaborated by explicitly forbidding
Optional/Nullable parameters in public methods.
In general we can always avoid Optional parameters by either overloading
the method or using a pojo with a builder to pass a set of parameters.

The third point does not prevent from using @Nullable/@Nonnull for class
fields.
If we agree to not use Optional for fields then not sure I see any use case
for SerializableOptional (please comment on that if you have more details).

@Jingsong Lee
Using Optional in Maps.
I can see this as a possible use case.
I would leave this decision to the developer/reviewer to reason about it
and keep the scope of this discussion to the variables/parameters as it
seems to be the biggest point of friction atm.

Finally, I see a split regarding the second point.
There are people who have a strong opinion against allowing it.
Let's vote then for whether to allow it or not.
So far as I see we have the following votes (correct me if wrong and add
more if you want):
Piotr: +1
Biao: +1
Timo: -1
Yu: -1
Aljoscha: -1
Till: +1
Igal: +1
Dawid: -1
Me: +1

So far: +5 / -4 (Optional arguments in private methods)

Best,
Andrey


On Tue, Aug 6, 2019 at 8:53 AM Piotr Nowojski  wrote:

> Hi Qi,
>
> > For example, SingleInputGate is already creating Optional for every
> BufferOrEvent in getNextBufferOrEvent(). How much performance gain would we
> get if it’s replaced by null check?
>
> When I was introducing it there, I have micro-benchmarked this change, and
> there was no visible throughput change in a pure network only micro
> benchmark (with whole Flink running around it any changes would be even
> less visible).
>
> Piotrek
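For illustration, the two variants being compared here — an Optional-wrapping getter versus a nullable return on the hot path — might look like this (hypothetical simplified types, not the actual SingleInputGate API):

```java
import java.util.Optional;

// Hypothetical hot-path comparison (not the real SingleInputGate API):
// the Optional variant allocates a short-lived wrapper per call; the
// nullable variant returns the value directly and relies on a null
// check at the caller.
class GateExample {

    private String next; // stand-in for BufferOrEvent

    void offer(String b) {
        this.next = b;
    }

    // Variant 1: Optional return - allocates an Optional per call.
    Optional<String> getNextOptional() {
        String b = next;
        next = null;
        return Optional.ofNullable(b);
    }

    // Variant 2: nullable return - no wrapper allocation on the hot path.
    String getNextNullable() {
        String b = next;
        next = null;
        return b;
    }
}
```

As Piotrek notes, the wrapper in variant 1 is short-lived, which is exactly the case the JVM handles well — hence no visible throughput difference in the micro-benchmark.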
>
> > On 5 Aug 2019, at 15:20, Till Rohrmann  wrote:
> >
> > I'd be in favour of
> >
> > - Optional for method return values if not performance critical
> > - Optional can be used for internal methods if it makes sense
> > - No optional fields
> >
> > Cheers,
> > Till
> >
> > On Mon, Aug 5, 2019 at 12:07 PM Aljoscha Krettek 
> > wrote:
> >
> >> Hi,
> >>
> >> I’m also in favour of using Optional only for method return values. I
> >> could be persuaded to allow them for parameters of internal methods but
> I’m
> >> sceptical about that.
> >>
> >> Aljoscha
> >>
> >>> On 2. Aug 2019, at 15:35, Yu Li  wrote:
> >>>
> >>> TL; DR: I second Timo that we should use Optional only as method return
> >>> type for non-performance critical code.
> >>>
> >>> From the example given on our AvroFactory [1] I also noticed that
> >> Jetbrains
> >>> marks the OptionalUsedAsFieldOrParameterType inspection as a warning.
> >> It's
> >>> relatively easy to understand why it's not suggested to use (java.util)
> >>> Optional as a field since it's not serializable. What made me feel
> >> curious
> >>> is that why we shouldn't use it as a parameter type, so I did some
> >>> investigation and here is what I found:
> >>>
> >>> There's a JB blog talking about java8 top tips [2] where we could find
> >> the
> >>> advice around Optional, there I found another blog telling about the
> >>> pragmatic approach of using Optional [3]. Reading further we could see
> >> the
> >>> reason why we shouldn't use Optional as parameter type, please allow me
> >> to
> >>> quote here:
> >>>
> >>> It is often the case that domain objects hang about in memory for a
> fair
> >>> while, as processing in the application occurs, making each optional
> >>> instance rather long-lived (tied to the lifetime of the domain object).
> >> By
> >>> contrast, the Optional instance returned from the getter is likely to be
> >>> very short-lived. The caller will call the getter, interpret the
> result,
> >>> and then move on. If you know anything about garbage collection you'll
> >> know
> >>> that the JVM handles these short-lived objects well. In addition, there
> >> is
> >>> more potential for hotspot to remove the costs of the Optional instance
> >>> when it is short lived. While it is easy to claim this is "premature
> >>> optimization", as engineers it is our responsibility to know the limits
> >> and
> >>> capabilities of the system we work with and to choose carefully the
> point
> >>> where it should be stressed.
> >>>
> >>> And there's another JB blog about code smell on Null [4], which I'd
> also
> >>> suggest to read(smile).
> >>>
> >>> [1]
> >>>
> >>
> 

[jira] [Created] (FLINK-13787) PrometheusPushGatewayReporter does not cleanup TM metrics when run on kubernetes

2019-08-19 Thread Kaibo Zhou (Jira)
Kaibo Zhou created FLINK-13787:
--

 Summary: PrometheusPushGatewayReporter does not cleanup TM metrics 
when run on kubernetes
 Key: FLINK-13787
 URL: https://issues.apache.org/jira/browse/FLINK-13787
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.8.1, 1.7.2, 1.9.0
Reporter: Kaibo Zhou


I ran a Flink job on Kubernetes using the PrometheusPushGatewayReporter, and I 
can see the metrics from the Flink jobmanager and taskmanager in the push 
gateway's UI.

When I cancel the job, the jobmanager's metrics disappear, but the 
taskmanager's metrics still exist, even though I have set 
deleteOnShutdown to true.

The configuration is:
{code:java}
metrics.reporters: "prom"
metrics.reporter.prom.class: 
"org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"
metrics.reporter.prom.jobName: "WordCount"
metrics.reporter.prom.host: "localhost"
metrics.reporter.prom.port: "9091"
metrics.reporter.prom.randomJobNameSuffix: "true"
metrics.reporter.prom.filterLabelValueCharacters: "true"
metrics.reporter.prom.deleteOnShutdown: "true"
{code}
 

Other people have also encountered this problem: 
https://stackoverflow.com/questions/54420498/flink-prometheus-push-gateway-reporter-delete-metrics-on-job-shutdown

And another similar issue: 
[FLINK-11457|https://issues.apache.org/jira/browse/FLINK-11457].

 

As Prometheus is a very important metrics system on Kubernetes, solving this 
problem would be very beneficial for users monitoring their Flink jobs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS] Reducing build times

2019-08-19 Thread Robert Metzger
Hi all,

I have summarized all arguments mentioned so far + some additional research
into a Wiki page here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279

I'm happy to hear further comments on my summary! I'm pretty sure we can
find more pro's and con's for the different options.

My opinion after looking at the options:

   - Flink relies on an outdated build tool (Maven), while a good
   alternative is well-established (gradle), and will likely provide a much
   better CI and local build experience through incremental build and cached
   intermediates.
   Scripting around Maven, or splitting modules / test execution /
   repositories won't solve this problem. We should rather spend the effort in
   migrating to a modern build tool which will provide us benefits in the long
   run.
   - Flink relies on a fairly slow build service (Travis CI), while simply
   throwing more money at the problem could cut the build time at least in
   half.
   We should consider using a build service that provides bigger machines
   to solve our build time problem.

My opinion is based on many assumptions (gradle is actually as fast as
promised (haven't used it before), we can build Flink with gradle, we find
sponsors for bigger build machines) that we need to test first through PoCs.

Best,
Robert
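To make the "cached intermediates" point concrete: in Gradle, incremental builds and build caching are switched on with a few properties (a sketch under the assumption of a hypothetical Gradle port — Flink's build is Maven today):

```properties
# gradle.properties (hypothetical -- Flink currently has no Gradle build)
org.gradle.caching=true     # reuse up-to-date task outputs across builds
org.gradle.parallel=true    # build decoupled modules concurrently
org.gradle.jvmargs=-Xmx4g   # headroom for a large multi-module build
```

With these set, a change to a single module only rebuilds that module and its dependents, which is the incremental behavior referred to above.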




On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek 
wrote:

> I did a quick test: a normal "mvn clean install -DskipTests
> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo” on my machine
> takes about 14 minutes. After removing all mentions of maven-shade-plugin
> the build time goes down to roughly 11.5 minutes. (Obviously the resulting
> Flink won’t work, because some expected stuff is not packaged and most of
> the end-to-end tests use the shade plugin to package the jars for testing.)
>
> Aljoscha
>
> > On 18. Aug 2019, at 19:52, Robert Metzger  wrote:
> >
> > Hi all,
> >
> > I wanted to understand the impact of the hardware we are using for
> running
> > our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> > They are using Google Cloud Compute Engine *n1-standard-2* instances.
> > Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> >
> > Running the same workload on a 32 virtual cores, 64 gb machine, takes
> *1:21
> > h*.
> >
> > What is interesting are the per-module build time differences.
> > Modules which are parallelizing tests well greatly benefit from the
> > additional cores:
> > "flink-tests" 36:51 min vs 4:33 min
> > "flink-runtime" 23:41 min vs 3:47 min
> > "flink-table-planner" 15:54 min vs 3:13 min
> >
> > On the other hand, we have modules which are not parallel at all:
> > "flink-connector-kafka": 16:32 min vs 15:19 min
> > "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> > Also, the checkstyle plugin is not scaling at all.
> >
> > Chesnay reported some significant speedups by reusing forks.
> > I don't know how much effort it would be to make the Kafka tests
> > parallelizable. In total, they currently use 30 minutes on the big
> machine
> > (while 31 CPUs are idling :) )
> >
> > Let me know what you think about these results. If the community is
> > generally interested in further investigating into that direction, I
> could
> > look into software to orchestrate this, as well as sponsors for such an
> > infrastructure.
> >
> > [1] https://docs.travis-ci.com/user/reference/overview/
> >
> >
> > On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler 
> wrote:
> >
> >> @Aljoscha Shading takes a few minutes for a full build; you can see this
> >> quite easily by looking at the compile step in the misc profile; all
> >> modules that take longer than a fraction of a second are usually caused
> >> by shading lots of classes. Note that I cannot tell you how much of this
> >> is spent on relocations, and how much on writing the jar.
> >>
> >> Personally, I'd very much like us to move all shading to flink-shaded;
> >> this would finally allows us to use newer maven versions without needing
> >> cumbersome workarounds for flink-dist. However, this isn't a trivial
> >> affair in some cases; IIRC calcite could be difficult to handle.
> >>
> >> On another note, this would also simplify switching the main repo to
> >> another build system, since you would no longer had to deal with
> >> relocations, just packaging + merging NOTICE files.
> >>
> >> @BowenLi I disagree, flink-shaded does not include any tests,  API
> >> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
> >> and flink-dist, where both relocate dependencies and one is bundled by
> >> the other), and, most importantly, CI (and really, without CI being
> >> covered in a PoC there's nothing to discuss).
> >>
> >> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>> Speaking of flink-shaded, do we have any idea what the impact of
> shading
> >> is on the build time? We could get rid of shading completely 

[jira] [Created] (FLINK-13783) Implement type inference for string functions

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13783:


 Summary: Implement type inference for string functions
 Key: FLINK-13783
 URL: https://issues.apache.org/jira/browse/FLINK-13783
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13786) Implement type inference for other functions

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13786:


 Summary: Implement type inference for other functions
 Key: FLINK-13786
 URL: https://issues.apache.org/jira/browse/FLINK-13786
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13785) Implement type inference for time functions

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13785:


 Summary: Implement type inference for time functions
 Key: FLINK-13785
 URL: https://issues.apache.org/jira/browse/FLINK-13785
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13784) Implement type inference for math functions

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13784:


 Summary: Implement type inference for math functions
 Key: FLINK-13784
 URL: https://issues.apache.org/jira/browse/FLINK-13784
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13782) Implement type inference for logic functions

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13782:


 Summary: Implement type inference for logic functions
 Key: FLINK-13782
 URL: https://issues.apache.org/jira/browse/FLINK-13782
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / API
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13780) Introduce ExpressionConverter to legacy planner

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13780:


 Summary: Introduce ExpressionConverter to legacy planner
 Key: FLINK-13780
 URL: https://issues.apache.org/jira/browse/FLINK-13780
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Legacy Planner
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13781) Use new Expression in RexNodeToExpressionConverter

2019-08-19 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-13781:


 Summary: Use new Expression in RexNodeToExpressionConverter
 Key: FLINK-13781
 URL: https://issues.apache.org/jira/browse/FLINK-13781
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Legacy Planner
Reporter: Jingsong Lee
 Fix For: 1.10.0






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Xintong Song
Thanks for the inputs, Jingsong.

Let me try to summarize your points. Please correct me if I'm wrong.

   - Memory consumers should always avoid returning memory segments to
   memory manager while there are still un-cleaned structures / threads that
   may use the memory. Otherwise, it would cause serious problems by having
   multiple consumers trying to use the same memory segment.
   - JVM does not wait for GC when allocating a direct memory buffer.
   Therefore, even if we set a proper max direct memory size limit, we may
   still encounter a direct memory OOM if the GC cleans memory more slowly
   than direct memory is allocated.

Am I understanding this correctly?

Thank you~

Xintong Song
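For illustration, the "dedicated bookkeeping" alternative (option 1.2 from Stephan's summary) could be sketched like this — a hypothetical tracker, not Flink's actual MemoryManager, that counts reserved off-heap bytes itself and nudges the GC once before rejecting a reservation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical bookkeeping for off-heap reservations (option 1.2 sketch):
// track reserved bytes ourselves instead of relying only on
// -XX:MaxDirectMemorySize, and trigger a GC when the budget is tight so
// that unreferenced direct buffers get cleaned before we fail.
class DirectMemoryBookkeeper {

    private final long budgetBytes;
    private final AtomicLong reserved = new AtomicLong();

    DirectMemoryBookkeeper(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    /** Reserve bytes; trigger a GC once and retry before giving up. */
    boolean reserve(long bytes) {
        if (tryReserve(bytes)) {
            return true;
        }
        // Memory held by unreferenced buffers is only returned after a
        // GC, so request one and retry once.
        System.gc();
        return tryReserve(bytes);
    }

    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    private boolean tryReserve(long bytes) {
        long current;
        do {
            current = reserved.get();
            if (current + bytes > budgetBytes) {
                return false;
            }
        } while (!reserved.compareAndSet(current, current + bytes));
        return true;
    }

    long reservedBytes() {
        return reserved.get();
    }
}
```

The key point is that the budget here is independent of the "-XX:MaxDirectMemorySize" JVM parameter, which is exactly what distinguishes option 1.2 from option 1.1.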



On Mon, Aug 19, 2019 at 4:21 PM JingsongLee 
wrote:

> Hi stephan:
>
> About option 2:
>
> if additional threads not cleanly shut down before we can exit the task:
> In the current case of memory reuse, it has freed up the memory it
>  uses. If this memory is used by other tasks and asynchronous threads
>  of exited task may still be writing, there will be concurrent security
>  problems, and even lead to errors in user computing results.
>
> So I think this is a serious and intolerable bug, No matter what the
>  option is, it should be avoided.
>
> About direct memory cleaned by GC:
> I don't think it is a good idea, I've encountered so many situations
>  that it's too late for GC to cause DirectMemory OOM. Release and
>  allocate DirectMemory depend on the type of user job, which is
>  often beyond our control.
>
> Best,
> Jingsong Lee
>
>
> --
> From:Stephan Ewen 
> Send Time:2019年8月19日(星期一) 15:56
> To:dev 
> Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> TaskExecutors
>
> My main concern with option 2 (manually release memory) is that segfaults
> in the JVM send off all sorts of alarms on user ends. So we need to
> guarantee that this never happens.
>
> The trickyness is in tasks that uses data structures / algorithms with
> additional threads, like hash table spill/read and sorting threads. We need
> to ensure that these cleanly shut down before we can exit the task.
> I am not sure that we have that guaranteed already, that's why option 1.1
> seemed simpler to me.
>
> On Mon, Aug 19, 2019 at 3:42 PM Xintong Song 
> wrote:
>
> > Thanks for the comments, Stephan. Summarized in this way really makes
> > things easier to understand.
> >
> > I'm in favor of option 2, at least for the moment. I think it is not that
> > difficult to keep it segfault safe for memory manager, as long as we
> always
> > de-allocate the memory segment when it is released from the memory
> > consumers. Only if the memory consumer continue using the buffer of
> memory
> > segment after releasing it, in which case we do want the job to fail so
> we
> > detect the memory leak early.
> >
> > For option 1.2, I don't think this is a good idea. Not only because the
> > assumption (regular GC is enough to clean direct buffers) may not always
> be
> > true, but also it makes harder for finding problems in cases of memory
> > overuse. E.g., user configured some direct memory for the user libraries.
> > If the library actually use more direct memory then configured, which
> > cannot be cleaned by GC because they are still in use, may lead to
> overuse
> > of the total container memory. In that case, if it didn't touch the JVM
> > default max direct memory limit, we cannot get a direct memory OOM and it
> > will become super hard to understand which part of the configuration need
> > to be updated.
> >
> > For option 1.1, it has the similar problem as 1.2, if the exceeded direct
> > memory does not reach the max direct memory limit specified by the
> > dedicated parameter. I think it is slightly better than 1.2, only because
> > we can tune the parameter.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen  wrote:
> >
> > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> > it a
> > > bit differently:
> > >
> > > We have the following two options:
> > >
> > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > segfault
> > > safe. But then we need a way to trigger GC in case de-allocation and
> > > re-allocation of a bunch of segments happens quickly, which is often
> the
> > > case during batch scheduling or task restart.
> > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> > >   - Another way could be to have a dedicated bookkeeping in the
> > > MemoryManager (option 1.2), so that this is a number independent of the
> > > "-XX:MaxDirectMemorySize" parameter.
> > >
> > > (2) We manually allocate and de-allocate the memory for the
> > MemorySegments
> > > (option 2). That way we need not worry about triggering GC by some
> > > threshold or bookkeeping, but it is harder to prevent segfaults. We
> need
> > to
> > > be very careful about when we release the 

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Till Rohrmann
Quick question for option 1.1, Stephan: does this variant entail that we
distinguish between native and direct memory within the off-heap managed
memory? If this is the case, then it won't be possible for users to run a streaming
job using RocksDB and a batch DataSet job on the same session cluster
unless they have configured the off heap managed memory to be twofold (e.g.
50% native, 50% direct memory).

Cheers,
Till

On Mon, Aug 19, 2019 at 4:21 PM JingsongLee 
wrote:

> Hi stephan:
>
> About option 2:
>
> if additional threads not cleanly shut down before we can exit the task:
> In the current case of memory reuse, it has freed up the memory it
>  uses. If this memory is used by other tasks and asynchronous threads
>  of exited task may still be writing, there will be concurrent security
>  problems, and even lead to errors in user computing results.
>
> So I think this is a serious and intolerable bug, No matter what the
>  option is, it should be avoided.
>
> About direct memory cleaned by GC:
> I don't think it is a good idea, I've encountered so many situations
>  that it's too late for GC to cause DirectMemory OOM. Release and
>  allocate DirectMemory depend on the type of user job, which is
>  often beyond our control.
>
> Best,
> Jingsong Lee
>
>
> --
> From:Stephan Ewen 
> Send Time:2019年8月19日(星期一) 15:56
> To:dev 
> Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for
> TaskExecutors
>
> My main concern with option 2 (manually release memory) is that segfaults
> in the JVM send off all sorts of alarms on user ends. So we need to
> guarantee that this never happens.
>
> The trickyness is in tasks that uses data structures / algorithms with
> additional threads, like hash table spill/read and sorting threads. We need
> to ensure that these cleanly shut down before we can exit the task.
> I am not sure that we have that guaranteed already, that's why option 1.1
> seemed simpler to me.
>
> On Mon, Aug 19, 2019 at 3:42 PM Xintong Song 
> wrote:
>
> > Thanks for the comments, Stephan. Summarized in this way really makes
> > things easier to understand.
> >
> > I'm in favor of option 2, at least for the moment. I think it is not that
> > difficult to keep it segfault safe for memory manager, as long as we
> always
> > de-allocate the memory segment when it is released from the memory
> > consumers. Only if the memory consumer continue using the buffer of
> memory
> > segment after releasing it, in which case we do want the job to fail so
> we
> > detect the memory leak early.
> >
> > For option 1.2, I don't think this is a good idea. Not only because the
> > assumption (regular GC is enough to clean direct buffers) may not always
> be
> > true, but also it makes harder for finding problems in cases of memory
> > overuse. E.g., user configured some direct memory for the user libraries.
> > If the library actually use more direct memory then configured, which
> > cannot be cleaned by GC because they are still in use, may lead to
> overuse
> > of the total container memory. In that case, if it didn't touch the JVM
> > default max direct memory limit, we cannot get a direct memory OOM and it
> > will become super hard to understand which part of the configuration need
> > to be updated.
> >
> > For option 1.1, it has the similar problem as 1.2, if the exceeded direct
> > memory does not reach the max direct memory limit specified by the
> > dedicated parameter. I think it is slightly better than 1.2, only because
> > we can tune the parameter.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen  wrote:
> >
> > > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> > it a
> > > bit differently:
> > >
> > > We have the following two options:
> > >
> > > (1) We let MemorySegments be de-allocated by the GC. That makes it
> > segfault
> > > safe. But then we need a way to trigger GC in case de-allocation and
> > > re-allocation of a bunch of segments happens quickly, which is often
> the
> > > case during batch scheduling or task restart.
> > >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> > >   - Another way could be to have a dedicated bookkeeping in the
> > > MemoryManager (option 1.2), so that this is a number independent of the
> > > "-XX:MaxDirectMemorySize" parameter.
> > >
> > > (2) We manually allocate and de-allocate the memory for the
> > MemorySegments
> > > (option 2). That way we need not worry about triggering GC by some
> > > threshold or bookkeeping, but it is harder to prevent segfaults. We
> need
> > to
> > > be very careful about when we release the memory segments (only in the
> > > cleanup phase of the main thread).
> > >
> > > If we go with option 1.1, we probably need to set
> > > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> > and
> > > have "direct_memory" as a separate reserved memory pool. Because if 

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-19 Thread Xintong Song
Hi everyone,

As Till suggested, the original "FLIP-53: Fine Grained Resource Management"
splits into two separate FLIPs,

   - FLIP-53: Fine Grained Operator Resource Management [1]
   - FLIP-56: Dynamic Slot Allocation [2]

We'll continue using this discussion thread for FLIP-53. For FLIP-56, I
just started a new discussion thread [3].

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-56-Dynamic-Slot-Allocation-td31960.html

On Mon, Aug 19, 2019 at 2:55 PM Xintong Song  wrote:

> Thanks for the comments, Yang.
>
> Regarding your questions:
>
>    1. How to calculate the resource specification of TaskManagers? Do they
>    have the same resource spec calculated based on the configuration? I
>    think we still have wasted resources in this situation. Or we could
>    start TaskManagers with different specs.
>>
> I agree with you that we can further improve the resource utilization by
> customizing task executors with different resource specifications. However,
> I'm in favor of limiting the scope of this FLIP and leave it as a future
> optimization. The plan for that part is to move the logic of deciding task
> executor specifications into the slot manager and make slot manager
> pluggable, so inside the slot manager plugin we can have different logics
> for deciding the task executor specifications.
>
>
>    2. If a slot is released and returned to SlotPool, could it be reused
>    by another SlotRequest whose requested resource is smaller?
>>
> No, I think slot pool should always return slots if they do not exactly
> match the pending requests, so that resource manager can deal with the
> extra resources.
>
>>   - If it is yes, what happens to the available resource in the
>
>   TaskManager.
>>   - What is the SlotStatus of the cached slot in SlotPool? The
>>   AllocationId is null?
>>
> The allocation id does not change as long as the slot is not returned from
> the job master, no matter its occupied or available in the slot pool. I
> think we have the same behavior currently. No matter how many tasks the job
> master deploys into the slot, concurrently or sequentially, it is one
> allocation from the cluster to the job until the slot is freed from the job
> master.
>
>>3. In a session cluster, some jobs are configured with operator
>>resources, meanwhile other jobs are using UNKNOWN. How to deal with
>> this
>>situation?
>
> As long as we do not mix unknown / specified resource profiles within the
> same job / slot, there shouldn't be a problem. Resource manager converts
> unknown resource profiles in slot requests to specified default resource
> profiles, so they can be dynamically allocated from task executors'
> available resources just as other slot requests with specified resource
> profiles.
>
> Thank you~
>
> Xintong Song
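The normalization step described above — the resource manager replacing UNKNOWN profiles with configured defaults before matching — can be sketched roughly like this (hypothetical types; a single CPU value stands in for a full resource profile, which is not the real ResourceProfile API):

```java
// Hypothetical sketch of how a slot manager might normalize slot requests:
// UNKNOWN profiles are replaced by a configured default before matching
// against a task executor's available resources.
class ProfileNormalizer {

    // Sentinel standing in for ResourceProfile.UNKNOWN.
    static final double UNKNOWN = -1.0;

    /** Replace an unknown requested profile with the configured default
     *  so it can be matched dynamically like any specified request. */
    static double normalize(double requestedCpu, double defaultCpu) {
        return requestedCpu == UNKNOWN ? defaultCpu : requestedCpu;
    }

    /** A request fits if the executor has enough free resources left. */
    static boolean fits(double requestedCpu, double availableCpu) {
        return requestedCpu <= availableCpu;
    }
}
```

After normalization, unknown and specified requests follow the same allocation path, which is why mixing them across jobs in a session cluster is unproblematic.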
>
>
>
> On Mon, Aug 19, 2019 at 11:39 AM Yang Wang  wrote:
>
>> Hi Xintong,
>>
>>
>> Thanks for your detailed proposal. I think many users are suffering from
>> waste of resources. The resource spec of all task managers is the same, so
>> we have to scale up every task manager to make the heaviest one stable. We
>> will benefit from fine grained resource management a lot: better resource
>> utilization and stability.
>>
>>
>> Just to share some thoughts.
>>
>>
>>
>>    1. How to calculate the resource specification of TaskManagers? Do they
>>    have the same resource spec calculated based on the configuration? I
>>    think we still have wasted resources in this situation. Or we could
>>    start TaskManagers with different specs.
>>    2. If a slot is released and returned to SlotPool, could it be reused
>>    by another SlotRequest whose requested resource is smaller?
>>   - If it is yes, what happens to the available resource in the
>>   TaskManager.
>>   - What is the SlotStatus of the cached slot in SlotPool? The
>>   AllocationId is null?
>>3. In a session cluster, some jobs are configured with operator
>>resources, meanwhile other jobs are using UNKNOWN. How to deal with
>> this
>>situation?
>>
>>
>>
>> Best,
>> Yang
>>
>> Xintong Song  于2019年8月16日周五 下午8:57写道:
>>
>> > Thanks for the feedbacks, Yangze and Till.
>> >
>> > Yangze,
>> >
>> > I agree with you that we should make scheduling strategy pluggable and
>> > optimize the strategy to reduce the memory fragmentation problem, and
>> > thanks for the inputs on the potential algorithmic solutions. However,
>> I'm
>> > in favor of keep this FLIP focusing on the overall mechanism design
>> rather
>> > than strategies. Solving the fragmentation issue should be considered
>> as an
>> > optimization, and I agree with Till that we probably should tackle this
>> > 

[DISCUSS] FLIP-56: Dynamic Slot Allocation

2019-08-19 Thread Xintong Song
Hi everyone,

We would like to start a discussion thread on "FLIP-56: Dynamic Slot
Allocation" [1]. This is originally part of the discussion thread for
"FLIP-53: Fine Grained Resource Management" [2]. As Till suggested, we
would like to split the original discussion into two topics and start a
separate new discussion thread as well as FLIP process for this one.

Thank you~

Xintong Song


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-56%3A+Dynamic+Slot+Allocation

[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-53-Fine-Grained-Resource-Management-td31831.html


Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread JingsongLee
Hi stephan:

About option 2:

If additional threads are not cleanly shut down before we exit the task:
in the current case of memory reuse, the task has already freed up the
memory it uses. If this memory is then used by other tasks while
asynchronous threads of the exited task may still be writing to it, there
will be concurrency problems, and even errors in the user's computing
results.

So I think this is a serious and intolerable bug; no matter which option
we choose, it should be avoided.

About direct memory cleaned by GC:
I don't think it is a good idea. I've encountered many situations where
the GC was too late, causing a DirectMemory OOM. Releasing and allocating
DirectMemory depends on the type of user job, which is often beyond our
control.

Best,
Jingsong Lee


--
From:Stephan Ewen 
Send Time:2019年8月19日(星期一) 15:56
To:dev 
Subject:Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

My main concern with option 2 (manually release memory) is that segfaults
in the JVM send off all sorts of alarms on user ends. So we need to
guarantee that this never happens.

The trickyness is in tasks that uses data structures / algorithms with
additional threads, like hash table spill/read and sorting threads. We need
to ensure that these cleanly shut down before we can exit the task.
I am not sure that we have that guaranteed already, that's why option 1.1
seemed simpler to me.
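The clean-shutdown requirement Stephan describes might be sketched as follows (hypothetical classes, not actual Flink code): segments are handed back only in the main thread's cleanup phase, after every spill/sort helper thread has been joined, so no thread can still write into released memory:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task cleanup (not Flink code): the memory backing
// spill/sort helper threads is freed only in the main thread's cleanup
// phase, after every helper thread has been joined, so no thread can
// write into memory that has already been handed back.
class TaskCleanup {

    private final List<Thread> helperThreads = new ArrayList<>();
    private final List<long[]> segments = new ArrayList<>(); // stand-in for MemorySegments
    private boolean released;

    void registerHelper(Thread t) {
        helperThreads.add(t);
        t.start();
    }

    long[] allocateSegment(int size) {
        long[] seg = new long[size];
        segments.add(seg);
        return seg;
    }

    /** Cleanup phase; called from the task's main thread only. */
    void shutdown() throws InterruptedException {
        for (Thread t : helperThreads) {
            t.join(); // ensure no helper can still touch the segments
        }
        segments.clear(); // only now hand the memory back
        released = true;
    }

    boolean isReleased() {
        return released;
    }
}
```

This is the guarantee option 2 depends on; if it cannot be enforced for all spill/read and sorting threads, option 1.1 is indeed the simpler choice.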

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song  wrote:

> Thanks for the comments, Stephan. Summarized in this way really makes
> things easier to understand.
>
> I'm in favor of option 2, at least for the moment. I think it is not that
> difficult to keep it segfault safe for memory manager, as long as we always
> de-allocate the memory segment when it is released from the memory
> consumers. Only if the memory consumer continue using the buffer of memory
> segment after releasing it, in which case we do want the job to fail so we
> detect the memory leak early.
>
> For option 1.2, I don't think this is a good idea. Not only because the
> assumption (regular GC is enough to clean direct buffers) may not always be
> true, but also it makes harder for finding problems in cases of memory
> overuse. E.g., user configured some direct memory for the user libraries.
> If the library actually use more direct memory then configured, which
> cannot be cleaned by GC because they are still in use, may lead to overuse
> of the total container memory. In that case, if it didn't touch the JVM
> default max direct memory limit, we cannot get a direct memory OOM and it
> will become super hard to understand which part of the configuration need
> to be updated.
>
> For option 1.1, it has the similar problem as 1.2, if the exceeded direct
> memory does not reach the max direct memory limit specified by the
> dedicated parameter. I think it is slightly better than 1.2, only because
> we can tune the parameter.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen  wrote:
>
> > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> it a
> > bit differently:
> >
> > We have the following two options:
> >
> > (1) We let MemorySegments be de-allocated by the GC. That makes it
> segfault
> > safe. But then we need a way to trigger GC in case de-allocation and
> > re-allocation of a bunch of segments happens quickly, which is often the
> > case during batch scheduling or task restart.
> >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> >   - Another way could be to have a dedicated bookkeeping in the
> > MemoryManager (option 1.2), so that this is a number independent of the
> > "-XX:MaxDirectMemorySize" parameter.
> >
> > (2) We manually allocate and de-allocate the memory for the
> MemorySegments
> > (option 2). That way we need not worry about triggering GC by some
> > threshold or bookkeeping, but it is harder to prevent segfaults. We need
> to
> > be very careful about when we release the memory segments (only in the
> > cleanup phase of the main thread).
> >
> > If we go with option 1.1, we probably need to set
> > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> and
> > have "direct_memory" as a separate reserved memory pool. Because if we
> just
> > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> jvm_overhead",
> > then there will be times when that entire memory is allocated by direct
> > buffers and we have nothing left for the JVM overhead. So we either need
> a
> > way to compensate for that (again some safety margin cutoff value) or we
> > will exceed container memory.
> >
> > If we go with option 1.2, we need to be aware that it takes elaborate
> logic
> > to push recycling of direct buffers without always triggering a full GC.
> >
> >
> > My first guess is that the options will be easiest to do in the following
> > order:
> >
> >   - Option 1.1 with a dedicated direct_memory 

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Stephan Ewen
My main concern with option 2 (manually releasing memory) is that segfaults
in the JVM set off all sorts of alarms on the user's end. So we need to
guarantee that this never happens.

The trickiness is in tasks that use data structures / algorithms with
additional threads, like hash-table spill/read and sorting threads. We need
to ensure that these shut down cleanly before we can exit the task.
I am not sure that we have that guaranteed already; that's why option 1.1
seemed simpler to me.

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song  wrote:

> Thanks for the comments, Stephan. Summarized in this way really makes
> things easier to understand.
>
> I'm in favor of option 2, at least for the moment. I think it is not that
> difficult to keep it segfault safe for memory manager, as long as we always
> de-allocate the memory segment when it is released from the memory
> consumers. Only if the memory consumer continue using the buffer of memory
> segment after releasing it, in which case we do want the job to fail so we
> detect the memory leak early.
>
> For option 1.2, I don't think this is a good idea. Not only because the
> assumption (regular GC is enough to clean direct buffers) may not always be
> true, but also it makes harder for finding problems in cases of memory
> overuse. E.g., user configured some direct memory for the user libraries.
> If the library actually use more direct memory then configured, which
> cannot be cleaned by GC because they are still in use, may lead to overuse
> of the total container memory. In that case, if it didn't touch the JVM
> default max direct memory limit, we cannot get a direct memory OOM and it
> will become super hard to understand which part of the configuration need
> to be updated.
>
> For option 1.1, it has the similar problem as 1.2, if the exceeded direct
> memory does not reach the max direct memory limit specified by the
> dedicated parameter. I think it is slightly better than 1.2, only because
> we can tune the parameter.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen  wrote:
>
> > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> it a
> > bit differently:
> >
> > We have the following two options:
> >
> > (1) We let MemorySegments be de-allocated by the GC. That makes it
> segfault
> > safe. But then we need a way to trigger GC in case de-allocation and
> > re-allocation of a bunch of segments happens quickly, which is often the
> > case during batch scheduling or task restart.
> >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> >   - Another way could be to have a dedicated bookkeeping in the
> > MemoryManager (option 1.2), so that this is a number independent of the
> > "-XX:MaxDirectMemorySize" parameter.
> >
> > (2) We manually allocate and de-allocate the memory for the
> MemorySegments
> > (option 2). That way we need not worry about triggering GC by some
> > threshold or bookkeeping, but it is harder to prevent segfaults. We need
> to
> > be very careful about when we release the memory segments (only in the
> > cleanup phase of the main thread).
> >
> > If we go with option 1.1, we probably need to set
> > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> and
> > have "direct_memory" as a separate reserved memory pool. Because if we
> just
> > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> jvm_overhead",
> > then there will be times when that entire memory is allocated by direct
> > buffers and we have nothing left for the JVM overhead. So we either need
> a
> > way to compensate for that (again some safety margin cutoff value) or we
> > will exceed container memory.
> >
> > If we go with option 1.2, we need to be aware that it takes elaborate
> logic
> > to push recycling of direct buffers without always triggering a full GC.
> >
> >
> > My first guess is that the options will be easiest to do in the following
> > order:
> >
> >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > above. We would need to find a way to set the direct_memory parameter by
> > default. We could start with 64 MB and see how it goes in practice. One
> > danger I see is that setting this too low can cause a bunch of additional
> > GCs compared to before (we need to watch this carefully).
> >
> >   - Option 2. It is actually quite simple to implement, we could try how
> > segfault safe we are at the moment.
> >
> >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> parameter
> > at all and assume that all the direct memory allocations that the JVM and
> > Netty do are infrequent enough to be cleaned up fast enough through
> regular
> > GC. I am not sure if that is a valid assumption, though.
> >
> > Best,
> > Stephan
> >
> >
> >
> > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song 
> > wrote:
> >
> > > Thanks for sharing your opinion Till.
> > >
> > > I'm also in favor of alternative 2. I was wondering whether we can
> 

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Xintong Song
Thanks for the comments, Stephan. Summarizing it this way really makes
things easier to understand.

I'm in favor of option 2, at least for the moment. I think it is not that
difficult to keep the memory manager segfault safe, as long as we always
de-allocate a memory segment only when it is released by the memory
consumers. The only exception is a consumer that keeps using the buffer of
a memory segment after releasing it, in which case we actually do want the
job to fail so we detect the memory leak early.

For option 1.2, I don't think this is a good idea. Not only may the
assumption (regular GC is enough to clean direct buffers) not always hold,
but it also makes it harder to find problems in cases of memory
overuse. E.g., say the user configured some direct memory for the user
libraries. If a library actually uses more direct memory than configured,
which cannot be cleaned by GC because it is still in use, this may lead to
overuse of the total container memory. In that case, if it does not hit the
JVM default max direct memory limit, we do not get a direct memory OOM and
it becomes very hard to understand which part of the configuration needs
to be updated.

Option 1.1 has a similar problem as 1.2 if the exceeded direct
memory does not reach the max direct memory limit specified by the
dedicated parameter. I think it is slightly better than 1.2, only because
we can tune the parameter.

Thank you~

Xintong Song



On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen  wrote:

> About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize it a
> bit differently:
>
> We have the following two options:
>
> (1) We let MemorySegments be de-allocated by the GC. That makes it segfault
> safe. But then we need a way to trigger GC in case de-allocation and
> re-allocation of a bunch of segments happens quickly, which is often the
> case during batch scheduling or task restart.
>   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
>   - Another way could be to have a dedicated bookkeeping in the
> MemoryManager (option 1.2), so that this is a number independent of the
> "-XX:MaxDirectMemorySize" parameter.
>
> (2) We manually allocate and de-allocate the memory for the MemorySegments
> (option 2). That way we need not worry about triggering GC by some
> threshold or bookkeeping, but it is harder to prevent segfaults. We need to
> be very careful about when we release the memory segments (only in the
> cleanup phase of the main thread).
>
> If we go with option 1.1, we probably need to set
> "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" and
> have "direct_memory" as a separate reserved memory pool. Because if we just
> set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + jvm_overhead",
> then there will be times when that entire memory is allocated by direct
> buffers and we have nothing left for the JVM overhead. So we either need a
> way to compensate for that (again some safety margin cutoff value) or we
> will exceed container memory.
>
> If we go with option 1.2, we need to be aware that it takes elaborate logic
> to push recycling of direct buffers without always triggering a full GC.
>
>
> My first guess is that the options will be easiest to do in the following
> order:
>
>   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> above. We would need to find a way to set the direct_memory parameter by
> default. We could start with 64 MB and see how it goes in practice. One
> danger I see is that setting this too low can cause a bunch of additional
> GCs compared to before (we need to watch this carefully).
>
>   - Option 2. It is actually quite simple to implement, we could try how
> segfault safe we are at the moment.
>
>   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize" parameter
> at all and assume that all the direct memory allocations that the JVM and
> Netty do are infrequent enough to be cleaned up fast enough through regular
> GC. I am not sure if that is a valid assumption, though.
>
> Best,
> Stephan
>
>
>
> On Fri, Aug 16, 2019 at 2:16 PM Xintong Song 
> wrote:
>
> > Thanks for sharing your opinion Till.
> >
> > I'm also in favor of alternative 2. I was wondering whether we can avoid
> > using Unsafe.allocate() for off-heap managed memory and network memory
> with
> > alternative 3. But after giving it a second thought, I think even for
> > alternative 3 using direct memory for off-heap managed memory could cause
> > problems.
> >
> > Hi Yang,
> >
> > Regarding your concern, I think what is proposed in this FLIP is to have
> both
> > off-heap managed memory and network memory allocated through
> > Unsafe.allocate(), which means they are practically native memory and not
> > limited by JVM max direct memory. The only parts of memory limited by JVM
> > max direct memory are task off-heap memory and JVM overhead, which are
> > exactly alternative 2 suggests to set the JVM max direct memory to.
> >
> > Thank you~
> >
> 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Stephan Ewen
Looking at FLINK-13699, it seems to be very local to the Table API and the
HBase connector.
We can cherry-pick that without re-running distributed tests.


On Mon, Aug 19, 2019 at 1:46 PM Till Rohrmann  wrote:

> I've merged the fix for FLINK-13752. Hence we are good to go to create the
> new RC.
>
> Cheers,
> Till
>
> On Mon, Aug 19, 2019 at 1:30 PM Timo Walther  wrote:
>
> > I support Jark's fix for FLINK-13699 because it would be disappointing
> > if both DDL and connectors are ready to handle DATE/TIME/TIMESTAMP but a
> > little component in the middle of the stack is preventing an otherwise
> > usable feature. The changes are minor.
> >
> > Thanks,
> > Timo
> >
> >
> > Am 19.08.19 um 13:24 schrieb Jark Wu:
> > > Hi Gordon,
> > >
> > > I agree that we should pick the minimal set of changes to shorten the
> > > release testing time.
> > > However, I would like to include FLINK-13699 into RC3. FLINK-13699 is a
> > > critical DDL issue, and is a small change to flink table (won't affect
> > the
> > > runtime feature and stability).
> > > I will do some tests around sql and blink planner if the RC3 include
> this
> > > fix.
> > >
> > > But if the community is against including it, I'm also fine with having
> it
> > in
> > > the next minor release.
> > >
> > > Thanks,
> > > Jark
> > >
> > > On Mon, 19 Aug 2019 at 16:16, Stephan Ewen  wrote:
> > >
> > >> +1 for Gordon's approach.
> > >>
> > >> If we do that, we can probably skip re-testing everything and mainly
> > need
> > >> to verify the release artifacts (signatures, build from source, etc.).
> > >>
> > >> If we open the RC up for changes, I fear a lot of small issues will
> > rush in
> > >> and destabilize the candidate again, meaning we have to do another
> > larger
> > >> testing effort.
> > >>
> > >>
> > >>
> > >> On Mon, Aug 19, 2019 at 9:48 AM Becket Qin 
> > wrote:
> > >>
> > >>> Hi Gordon,
> > >>>
> > >>> I remember we mentioned earlier that if there is an additional RC, we
> > can
> > >>> piggyback the GCP PubSub API change (
> > >>> https://issues.apache.org/jira/browse/FLINK-13231). It is a small
> > patch
> > >> to
> > >>> avoid future API change. So should be able to merge it very shortly.
> > >> Would
> > >>> it be possible to include that into RC3 as well?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Jiangjie (Becket) Qin
> > >>>
> > >>> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai <
> > tzuli...@apache.org
> > >>>
> > >>> wrote:
> > >>>
> >  Hi,
> > 
> >  https://issues.apache.org/jira/browse/FLINK-13752 turns out to be
> an
> >  actual
> >  blocker, so we would have to close this RC now in favor of a new
> one.
> > 
> >  Since we are already quite past the planned release time for 1.9.0,
> I
> > >>> would
> >  like to limit the new changes included in RC3 to only the following:
> >  - https://issues.apache.org/jira/browse/FLINK-13752
> >  - Fix license and notice file issues that Kurt had found with
> >  flink-runtime-web and flink-state-processing-api
> > 
> >  This means that I will not be creating RC3 with the release-1.9
> branch
> > >> as
> >  is, but essentially only cherry-picking the above mentioned changes
> on
> > >>> top
> >  of RC2.
> >  The minimal set of changes on top of RC2 should allow us to carry
> most
> > >> if
> >  not all of the already existing votes without another round of
> > >> extensive
> >  testing, and allow us to have a shortened voting time.
> > 
> >  I understand that there are other issues mentioned in this thread
> that
> > >>> are
> >  already spotted and merged to release-1.9, especially for the Blink
> > >>> planner
> >  and DDL, but I suggest not to include them in RC3.
> >  I think it would be better to collect all the remaining issues for
> > >> those
> >  over a period of time, and include them as 1.9.1 which can ideally
> > also
> >  happen a few weeks soon after 1.9.0.
> > 
> >  What do you think? If there are not objections, I would proceed with
> > >> this
> >  plan and push out a new RC by the end of today (Aug. 19th CET).
> > 
> >  Regards,
> >  Gordon
> > 
> >  On Mon, Aug 19, 2019 at 4:09 AM Zili Chen 
> > >> wrote:
> > > We should investigate the performance regression but regardless the
> > > regression I vote +1
> > >
> > > Have verified following things
> > >
> > > - Jobs running on YARN x (Session & Per Job) with high-availability
> > > enabled.
> > > - Simulate JM and TM failures.
> > > - Simulate temporary network partition.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Stephan Ewen  wrote on Sun, Aug 18, 2019 at 10:12 PM:
> > >
> > >> For reference, this is the JIRA issue about the regression in
> > >>> question:
> > >> https://issues.apache.org/jira/browse/FLINK-13752
> > >>
> > >>
> > >> On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
> >  wrote:
> > >>> Hi, till
> > >>> I can 

Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-19 Thread Xintong Song
Thanks for the comments, Yang.

Regarding your questions:

   1. How to calculate the resource specification of TaskManagers? Do they
>have them same resource spec calculated based on the configuration? I
> think
>we still have wasted resources in this situation. Or we could start
>TaskManagers with different spec.
>
I agree with you that we can further improve resource utilization by
customizing task executors with different resource specifications. However,
I'm in favor of limiting the scope of this FLIP and leaving that as a future
optimization. The plan for that part is to move the logic for deciding task
executor specifications into the slot manager and make the slot manager
pluggable, so that different slot manager plugins can implement different
policies for deciding the task executor specifications.
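
A hypothetical sketch of that pluggable design (all names are illustrative, not Flink's API): the decision about a task executor's resource specification sits behind an interface, so the current uniform behavior and a future request-fitted behavior are just two interchangeable plugins.

```java
/**
 * Hypothetical sketch of the pluggable slot-manager idea (names are
 * illustrative, not Flink's API): the strategy that decides a task
 * executor's resource specification is an interface, so a uniform default
 * can later be swapped for one that sizes each TaskManager to the request.
 */
public class TaskExecutorSpecs {

    /** A simplified resource specification. */
    public static final class Spec {
        public final int cpuCores;
        public final int memoryMb;

        public Spec(int cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }
    }

    /** Pluggable strategy: given a pending request, pick a TM spec. */
    public interface SpecStrategy {
        Spec specFor(int requestedCpu, int requestedMemoryMb);
    }

    /** Today's behavior: every task executor gets the same configured spec. */
    public static SpecStrategy uniform(Spec configured) {
        return (cpu, mem) -> configured;
    }

    /** A possible future plugin: size the TM to the request (less waste). */
    public static SpecStrategy fitted() {
        return Spec::new;
    }
}
```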


>2. If a slot is released and returned to SlotPool, does it could be
>reused by other SlotRequest that the request resource is smaller than
> it?
>
No. I think the slot pool should always return slots that do not exactly
match the pending requests, so that the resource manager can deal with the
extra resources.

>   - If it is yes, what happens to the available resource in the

  TaskManager.
>   - What is the SlotStatus of the cached slot in SlotPool? The
>   AllocationId is null?
>
The allocation id does not change as long as the slot is not returned by
the job master, no matter whether it is occupied or available in the slot
pool. I think we have the same behavior currently. No matter how many tasks
the job master deploys into the slot, concurrently or sequentially, it
counts as one allocation from the cluster to the job until the slot is
freed by the job master.

>3. In a session cluster, some jobs are configured with operator
>resources, meanwhile other jobs are using UNKNOWN. How to deal with this
>situation?

As long as we do not mix unknown and specified resource profiles within the
same job / slot, there shouldn't be a problem. The resource manager converts
unknown resource profiles in slot requests into the specified default
resource profiles, so they can be dynamically allocated from task executors'
available resources just like any other slot request with a specified
resource profile.
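
That conversion step can be sketched as follows (names are illustrative, not Flink's API): on the resource-manager side, a request carrying an UNKNOWN profile is rewritten to the configured default before any matching against task executors' available resources, so unknown and specified requests share the same allocation path.

```java
/**
 * Sketch of the resource-manager-side conversion (illustrative names, not
 * Flink's API): slot requests with an UNKNOWN resource profile are rewritten
 * to the configured default profile, so that all requests can be matched
 * against task executors' available resources uniformly.
 */
public class SlotRequestNormalizer {
    /** Marker for an unspecified (UNKNOWN) resource profile. */
    public static final int UNKNOWN_MB = -1;

    private final int defaultSlotMb;

    public SlotRequestNormalizer(int defaultSlotMb) {
        this.defaultSlotMb = defaultSlotMb;
    }

    /** Convert an UNKNOWN request to the default; pass specified ones through. */
    public int normalize(int requestedMb) {
        return requestedMb == UNKNOWN_MB ? defaultSlotMb : requestedMb;
    }
}
```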

Thank you~

Xintong Song



On Mon, Aug 19, 2019 at 11:39 AM Yang Wang  wrote:

> Hi Xintong,
>
>
> Thanks for your detailed proposal. I think many users are suffering from
> waste of resources. The resource spec of all task managers are same and we
> have to increase all task managers to make the heavy one more stable. So we
> will benefit from the fine grained resource management a lot. We could get
> better resource utilization and stability.
>
>
> Just to share some thoughts.
>
>
>
>1. How to calculate the resource specification of TaskManagers? Do they
>have them same resource spec calculated based on the configuration? I
> think
>we still have wasted resources in this situation. Or we could start
>TaskManagers with different spec.
>2. If a slot is released and returned to SlotPool, does it could be
>reused by other SlotRequest that the request resource is smaller than
> it?
>   - If it is yes, what happens to the available resource in the
>   TaskManager.
>   - What is the SlotStatus of the cached slot in SlotPool? The
>   AllocationId is null?
>3. In a session cluster, some jobs are configured with operator
>resources, meanwhile other jobs are using UNKNOWN. How to deal with this
>situation?
>
>
>
> Best,
> Yang
>
> Xintong Song  wrote on Fri, Aug 16, 2019 at 8:57 PM:
>
> > Thanks for the feedbacks, Yangze and Till.
> >
> > Yangze,
> >
> > I agree with you that we should make scheduling strategy pluggable and
> > optimize the strategy to reduce the memory fragmentation problem, and
> > thanks for the inputs on the potential algorithmic solutions. However,
> I'm
> > in favor of keep this FLIP focusing on the overall mechanism design
> rather
> > than strategies. Solving the fragmentation issue should be considered as
> an
> > optimization, and I agree with Till that we probably should tackle this
> > afterwards.
> >
> > Till,
> >
> > - Regarding splitting the FLIP, I think it makes sense. The operator
> > resource management and dynamic slot allocation do not have much
> dependency
> > on each other.
> >
> > - Regarding the default slot size, I think this is similar to FLIP-49 [1]
> > where we want all the deriving happens at one place. I think it would be
> > nice to pass the default slot size into the task executor in the same way
> > that we pass in the memory pool sizes in FLIP-49 [1].
> >
> > - Regarding the return value of TaskExecutorGateway#requestResource, I
> > think you're right. We should avoid using null as the return value. I
> think
> > we probably should thrown an exception here.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >
> 

Re: [DISCUSS] FLIP-49: Unified Memory Configuration for TaskExecutors

2019-08-19 Thread Stephan Ewen
About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize it a
bit differently:

We have the following two options:

(1) We let MemorySegments be de-allocated by the GC. That makes it segfault
safe. But then we need a way to trigger GC in case de-allocation and
re-allocation of a bunch of segments happens quickly, which is often the
case during batch scheduling or task restart.
  - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
  - Another way could be to have a dedicated bookkeeping in the
MemoryManager (option 1.2), so that this is a number independent of the
"-XX:MaxDirectMemorySize" parameter.

(2) We manually allocate and de-allocate the memory for the MemorySegments
(option 2). That way we need not worry about triggering GC by some
threshold or bookkeeping, but it is harder to prevent segfaults. We need to
be very careful about when we release the memory segments (only in the
cleanup phase of the main thread).
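
A fail-fast flavor of option 2 can be sketched like this (native allocation is simulated here with a map; real code would use Unsafe.allocateMemory / Unsafe.freeMemory, and the names are illustrative): once a segment is freed, any further access throws a Java exception immediately instead of risking a segfault, which also surfaces use-after-release bugs early.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of option 2 (manual allocation) with fail-fast release checking.
 * Native allocation is simulated via a map of fake addresses; the real
 * implementation would call Unsafe.allocateMemory / Unsafe.freeMemory.
 * Accessing a freed segment throws instead of segfaulting.
 */
public class ManagedSegment {
    private static final Map<Long, byte[]> FAKE_NATIVE = new HashMap<>();
    private static long nextAddress = 1;

    private long address; // 0 means released

    public ManagedSegment(int size) {
        address = nextAddress++;
        FAKE_NATIVE.put(address, new byte[size]); // stands in for allocateMemory
    }

    public void put(int index, byte value) {
        checkNotReleased();
        FAKE_NATIVE.get(address)[index] = value;
    }

    public byte get(int index) {
        checkNotReleased();
        return FAKE_NATIVE.get(address)[index];
    }

    /** Must only be called from the task's cleanup phase. */
    public void free() {
        checkNotReleased();
        FAKE_NATIVE.remove(address); // stands in for freeMemory
        address = 0;
    }

    private void checkNotReleased() {
        if (address == 0) {
            throw new IllegalStateException("segment already released");
        }
    }
}
```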

If we go with option 1.1, we probably need to set
"-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory" and
have "direct_memory" as a separate reserved memory pool. Because if we just
set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + jvm_overhead",
then there will be times when that entire memory is allocated by direct
buffers and we have nothing left for the JVM overhead. So we either need a
way to compensate for that (again some safety margin cutoff value) or we
will exceed container memory.

If we go with option 1.2, we need to be aware that it takes elaborate logic
to push recycling of direct buffers without always triggering a full GC.
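
The bookkeeping idea of option 1.2 can be sketched like this (illustrative only; real logic would hook into reference processing rather than call System.gc() directly): the memory manager tracks released-but-not-yet-collected bytes itself, independent of -XX:MaxDirectMemorySize, and proactively triggers a GC when a new allocation would otherwise exceed the budget.

```java
/**
 * Sketch of option 1.2 (illustrative only): the MemoryManager keeps its own
 * counter of GC-reclaimable segment bytes, independent of the
 * -XX:MaxDirectMemorySize threshold, and triggers a GC itself when quick
 * release/re-allocate cycles would otherwise exhaust the budget.
 */
public class BookkeepingMemoryManager {
    private final long budgetBytes;
    private long inUseBytes;     // segments handed to consumers
    private long pendingGcBytes; // released, but native memory not yet freed

    public BookkeepingMemoryManager(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    public boolean allocate(long bytes) {
        if (inUseBytes + pendingGcBytes + bytes > budgetBytes) {
            // Option 1.2's crux: trigger a GC ourselves instead of relying
            // on the -XX:MaxDirectMemorySize threshold to do it.
            System.gc();
            onGcFinished();
            if (inUseBytes + bytes > budgetBytes) {
                return false;
            }
        }
        inUseBytes += bytes;
        return true;
    }

    public void release(long bytes) {
        inUseBytes -= bytes;
        pendingGcBytes += bytes; // native memory freed only after GC runs
    }

    /** In reality driven by reference processing; simplified here. */
    void onGcFinished() {
        pendingGcBytes = 0;
    }

    public long pendingGcBytes() {
        return pendingGcBytes;
    }
}
```

The hard part the email points out is exactly the `System.gc()` line: reclaiming direct buffers promptly without forcing a full GC every time takes elaborate logic.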


My first guess is that the options will be easiest to do in the following
order:

  - Option 1.1 with a dedicated direct_memory parameter, as discussed
above. We would need to find a way to set the direct_memory parameter by
default. We could start with 64 MB and see how it goes in practice. One
danger I see is that setting this too low can cause a bunch of additional
GCs compared to before (we need to watch this carefully).

  - Option 2. It is actually quite simple to implement, we could try how
segfault safe we are at the moment.

  - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize" parameter
at all and assume that all the direct memory allocations that the JVM and
Netty do are infrequent enough to be cleaned up fast enough through regular
GC. I am not sure if that is a valid assumption, though.

Best,
Stephan



On Fri, Aug 16, 2019 at 2:16 PM Xintong Song  wrote:

> Thanks for sharing your opinion Till.
>
> I'm also in favor of alternative 2. I was wondering whether we can avoid
> using Unsafe.allocate() for off-heap managed memory and network memory with
> alternative 3. But after giving it a second thought, I think even for
> alternative 3 using direct memory for off-heap managed memory could cause
> problems.
>
> Hi Yang,
>
> Regarding your concern, I think what is proposed in this FLIP is to have both
> off-heap managed memory and network memory allocated through
> Unsafe.allocate(), which means they are practically native memory and not
> limited by JVM max direct memory. The only parts of memory limited by JVM
> max direct memory are task off-heap memory and JVM overhead, which are
> exactly alternative 2 suggests to set the JVM max direct memory to.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann 
> wrote:
>
> > Thanks for the clarification Xintong. I understand the two alternatives
> > now.
> >
> > I would be in favour of option 2 because it makes things explicit. If we
> > don't limit the direct memory, I fear that we might end up in a similar
> > situation as we are currently in: The user might see that her process
> gets
> > killed by the OS and does not know why this is the case. Consequently,
> she
> > tries to decrease the process memory size (similar to increasing the
> cutoff
> > ratio) in order to accommodate for the extra direct memory. Even worse,
> she
> > tries to decrease memory budgets which are not fully used and hence won't
> > change the overall memory consumption.
> >
> > Cheers,
> > Till
> >
> > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song 
> > wrote:
> >
> > > Let me explain this with a concrete example Till.
> > >
> > > Let's say we have the following scenario.
> > >
> > > Total Process Memory: 1GB
> > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> and
> > > Network Memory): 800MB
> > >
> > >
> > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> value,
> > > let's say 1TB.
> > >
> > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > Overhead
> > > do not exceed 200MB, then alternative 2 and alternative 3 should have
> the
> > > same utility. Setting larger -XX:MaxDirectMemorySize will not reduce
> the
> > > sizes of 
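
The arithmetic in the quoted example can be captured in a small helper (a sketch; the parameter names are assumptions, not Flink configuration keys). Under alternative 2 the JVM direct-memory limit is set exactly to the direct-memory budget; everything else shares the remaining process memory.

```java
/**
 * Sketch of the memory budgeting arithmetic from the quoted example
 * (parameter names are assumptions, not Flink configuration keys).
 */
public class MemoryBudget {
    /** Alternative 2: -XX:MaxDirectMemorySize = task off-heap + JVM overhead. */
    public static int maxDirectMemoryMb(int taskOffHeapMb, int jvmOverheadMb) {
        return taskOffHeapMb + jvmOverheadMb;
    }

    /** Heap + metaspace + off-heap managed + network memory share the rest. */
    public static int otherMemoryMb(int totalProcessMb, int taskOffHeapMb, int jvmOverheadMb) {
        return totalProcessMb - maxDirectMemoryMb(taskOffHeapMb, jvmOverheadMb);
    }

    /** The flag one would pass to the JVM under alternative 2. */
    public static String jvmFlag(int taskOffHeapMb, int jvmOverheadMb) {
        return "-XX:MaxDirectMemorySize=" + maxDirectMemoryMb(taskOffHeapMb, jvmOverheadMb) + "m";
    }
}
```

With the example's numbers (a 200 MB direct budget out of a 1 GB process), alternative 2 pins the limit at 200 MB, while alternative 3 would set it to an effectively unbounded value.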

Re: [VOTE] FLIP-50: Spill-able Heap State Backend

2019-08-19 Thread Tzu-Li (Gordon) Tai
+1

On Sun, Aug 18, 2019 at 4:30 PM Stephan Ewen  wrote:

> +1
>
> On Sun, Aug 18, 2019 at 3:31 PM Till Rohrmann 
> wrote:
>
> > +1
> >
> > On Fri, Aug 16, 2019 at 4:54 PM Yu Li  wrote:
> >
> > > Hi All,
> > >
> > > Since we have reached a consensus in the discussion thread [1], I'd
> like
> > to
> > > start the voting for FLIP-50 [2].
> > >
> > > This vote will be open for at least 72 hours. Unless objection I will
> try
> > > to close it by end of Tuesday August 20, 2019 if we have sufficient
> > votes.
> > > Thanks.
> > >
> > > [1] https://s.apache.org/cq358
> > > [2]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend
> > >
> > > Best Regards,
> > > Yu
> > >
> >
>


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Till Rohrmann
I've merged the fix for FLINK-13752. Hence we are good to go ahead and
create the new RC.

Cheers,
Till

On Mon, Aug 19, 2019 at 1:30 PM Timo Walther  wrote:

> I support Jark's fix for FLINK-13699 because it would be disappointing
> if both DDL and connectors are ready to handle DATE/TIME/TIMESTAMP but a
> little component in the middle of the stack is preventing an otherwise
> usable feature. The changes are minor.
>
> Thanks,
> Timo
>
>
> Am 19.08.19 um 13:24 schrieb Jark Wu:
> > Hi Gordon,
> >
> > I agree that we should pick the minimal set of changes to shorten the
> > release testing time.
> > However, I would like to include FLINK-13699 into RC3. FLINK-13699 is a
> > critical DDL issue, and is a small change to flink table (won't affect
> the
> > runtime feature and stability).
> > I will do some tests around sql and blink planner if the RC3 include this
> > fix.
> >
> > But if the community is against including it, I'm also fine with having it
> in
> > the next minor release.
> >
> > Thanks,
> > Jark
> >
> > On Mon, 19 Aug 2019 at 16:16, Stephan Ewen  wrote:
> >
> >> +1 for Gordon's approach.
> >>
> >> If we do that, we can probably skip re-testing everything and mainly
> need
> >> to verify the release artifacts (signatures, build from source, etc.).
> >>
> >> If we open the RC up for changes, I fear a lot of small issues will
> rush in
> >> and destabilize the candidate again, meaning we have to do another
> larger
> >> testing effort.
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 9:48 AM Becket Qin 
> wrote:
> >>
> >>> Hi Gordon,
> >>>
> >>> I remember we mentioned earlier that if there is an additional RC, we
> can
> >>> piggyback the GCP PubSub API change (
> >>> https://issues.apache.org/jira/browse/FLINK-13231). It is a small
> patch
> >> to
> >>> avoid future API change. So should be able to merge it very shortly.
> >> Would
> >>> it be possible to include that into RC3 as well?
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai <
> tzuli...@apache.org
> >>>
> >>> wrote:
> >>>
>  Hi,
> 
>  https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
>  actual
>  blocker, so we would have to close this RC now in favor of a new one.
> 
>  Since we are already quite past the planned release time for 1.9.0, I
> >>> would
>  like to limit the new changes included in RC3 to only the following:
>  - https://issues.apache.org/jira/browse/FLINK-13752
>  - Fix license and notice file issues that Kurt had found with
>  flink-runtime-web and flink-state-processing-api
> 
>  This means that I will not be creating RC3 with the release-1.9 branch
> >> as
>  is, but essentially only cherry-picking the above mentioned changes on
> >>> top
>  of RC2.
>  The minimal set of changes on top of RC2 should allow us to carry most
> >> if
>  not all of the already existing votes without another round of
> >> extensive
>  testing, and allow us to have a shortened voting time.
> 
>  I understand that there are other issues mentioned in this thread that
> >>> are
>  already spotted and merged to release-1.9, especially for the Blink
> >>> planner
>  and DDL, but I suggest not to include them in RC3.
>  I think it would be better to collect all the remaining issues for
> >> those
>  over a period of time, and include them as 1.9.1 which can ideally
> also
>  happen a few weeks soon after 1.9.0.
> 
>  What do you think? If there are not objections, I would proceed with
> >> this
>  plan and push out a new RC by the end of today (Aug. 19th CET).
> 
>  Regards,
>  Gordon
> 
>  On Mon, Aug 19, 2019 at 4:09 AM Zili Chen 
> >> wrote:
> > We should investigate the performance regression, but regardless of the
> > regression I vote +1.
> >
> > I have verified the following things:
> >
> > - Jobs running on YARN x (Session & Per Job) with high-availability
> > enabled.
> > - Simulate JM and TM failures.
> > - Simulate temporary network partition.
> >
> > Best,
> > tison.
> >
> >
> > Stephan Ewen wrote on Sun, Aug 18, 2019 at 10:12 PM:
> >
> >> For reference, this is the JIRA issue about the regression in
> >>> question:
> >> https://issues.apache.org/jira/browse/FLINK-13752
> >>
> >>
> >> On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
>  wrote:
> >>> Hi Till,
> >>> I can send the job to you offline.
> >>> It is just a datastream job and does not use
> >> TwoInputSelectableStreamTask.
> >>> A->B
> >>>     \
> >>>      C
> >>>     /
> >>> D->E
> >>> Best,
> >>> Guowei
> >>>
> >>>
> >>> Till Rohrmann wrote on Fri, Aug 16, 2019 at 4:34 PM:
> >>>
>  Thanks for reporting this issue Guowei. Could you share a bit
> >>> more
> >>> details
>  what the job exactly does and which operators it uses? Does 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Timo Walther
I support Jark's fix for FLINK-13699 because it would be disappointing 
if both DDL and connectors are ready to handle DATE/TIME/TIMESTAMP but a 
little component in the middle of the stack is preventing an otherwise 
usable feature. The changes are minor.


Thanks,
Timo


On 19.08.19 at 13:24, Jark Wu wrote:

Hi Gordon,

I agree that we should pick the minimal set of changes to shorten the
release testing time.
However, I would like to include FLINK-13699 into RC3. FLINK-13699 is a
critical DDL issue, and it is a small change to flink-table (it won't affect
runtime features or stability).
I will do some tests around SQL and the Blink planner if RC3 includes this
fix.

But if the community is against including it, I'm also fine with having it in
the next minor release.

Thanks,
Jark

On Mon, 19 Aug 2019 at 16:16, Stephan Ewen  wrote:


+1 for Gordon's approach.

If we do that, we can probably skip re-testing everything and mainly need
to verify the release artifacts (signatures, build from source, etc.).

If we open the RC up for changes, I fear a lot of small issues will rush in
and destabilize the candidate again, meaning we have to do another larger
testing effort.



On Mon, Aug 19, 2019 at 9:48 AM Becket Qin  wrote:


Hi Gordon,

I remember we mentioned earlier that if there is an additional RC, we can
piggyback the GCP PubSub API change (
https://issues.apache.org/jira/browse/FLINK-13231). It is a small patch

to

avoid future API change. So we should be able to merge it very shortly.

Would

it be possible to include that into RC3 as well?

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai 
Hi,

https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
actual
blocker, so we would have to close this RC now in favor of a new one.

Since we are already quite past the planned release time for 1.9.0, I

would

like to limit the new changes included in RC3 to only the following:
- https://issues.apache.org/jira/browse/FLINK-13752
- Fix license and notice file issues that Kurt had found with
flink-runtime-web and flink-state-processing-api

This means that I will not be creating RC3 with the release-1.9 branch

as

is, but essentially only cherry-picking the above mentioned changes on

top

of RC2.
The minimal set of changes on top of RC2 should allow us to carry most

if

not all of the already existing votes without another round of

extensive

testing, and allow us to have a shortened voting time.

I understand that there are other issues mentioned in this thread that

are

already spotted and merged to release-1.9, especially for the Blink

planner

and DDL, but I suggest not to include them in RC3.
I think it would be better to collect all the remaining issues for

those

over a period of time, and include them as 1.9.1 which can ideally also
happen a few weeks soon after 1.9.0.

What do you think? If there are no objections, I would proceed with

this

plan and push out a new RC by the end of today (Aug. 19th CET).

Regards,
Gordon

On Mon, Aug 19, 2019 at 4:09 AM Zili Chen 

wrote:

We should investigate the performance regression, but regardless of the
regression I vote +1.

I have verified the following things:

- Jobs running on YARN x (Session & Per Job) with high-availability
enabled.
- Simulate JM and TM failures.
- Simulate temporary network partition.

Best,
tison.


Stephan Ewen wrote on Sun, Aug 18, 2019 at 10:12 PM:


For reference, this is the JIRA issue about the regression in

question:

https://issues.apache.org/jira/browse/FLINK-13752


On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 

wrote:

Hi Till,
I can send the job to you offline.
It is just a datastream job and does not use

TwoInputSelectableStreamTask.

A->B
    \
     C
    /
D->E
Best,
Guowei


Till Rohrmann wrote on Fri, Aug 16, 2019 at 4:34 PM:


Thanks for reporting this issue Guowei. Could you share a bit

more

details

what the job exactly does and which operators it uses? Does the

job

uses

the new `TwoInputSelectableStreamTask` which might cause the

performance

regression?

I think it is important to understand where the problem comes

from

before

we proceed with the release.

Cheers,
Till

On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma <

guowei@gmail.com

wrote:

Hi,
-1
We have a benchmark job, which includes a two-input operator.
This job has a big performance regression using 1.9 compared

to

1.8.

It's still not very clear why this regression happens.

Best,
Guowei


Yu Li wrote on Fri, Aug 16, 2019 at 3:27 PM:


+1 (non-binding)

- checked release notes: OK
- checked sums and signatures: OK
- source release
  - contains no binaries: OK
  - contains no 1.9-SNAPSHOT references: OK
  - build from source: OK (8u102)
  - mvn clean verify: OK (8u102)
- binary release
  - no examples appear to be missing
  - started a cluster; WebUI reachable, example ran

successfully

- repository appears to contain all expected artifacts

Best Regards,
Yu


On Fri, 16 Aug 2019 at 06:06, Bowen Li <


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Jark Wu
Hi Gordon,

I agree that we should pick the minimal set of changes to shorten the
release testing time.
However, I would like to include FLINK-13699 into RC3. FLINK-13699 is a
critical DDL issue, and it is a small change to flink-table (it won't affect
runtime features or stability).
I will do some tests around SQL and the Blink planner if RC3 includes this
fix.

But if the community is against including it, I'm also fine with having it in
the next minor release.

Thanks,
Jark

On Mon, 19 Aug 2019 at 16:16, Stephan Ewen  wrote:

> +1 for Gordon's approach.
>
> If we do that, we can probably skip re-testing everything and mainly need
> to verify the release artifacts (signatures, build from source, etc.).
>
> If we open the RC up for changes, I fear a lot of small issues will rush in
> and destabilize the candidate again, meaning we have to do another larger
> testing effort.
>
>
>
> On Mon, Aug 19, 2019 at 9:48 AM Becket Qin  wrote:
>
> > Hi Gordon,
> >
> > I remember we mentioned earlier that if there is an additional RC, we can
> > piggyback the GCP PubSub API change (
> > https://issues.apache.org/jira/browse/FLINK-13231). It is a small patch
> to
> > avoid future API change. So we should be able to merge it very shortly.
> Would
> > it be possible to include that into RC3 as well?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai  >
> > wrote:
> >
> > > Hi,
> > >
> > > https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
> > > actual
> > > blocker, so we would have to close this RC now in favor of a new one.
> > >
> > > Since we are already quite past the planned release time for 1.9.0, I
> > would
> > > like to limit the new changes included in RC3 to only the following:
> > > - https://issues.apache.org/jira/browse/FLINK-13752
> > > - Fix license and notice file issues that Kurt had found with
> > > flink-runtime-web and flink-state-processing-api
> > >
> > > This means that I will not be creating RC3 with the release-1.9 branch
> as
> > > is, but essentially only cherry-picking the above mentioned changes on
> > top
> > > of RC2.
> > > The minimal set of changes on top of RC2 should allow us to carry most
> if
> > > not all of the already existing votes without another round of
> extensive
> > > testing, and allow us to have a shortened voting time.
> > >
> > > I understand that there are other issues mentioned in this thread that
> > are
> > > already spotted and merged to release-1.9, especially for the Blink
> > planner
> > > and DDL, but I suggest not to include them in RC3.
> > > I think it would be better to collect all the remaining issues for
> those
> > > over a period of time, and include them as 1.9.1 which can ideally also
> > > happen a few weeks soon after 1.9.0.
> > >
> > > What do you think? If there are no objections, I would proceed with
> this
> > > plan and push out a new RC by the end of today (Aug. 19th CET).
> > >
> > > Regards,
> > > Gordon
> > >
> > > On Mon, Aug 19, 2019 at 4:09 AM Zili Chen 
> wrote:
> > >
> > > > We should investigate the performance regression, but regardless of the
> > > > regression I vote +1.
> > > >
> > > > I have verified the following things:
> > > >
> > > > - Jobs running on YARN x (Session & Per Job) with high-availability
> > > > enabled.
> > > > - Simulate JM and TM failures.
> > > > - Simulate temporary network partition.
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > Stephan Ewen wrote on Sun, Aug 18, 2019 at 10:12 PM:
> > > >
> > > > > For reference, this is the JIRA issue about the regression in
> > question:
> > > > >
> > > > > https://issues.apache.org/jira/browse/FLINK-13752
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
> > > wrote:
> > > > >
> > > > > > Hi Till,
> > > > > > I can send the job to you offline.
> > > > > > It is just a datastream job and does not use
> > > > > TwoInputSelectableStreamTask.
> > > > > > A->B
> > > > > >  \
> > > > > >C
> > > > > >  /
> > > > > > D->E
> > > > > > Best,
> > > > > > Guowei
> > > > > >
> > > > > >
> > > > > > Till Rohrmann wrote on Fri, Aug 16, 2019 at 4:34 PM:
> > > > > >
> > > > > > > Thanks for reporting this issue Guowei. Could you share a bit
> > more
> > > > > > details
> > > > > > > what the job exactly does and which operators it uses? Does the
> > job
> > > > > uses
> > > > > > > the new `TwoInputSelectableStreamTask` which might cause the
> > > > > performance
> > > > > > > regression?
> > > > > > >
> > > > > > > I think it is important to understand where the problem comes
> > from
> > > > > before
> > > > > > > we proceed with the release.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma <
> guowei@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > -1
> > > > > > > > We have a benchmark job, which includes a two-input operator.
> > > > > > > > This job has a big performance 

Re: [DISCUSS] FLIP-54: Evolve ConfigOption and Configuration

2019-08-19 Thread Timo Walther

Hi Stephan,

thanks for your suggestions. Let me give you some background about the 
decisions made in this FLIP:


1. Goal: The FLIP is labelled "evolve" not "rework" because we did not 
want to change the entire configuration infrastructure. Both for 
backwards-compatibility reasons and the amount of work that would be 
required to update all options. If our goal is to rework the 
configuration options entirely, I would suggest switching to a JSON format
with JSON schema and JSON validator. However, setting properties in a 
CLI or web interface becomes more tricky the more nested structures are 
allowed.


2. Class-based Options: The current ConfigOption<T> class is centered
around Java classes, where T is determined by the default value. The FLIP
just makes this more explicit by offering an explicit `intType()` method 
etc. The current design of validators centered around Java classes makes 
it possible to have typical domain validators backed by generics as you
suggested. If we introduce types such as "quantity with measure and
unit", we still need to get a class out of this option at the end, so why
change a proven concept?
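To make the class-based option idea concrete, here is a minimal runnable sketch of an option whose Java type is fixed by an explicit `intType()` step (instead of being inferred from the default value) and which carries a class-bound domain validator. All names (`ConfigOption`, `OptionBuilder`, `intType`) are illustrative stand-ins for this discussion, not Flink's actual API:

```java
import java.util.function.Predicate;

public class TypedOptionSketch {

    // An option whose type T is fixed explicitly, together with a validator.
    static final class ConfigOption<T> {
        final String key;
        final Class<T> clazz;
        final T defaultValue;
        final Predicate<T> validator;

        ConfigOption(String key, Class<T> clazz, T defaultValue, Predicate<T> validator) {
            this.key = key;
            this.clazz = clazz;
            this.defaultValue = defaultValue;
            this.validator = validator;
        }

        boolean isValid(T value) {
            return validator.test(value);
        }
    }

    static final class OptionBuilder {
        private final String key;

        OptionBuilder(String key) { this.key = key; }

        // The explicit type step: no guessing the class from the default value.
        ConfigOption<Integer> intType(int defaultValue, Predicate<Integer> validator) {
            return new ConfigOption<>(key, Integer.class, defaultValue, validator);
        }
    }

    public static void main(String[] args) {
        ConfigOption<Integer> parallelism =
            new OptionBuilder("parallelism.default")
                .intType(1, v -> v > 0); // domain validator: must be positive
        System.out.println(parallelism.isValid(4));   // true
        System.out.println(parallelism.isValid(-1));  // false
    }
}
```

The validator is bound to the option's generic parameter, which is the point made above: typed domain validation follows naturally from the class-based design.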


3. List Options: The `isList` flag prevents having arbitrary nesting. As
Dawid mentioned, we kept human readability in mind. Every atomic
option like "key=12" can be represented as a list "keys=12;13". But we
don't want to go further; esp. no nesting. A dedicated list option would
start making this more complicated such as 
"ListOption(ObjectOption(ListOption(IntOption, ...), 
StringOption(...)))", do we want that?
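The flat, human-readable list representation described above ("keys=12;13", one level of structure, no nesting) can be sketched in a few lines. `parseIntList` is a hypothetical helper for illustration, not Flink code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ListOptionSketch {

    // Parse a semicolon-separated string like "12;13" into a typed list.
    // Exactly one level of structure: atoms separated by ';', nothing nested.
    static List<Integer> parseIntList(String raw) {
        return Arrays.stream(raw.split(";"))
                     .map(String::trim)
                     .map(Integer::valueOf)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(parseIntList("12;13")); // [12, 13]
    }
}
```

The value stays trivially editable by hand in a config file, which is the human-readability constraint the thread keeps coming back to.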


4. Correlation: The correlation part is one of the suggestions that I 
like least in the document. We can also discuss removing it entirely, 
but I think it solves the use case of relating options with each other 
in a flexible way right next to the actual option. Instead of being 
hidden in some component initialization, we should put it close to the 
option to also perform validation eagerly instead of failing at runtime 
when the option is accessed the first time.


Regards,
Timo


On 18.08.19 at 23:32, Stephan Ewen wrote:

A "List Type" sounds like a good direction to me.

The comment on the type system was a bit brief, I agree. The idea is to see
if something like that can ease validation. Especially the correlation
system seems quite complex (proxies to work around order of initialization).

For example, let's assume we don't think primarily about "java types" but
would define types as one of the following (just examples, haven't thought
all the details through):

   (a) category type: implies string, and a fixed set of possible values.
Those would be part of the option definition and naturally make it into the
docs and validation. Maps to a String or Enum in Java.

   (b) numeric integer type: implies long (or optionally integer, if we want
to automatically check overflow / underflow). Would take typical domain
validators, like non-negative, etc.

   (c) numeric real type: same as above (double or float)

   (d) numeric interval type: either defined as an interval, or references
another parameter by key. Validation by valid interval.

   (e) quantity: a measure and a unit, separately parsable. The measure's
type could be any of the numeric types above, with same validation rules.
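A "quantity" type as in (e) could parse the measure and the unit separately and then validate the measure with the same numeric rules as (b). A hypothetical sketch (`parseBytes` and its unit table are illustrative, not a real Flink utility):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantitySketch {

    // A quantity is a numeric measure followed by a unit, e.g. "128 mb".
    private static final Pattern QUANTITY = Pattern.compile("(\\d+)\\s*([a-zA-Z]+)");

    static long parseBytes(String raw) {
        Matcher m = QUANTITY.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a quantity: " + raw);
        }
        // The measure is validated like any numeric integer type.
        long measure = Long.parseLong(m.group(1));
        // The unit is validated against a fixed set, like a category type.
        switch (m.group(2).toLowerCase()) {
            case "kb": return measure * 1024L;
            case "mb": return measure * 1024L * 1024L;
            default:
                throw new IllegalArgumentException("unknown unit: " + m.group(2));
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBytes("128 mb")); // 134217728
    }
}
```

Note how the quantity type composes the simpler types: its measure reuses the numeric validation and its unit reuses the fixed-value-set validation.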

With a system like the above, would we still need correlation validators? Are
there still cases that we need to catch early (config loading) or are the
remaining cases sufficiently rare and runtime or setup specific, that it is
fine to handle them in component initialization?


On Sun, Aug 18, 2019 at 6:36 PM Dawid Wysakowicz 
wrote:


Hi Stephan,

Thank you for your opinion.

Actually, list/composite types are the topic we spent most of the
time on. I understand that from the perspective of a full-blown type system,
a field like isList may look weird. Please let me elaborate a bit more
on the reason behind it though. Maybe we weren't clear enough about it
in the FLIP. The key feature of all the config options is that they must
have a string representation as they might come from a configuration
file. Moreover it must be a human readable format, so that the values
might be manually adjusted. With that in mind, we did not want to add
support for arbitrary nesting, and we decided to allow for lists only
(and flat objects - I think though in the current design there is a
mistake around the Configurable interface). I think though you have a
point here and it would be better to have a ListConfigOption instead of
this field. Does it make sense to you?

As for the second part of your message. I am not sure if I understood
it. The validators work with parsed/deserialized values from the
Configuration, which means they can be bound to the generic parameter of
the ConfigOption. You can have e.g. a RangeValidator<T extends
Comparable<T>>. I don't think the type hierarchy in the ConfigOption
has anything to do with the validation logic. Could you elaborate a bit
more on what you meant?
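A range validator of the kind mentioned above could be sketched as follows. This is illustrative only (not Flink's actual validator API); it shows why the validator only needs a `Comparable` bound and is independent of any option type hierarchy:

```java
public class RangeValidatorSketch {

    // Bound to Comparable<T> only; knows nothing about ConfigOption.
    static final class RangeValidator<T extends Comparable<T>> {
        private final T min;
        private final T max;

        RangeValidator(T min, T max) {
            this.min = min;
            this.max = max;
        }

        // Inclusive range check on the parsed/deserialized value.
        boolean isValid(T value) {
            return value.compareTo(min) >= 0 && value.compareTo(max) <= 0;
        }
    }

    public static void main(String[] args) {
        RangeValidator<Integer> ports = new RangeValidator<>(1024, 65535);
        System.out.println(ports.isValid(8081)); // true
        System.out.println(ports.isValid(80));   // false
    }
}
```

The same validator works unchanged for any comparable type (Integer, Long, Duration, ...), which supports the point that validation can be decoupled from the option's class hierarchy.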

Best,

Dawid

On 18/08/2019 16:42, Stephan Ewen wrote:

I like the idea 

Re: [DISCUSS] Update our Roadmap

2019-08-19 Thread Robert Metzger
Nice, thanks! Looking forward to the PR

On Mon, Aug 19, 2019 at 11:08 AM Marta Paes Moreira 
wrote:

> Hey, Robert.
>
> Updating the roadmap is something I can work on (and also have on my radar,
> moving forward). Already had a quick word with Stephan and he's available
> to provide support, if needed.
>
> Marta
>
> On Sun, Aug 18, 2019 at 4:46 PM Stephan Ewen  wrote:
>
> > I could help with that.
> >
> > On Fri, Aug 16, 2019 at 2:36 PM Robert Metzger 
> > wrote:
> >
> > > Flink 1.9 is feature freezed and almost released.
> > > I guess it makes sense to update the roadmap on the website again.
> > >
> > > Who feels like having a good overview of what's coming up?
> > >
> > > On Tue, May 7, 2019 at 4:33 PM Fabian Hueske 
> wrote:
> > >
> > > > Yes, that's a very good proposal Jark.
> > > > +1
> > > >
> > > > Best, Fabian
> > > >
> > > > On Mon, May 6, 2019 at 16:33, Till Rohrmann  wrote:
> > > >
> > > > > I think this is a good idea Jark. Putting the last update date on
> the
> > > > > roadmap would also force us to regularly update it.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Mon, May 6, 2019 at 4:14 AM Jark Wu  wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > One suggestion for the roadmap:
> > > > > >
> > > > > > Shall we add a `latest-update-time` to the top of Roadmap page?
> So
> > > that
> > > > > > users can know this is a up-to-date Roadmap.
> > > > > >
> > > > > > On Thu, 2 May 2019 at 04:49, Bowen Li 
> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Mon, Apr 29, 2019 at 11:41 PM jincheng sun <
> > > > > sunjincheng...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jeff,
> > > > > > > >
> > > > > > > > I have open the PR about add Python Table API section to the
> > > > > roadmap. I
> > > > > > > > appreciate if you have time to look at it. :)
> > > > > > > >
> > > > > > > > https://github.com/apache/flink-web/pull/204
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Jincheng
> > > > > > > >
> > > > > > > > > jincheng sun wrote on Mon, Apr 29, 2019 at 11:12 PM:
> > > > > > > >
> > > > > > > > > Sure, I will do it!I think the python table api info should
> > in
> > > > the
> > > > > > > > >  roadmap! Thank you @Jeff @Fabian
> > > > > > > > >
> > > > > > > > > >> Fabian Hueske wrote on Mon, Apr 29, 2019 at 23:05:
> > > > > > > > >
> > > > > > > > >> Great, thanks Jeff and Timo!
> > > > > > > > >>
> > > > > > > > >> @Jincheng do you want to write a paragraph about the
> Python
> > > > effort
> > > > > > and
> > > > > > > > >> open a PR for it?
> > > > > > > > >>
> > > > > > > > >> I'll remove the issue about Hadoop convenience builds
> > > > > (FLINK-11266).
> > > > > > > > >>
> > > > > > > > >> Best, Fabian
> > > > > > > > >>
> > > > > > > > >> On Mon, Apr 29, 2019 at 16:37, Jeff Zhang  wrote:
> > > > > > > > >>
> > > > > > > > >>> jincheng(cc) is driving the python effort, I think he can
> > > help
> > > > to
> > > > > > > > >>> prepare it.
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> Fabian Hueske wrote on Mon, Apr 29, 2019 at 10:15 PM:
> > > > > > > > >>>
> > > > > > > >  Hi everyone,
> > > > > > > > 
> > > > > > > >  Since we had no more comments on this thread, I think we
> > > > proceed
> > > > > > to
> > > > > > > >  update the roadmap.
> > > > > > > > 
> > > > > > > >  @Jeff Zhang  I agree, we should add
> the
> > > > > Python
> > > > > > > >  efforts to the roadmap.
> > > > > > > >  Do you want to prepare a short paragraph that we can add
> > to
> > > > the
> > > > > > > >  document?
> > > > > > > > 
> > > > > > > >  Best, Fabian
> > > > > > > > 
> > > > > > > >  On Wed, Apr 17, 2019 at 15:04, Jeff Zhang  wrote:
> > > > > > > > 
> > > > > > > > > Hi Fabian,
> > > > > > > > >
> > > > > > > > > One thing missing is python api and python udf, we
> > already
> > > > > > > discussed
> > > > > > > > > it in
> > > > > > > > > community, and it is very close to reach consensus.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Fabian Hueske wrote on Wed, Apr 17, 2019 at 7:51 PM:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > We recently added a roadmap to our project website
> [1]
> > > and
> > > > > > > decided
> > > > > > > > to
> > > > > > > > > > update it after every release. Flink 1.8.0 was
> > released a
> > > > few
> > > > > > > days
> > > > > > > > > ago, so
> > > > > > > > > > I think it we should check and remove from the
> roadmap
> > > what
> > > > > was
> > > > > > > > > achieved so
> > > > > > > > > > far and add features / improvements that we plan for
> > the
> > > > > > future.
> > > > > > > > > >
> > > > > > > > > > I had a look at the roadmap and found that
> > > > > > > > 

Re: [DISCUSS] Release flink-shaded 8.0

2019-08-19 Thread Nico Kruber
I quickly went through all the changelogs for Netty 4.1.32 (which we
currently use) to the latest Netty 4.1.39.Final. Below, you will find a
list of bug fixes and performance improvements that may affect us. Nice
changes we could benefit from, also for the Java > 8 efforts. The most
important ones fixing leaks etc are #8921, #9167, #9274, #9394, and the
various CompositeByteBuf fixes. The rest are mostly performance
improvements.

Since we are still early in the dev cycle for Flink 1.10, it would be nice
to update and verify that the new version works correctly. I'll
create a ticket and PR.


FYI (1): My own patches to bring dynamically-linked openSSL to more
distributions, namely SUSE and Arch, have not made it into a release yet.

FYI (2): We are currently using the latest version of netty-tcnative,
i.e. 2.0.25.


Nico

--
Netty 4.1.33.Final
- Fix ClassCastException and native crash when using kqueue transport
(#8665)
- Provide a way to cache the internal nioBuffer of the PooledByteBuffer
to reduce GC (#8603)

Netty 4.1.34.Final
- Do not use GetPrimitiveArrayCritical(...) due multiple not-fixed bugs
related to GCLocker (#8921)
- Correctly monkey-patch id also when os / arch is used within library
name (#8913)
- Further reduce ensureAccessible() overhead (#8895)
- Support using an Executor to offload blocking / long-running tasks
when processing TLS / SSL via the SslHandler (#8847)
- Minimize memory footprint for AbstractChannelHandlerContext for
handlers that execute in the EventExecutor (#8786)
- Fix three bugs in CompositeByteBuf (#8773)

Netty 4.1.35.Final
- Fix possible ByteBuf leak when CompositeByteBuf is resized (#8946)
- Correctly produce ssl alert when certificate validation fails on the
client-side when using native SSL implementation (#8949)

Netty 4.1.37.Final
- Don't filter out TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (#9274)
- Try to mark child channel writable again once the parent channel
becomes writable (#9254)
- Properly debounce wakeups (#9191)
- Don't read from timerfd and eventfd on each EventLoop tick (#9192)
- Correctly detect that KeyManagerFactory is not supported when using
OpenSSL 1.1.0+ (#9170)
- Fix possible unsafe sharing of internal NIO buffer in CompositeByteBuf
(#9169)
- KQueueEventLoop won't unregister active channels reusing a file
descriptor (#9149)
- Prefer direct io buffers if direct buffers pooled (#9167)

Netty 4.1.38.Final
- Prevent ByteToMessageDecoder from overreading when !isAutoRead (#9252)
- Correctly take length of ByteBufInputStream into account for
readLine() / readByte() (#9310)
- availableSharedCapacity will be slowly exhausted (#9394)
--

On 18/08/2019 16:47, Stephan Ewen wrote:
> Are we fine with the current Netty version, or would be want to bump it?
> 
> On Fri, Aug 16, 2019 at 10:30 AM Chesnay Schepler  > wrote:
> 
> Hello,
> 
> I would like to kick off the next flink-shaded release next week. There
> are 2 ongoing efforts that are blocked on this release:
> 
>   * [FLINK-13467] Java 11 support requires a bump to ASM to correctly
>     handle Java 11 bytecode
>   * [FLINK-11767] Reworking the typeSerializerSnapshotMigrationTestBase
>     requires asm-commons to be added to flink-shaded-asm
> 
> Are there any other changes on anyone's radar that we will have to make
> for 1.10? (will bumping calcite require anything, for example)
> 
> 

-- 
Nico Kruber | Solutions Architect

Follow us @VervericaData Ververica
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen





Re: [DISCUSS][CODE STYLE] Create collections always with initial capacity

2019-08-19 Thread Andrey Zagrebin
Hi All,

It looks like this proposal has approval and we can conclude this
discussion.
Additionally, I agree with Piotr that we should really require a proven good
reason for setting the capacity, to avoid the confusion, redundancy, and other
already mentioned issues while reading and maintaining the code.
Ideally, the need for setting the capacity should be either immediately clear
(e.g. performance) or explained in a comment if it is non-trivial.
Admittedly, this can easily enter a grey zone, so I would not strictly demand
performance-measurement proof, e.g. if the size is known and it is "per
record" code.
At the end of the day it is a decision of the code developer and reviewer.

The conclusion is then:
Set the initial capacity only if there is a good proven reason to do it.
Otherwise do not clutter the code with it.
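The conclusion, and in particular the hash-map grey zone mentioned earlier in the thread, can be illustrated with a small runnable sketch (a generic example, not code from the Flink code base):

```java
import java.util.HashMap;
import java.util.Map;

public class InitialCapacitySketch {

    // Fill a map with n entries; used to compare a hinted and a default map.
    static Map<String, Integer> fill(Map<String, Integer> map, int n) {
        for (int i = 0; i < n; i++) {
            map.put("f" + i, i);
        }
        return map;
    }

    public static void main(String[] args) {
        int numFields = 16; // size known up front, e.g. in per-record code

        // The grey zone: with the default load factor of 0.75, a HashMap
        // created with capacity 16 resizes once it holds more than 12 entries,
        // so "new HashMap<>(size)" does not necessarily avoid all resizes.
        Map<String, Integer> withHint = fill(new HashMap<>(numFields), numFields);

        // Without a good, proven reason, rely on the JVM default instead:
        Map<String, Integer> plain = fill(new HashMap<>(), numFields);

        System.out.println(withHint.size()); // 16
        System.out.println(plain.size());    // 16
    }
}
```

Both maps end up identical from the caller's perspective, which is exactly why the capacity hint should only be added when there is a proven reason for it.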

Best,
Andrey

On Thu, Aug 1, 2019 at 5:10 PM Piotr Nowojski  wrote:

> Hi,
>
> > - a bit more code, increases maintenance burden.
>
> I think there is even more to that. It’s almost like a code duplication,
> albeit expressed in a very different way, with all of the drawbacks of
> duplicated code: initial capacity can drift out of sync, causing confusion.
> Also it’s not “a bit more code”; it might take non-trivial
> reasoning/calculation to set the initial value. Whenever we change
> something/refactor the code, "maintenance burden” will mostly come from
> that.
>
> Also I think this just usually falls under a premature optimisation rule.
>
> Besides:
>
> > The conclusion is the following at the moment:
> > Only set the initial capacity if you have a good idea about the expected
> size.
>
> I would add a clause to set the initial capacity “only for good proven
> reasons”. It’s not about whether we can set it, but whether it makes sense
> to do so (to avoid the before mentioned "maintenance burden”).
>
> Piotrek
>
> > On 1 Aug 2019, at 14:41, Xintong Song  wrote:
> >
> > +1 on setting initial capacity only when have good expectation on the
> > collection size.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, Aug 1, 2019 at 2:32 PM Andrey Zagrebin 
> wrote:
> >
> >> Hi all,
> >>
> >> As you probably already noticed, Stephan has triggered a discussion
> thread
> >> about code style guide for Flink [1]. Recently we were discussing
> >> internally some smaller concerns and I would like to start separate threads
> >> for them.
> >>
> >> This thread is about creating collections always with initial capacity.
> As
> >> you might have seen, some parts of our code base always initialise
> >> collections with some non-default capacity. You can even activate a
> check
> >> in IntelliJ Idea that can monitor and highlight creation of collection
> >> without initial capacity.
> >>
> >> Pros:
> >> - performance gain if there is a good reasoning about initial capacity
> >> - the capacity is always deterministic and does not depend on any
> changes
> >> of its default value in Java
> >> - easy to follow: always initialise, has IDE support for detection
> >>
> >> Cons (for initialising w/o good reasoning):
> >> - We are trying to outsmart JVM. When there is no good reasoning about
> >> initial capacity, we can rely on JVM default value.
> >> - It is even confusing e.g. for hash maps as the real size depends on
> the
> >> load factor.
> >> - It would only add minor performance gain.
> >> - a bit more code, increases maintenance burden.
> >>
> >> The conclusion is the following at the moment:
> >> Only set the initial capacity if you have a good idea about the expected
> >> size.
> >>
> >> Please, feel free to share you thoughts.
> >>
> >> Best,
> >> Andrey
> >>
> >> [1]
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3ced91df4b-7cab-4547-a430-85bc710fd...@apache.org%3E
> >>
>
>


Re: [DISCUSS] FLIP-53: Fine Grained Resource Management

2019-08-19 Thread Yang Wang
Hi Xintong,


Thanks for your detailed proposal. I think many users are suffering from
wasted resources. The resource spec of all task managers is the same, and we
have to scale up all task managers to make the heaviest one more stable. So we
will benefit a lot from fine-grained resource management. We could get
better resource utilization and stability.


Just to share some thoughts.



   1. How do we calculate the resource specification of TaskManagers? Do they
   all have the same resource spec calculated based on the configuration? I think
   we would still have wasted resources in that case. Or could we start
   TaskManagers with different specs?
   2. If a slot is released and returned to the SlotPool, can it be
   reused by another SlotRequest whose requested resource is smaller?
  - If yes, what happens to the available resources in the
  TaskManager?
  - What is the SlotStatus of the cached slot in the SlotPool? Is the
  AllocationId null?
   3. In a session cluster, some jobs are configured with operator
   resources while other jobs use UNKNOWN. How do we deal with this
   situation?



Best,
Yang

Xintong Song wrote on Fri, Aug 16, 2019 at 8:57 PM:

> Thanks for the feedbacks, Yangze and Till.
>
> Yangze,
>
> I agree with you that we should make scheduling strategy pluggable and
> optimize the strategy to reduce the memory fragmentation problem, and
> thanks for the inputs on the potential algorithmic solutions. However, I'm
> in favor of keeping this FLIP focused on the overall mechanism design rather
> than strategies. Solving the fragmentation issue should be considered as an
> optimization, and I agree with Till that we probably should tackle this
> afterwards.
>
> Till,
>
> - Regarding splitting the FLIP, I think it makes sense. The operator
> resource management and dynamic slot allocation do not have much dependency
> on each other.
>
> - Regarding the default slot size, I think this is similar to FLIP-49 [1]
> where we want all the deriving to happen in one place. I think it would be
> nice to pass the default slot size into the task executor in the same way
> that we pass in the memory pool sizes in FLIP-49 [1].
>
> - Regarding the return value of TaskExecutorGateway#requestResource, I
> think you're right. We should avoid using null as the return value. I think
> we probably should throw an exception here.
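The "fail explicitly instead of returning null" alternative discussed for TaskExecutorGateway#requestResource can be sketched as follows. All names here are hypothetical simplifications for illustration, not Flink's actual gateway API:

```java
public class RequestResourceSketch {

    // Unchecked exception signalling that a resource request cannot be met.
    static final class SlotResourceException extends RuntimeException {
        SlotResourceException(String msg) { super(msg); }
    }

    // Throw an explicit exception instead of returning null when the
    // requested resource cannot be fulfilled.
    static String requestResource(int requestedMb, int availableMb) {
        if (requestedMb > availableMb) {
            throw new SlotResourceException(
                "Cannot fulfill resource request of " + requestedMb + " MB");
        }
        return "slot-" + requestedMb; // stand-in for a slot descriptor
    }

    public static void main(String[] args) {
        System.out.println(requestResource(512, 1024)); // slot-512
        try {
            requestResource(2048, 1024);
        } catch (SlotResourceException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The caller is forced to handle the failure path explicitly, whereas a null return can silently propagate and fail far from the request site.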
>
> Thank you~
>
> Xintong Song
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>
> On Fri, Aug 16, 2019 at 2:18 PM Till Rohrmann 
> wrote:
>
> > Hi Xintong,
> >
> > thanks for drafting this FLIP. I think your proposal helps to execute
> > batch jobs more efficiently. Moreover, it enables the proper
> > integration of the Blink planner which is very important as well.
> >
> > Overall, the FLIP looks good to me. I was wondering whether it wouldn't
> > make sense to actually split it up into two FLIPs: Operator resource
> > management and dynamic slot allocation. I think these two FLIPs could be
> > seen as orthogonal and it would decrease the scope of each individual
> FLIP.
> >
> > Some smaller comments:
> >
> > - I'm not sure whether we should pass in the default slot size via an
> > environment variable. Without having unified the way how Flink components
> > are configured [1], I think it would be better to pass it in as part of
> the
> > configuration.
> > - I would avoid returning a null value from
> > TaskExecutorGateway#requestResource if it cannot be fulfilled. Either we
> > should introduce an explicit return value saying this or throw an
> > exception.
> >
> > Concerning Yangze's comments: I think you are right that it would be
> > helpful to make the selection strategy pluggable. Also batching slot
> > requests to the RM could be a good optimization. For the sake of keeping
> > the scope of this FLIP smaller I would try to tackle these things after
> the
> > initial version has been completed (without spoiling these optimization
> > opportunities). In particular batching the slot requests depends on the
> > current scheduler refactoring and could also be realized on the RM side
> > only.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-54%3A+Evolve+ConfigOption+and+Configuration
> >
> > Cheers,
> > Till
> >
> >
> >
> > On Fri, Aug 16, 2019 at 11:11 AM Yangze Guo  wrote:
> >
> > > Hi, Xintong
> > >
> > > Thanks for proposing this FLIP. The general design looks good to me, +1
> > > for this feature.
> > >
> > > Since slots in the same task executor could have different resource
> > > profiles, we will run into a resource fragmentation problem. Think
> > > about this case:
> > >  - request A wants 1G memory while requests B & C want 0.5G memory each
> > >  - There are two task executors T1 & T2 with 1G and 0.5G free memory
> > > respectively
> > > If B comes first and we cut a slot from T1 for B, A must wait for free
> > > resources from other tasks. But A could 

Re: [DISCUSS] Update our Roadmap

2019-08-19 Thread Marta Paes Moreira
Hey, Robert.

Updating the roadmap is something I can work on (and also have on my radar,
moving forward). Already had a quick word with Stephan and he's available
to provide support, if needed.

Marta

On Sun, Aug 18, 2019 at 4:46 PM Stephan Ewen  wrote:

> I could help with that.
>
> On Fri, Aug 16, 2019 at 2:36 PM Robert Metzger 
> wrote:
>
> > Flink 1.9 is feature freezed and almost released.
> > I guess it makes sense to update the roadmap on the website again.
> >
> > Who feels like having a good overview of what's coming up?
> >
> > On Tue, May 7, 2019 at 4:33 PM Fabian Hueske  wrote:
> >
> > > Yes, that's a very good proposal Jark.
> > > +1
> > >
> > > Best, Fabian
> > >
> > > Am Mo., 6. Mai 2019 um 16:33 Uhr schrieb Till Rohrmann <
> > > trohrm...@apache.org
> > > >:
> > >
> > > > I think this is a good idea Jark. Putting the last update date on the
> > > > roadmap would also force us to regularly update it.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Mon, May 6, 2019 at 4:14 AM Jark Wu  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > One suggestion for the roadmap:
> > > > >
> > > > > Shall we add a `latest-update-time` to the top of the Roadmap page?
> > > > > So that users can know this is an up-to-date Roadmap.
> > > > >
> > > > > On Thu, 2 May 2019 at 04:49, Bowen Li  wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Mon, Apr 29, 2019 at 11:41 PM jincheng sun <
> > > > sunjincheng...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > I have opened the PR to add a Python Table API section to the
> > > > > > > roadmap. I would appreciate it if you have time to look at it. :)
> > > > > > >
> > > > > > > https://github.com/apache/flink-web/pull/204
> > > > > > >
> > > > > > > Regards,
> > > > > > > Jincheng
> > > > > > >
> > > > > > > jincheng sun  于2019年4月29日周一
> 下午11:12写道:
> > > > > > >
> > > > > > > > Sure, I will do it! I think the Python Table API info should be
> > > > > > > > in the roadmap! Thank you @Jeff @Fabian
> > > > > > > >
> > > > > > > > Fabian Hueske 于2019年4月29日 周一23:05写道:
> > > > > > > >
> > > > > > > >> Great, thanks Jeff and Timo!
> > > > > > > >>
> > > > > > > >> @Jincheng do you want to write a paragraph about the Python
> > > effort
> > > > > and
> > > > > > > >> open a PR for it?
> > > > > > > >>
> > > > > > > >> I'll remove the issue about Hadoop convenience builds
> > > > (FLINK-11266).
> > > > > > > >>
> > > > > > > >> Best, Fabian
> > > > > > > >>
> > > > > > > >> Am Mo., 29. Apr. 2019 um 16:37 Uhr schrieb Jeff Zhang <
> > > > > > zjf...@gmail.com
> > > > > > > >:
> > > > > > > >>
> > > > > > > >>> jincheng(cc) is driving the python effort, I think he can
> > help
> > > to
> > > > > > > >>> prepare it.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Fabian Hueske  于2019年4月29日周一 下午10:15写道:
> > > > > > > >>>
> > > > > > >  Hi everyone,
> > > > > > > 
> > > > > > >  Since we had no more comments on this thread, I think we
> > > proceed
> > > > > to
> > > > > > >  update the roadmap.
> > > > > > > 
> > > > > > >  @Jeff Zhang  I agree, we should add the
> > > > Python
> > > > > > >  efforts to the roadmap.
> > > > > > >  Do you want to prepare a short paragraph that we can add
> to
> > > the
> > > > > > >  document?
> > > > > > > 
> > > > > > >  Best, Fabian
> > > > > > > 
> > > > > > >  Am Mi., 17. Apr. 2019 um 15:04 Uhr schrieb Jeff Zhang <
> > > > > > > zjf...@gmail.com
> > > > > > >  >:
> > > > > > > 
> > > > > > > > Hi Fabian,
> > > > > > > >
> > > > > > > > One thing missing is python api and python udf, we
> already
> > > > > > discussed
> > > > > > > > it in
> > > > > > > > community, and it is very close to reach consensus.
> > > > > > > >
> > > > > > > >
> > > > > > > > Fabian Hueske  于2019年4月17日周三
> 下午7:51写道:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > We recently added a roadmap to our project website [1]
> > and
> > > > > > decided
> > > > > > > to
> > > > > > > > > update it after every release. Flink 1.8.0 was
> released a
> > > few
> > > > > > days
> > > > > > > > ago, so
> > > > > > > > > I think it we should check and remove from the roadmap
> > what
> > > > was
> > > > > > > > achieved so
> > > > > > > > > far and add features / improvements that we plan for
> the
> > > > > future.
> > > > > > > > >
> > > > > > > > > I had a look at the roadmap and found that
> > > > > > > > >
> > > > > > > > > > We are changing the build setup to not bundle Hadoop
> by
> > > > > > default,
> > > > > > > > but
> > > > > > > > > rather offer pre-packaged
> > > > > > > > > > Hadoop libraries for the use with Yarn, HDFS, etc. as
> > > > > > convenience
> > > > > > > > > downloads FLINK-11266 <
> > > > > > > > https://issues.apache.org/jira/browse/FLINK-11266>.
> > > > > > > > >
> > 

Re: [DISCUSS][CODE STYLE] Breaking long function argument lists and chained method calls

2019-08-19 Thread Andrey Zagrebin
Hi Everybody,

Thanks for your feedback guys and sorry for not getting back to the
discussion for some time.

@SHI Xiaogang
About breaking lines for thrown exceptions:
Indeed that would prevent growing the throws clause indefinitely.
I am a bit concerned about putting the right parenthesis and/or throws
clause on the next line
because in general we do not do it, and there are a lot of variations of how
and what to put on the next line, so it needs explicit memorising.
Also, we do not have many checked exceptions and usually avoid them.
Although I am not a big fan of many function arguments either, this seems
to be a bigger problem in the code base.
I would be ok with not enforcing anything for exceptions atm.

@Chesnay Schepler 
Thanks for mentioning automatic checks.
Indeed, pointing out this kind of style issue during PR reviews is very
tedious,
and we cannot really enforce it without automated tools.
I would still consider the outcome of this discussion as a soft
recommendation atm (which we also have for some other things in the code
style draft).
We need more investigation about how to enforce things. I am not so
knowledgeable about code style/IDE checks.
From the first glance I also do not see a simple way. If somebody has more
insight please share your experience.

@Biao Liu 
Line length limitation:
I do not see anything for Java, only for Scala: 100 (also enforced by build
AFAIK).
From what I heard, there has already been some discussion about a hard
limit for the line length.
Although quite a few people are in favour of it (including me) and it seems
to be a nice limitation,
there are some practical implications.
Historically, Flink did not have any code style checks, and huge chunks of
the code base would have to be reformatted, destroying the commit history.
Another thing is the value for the limit. Nowadays we have wide screens and
often do not even need to scroll.
Nevertheless, we can kick off another discussion about the line length
limit and enforcing it.
Atm I see people adhere to a soft recommendation of 120 characters for
Java because it is usually a bit more verbose compared to Scala.
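If the community ever did decide on a hard line-length limit, one way to enforce it automatically would be a Checkstyle rule. This is only a hypothetical sketch (the 120 value is the soft recommendation mentioned above, not a decided limit, and the placement of the module depends on the Checkstyle version in use):

```xml
<!-- Hypothetical sketch: enforcing a 120-character line limit with
     Checkstyle. In recent Checkstyle versions LineLength is a direct
     child of the Checker module. -->
<module name="Checker">
  <module name="LineLength">
    <property name="max" value="120"/>
    <!-- Import statements (and often long URLs) are common exceptions. -->
    <property name="ignorePattern" value="^import .*$"/>
  </module>
</module>
```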

*Question 1*:
I would be ok with always breaking the line if there is more than one
chained call.
There are a lot of places where there is only one short call; I would not
break the line in this case.
If it is too confusing I would be ok with sticking to the rule to break
either all or none.
Thanks for pointing this out explicitly: for chained method calls, the
new line should start with the dot.
I think it should also be part of the rule if enforced.
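A hypothetical illustration of the rule discussed in Question 1, using plain JDK code rather than the Flink API, purely to show the line-breaking style:

```java
public class Main {
    public static void main(String[] args) {
        // More than one chained call: break onto separate lines,
        // each new line starting with the dot.
        String broken = new StringBuilder()
            .append("chained ")
            .append("method ")
            .append("calls")
            .toString();

        // A single short call can stay on one line.
        String single = broken.trim();

        System.out.println(single);
    }
}
```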

*Question 2:*
Should the indent of the new line be 1 tab or 2 tabs? (I assume it matters
only for function arguments.)
This is a good point which again probably deserves a separate thread.
We also had an internal discussion about it. I would be also in favour of
resolving it into one way.
Atm we indeed have 2 ways in our code base which are again soft
recommendations.
The problem is mostly with enforcing it automatically.
The approach with extra indentation also needs IDE setup; otherwise it is
annoying that after every function cut/paste, e.g. IDEA changes the format
to one indentation automatically and often people do not notice it.

I suggest we still bring this discussion to the point of a soft
recommendation, which we can reconsider
when there are more ideas about automatically enforcing these things.

Best,
Andrey

On Sat, Aug 3, 2019 at 7:51 AM SHI Xiaogang  wrote:

> Hi Chesnay,
>
> Thanks a lot for your reminder.
>
> For Intellij settings, the style i proposed can be configured as below
> * Method declaration parameters: chop down if long
> * align when multiple: YES
> * new line after '(': YES
> * place ')' on new line: YES
> * Method call arguments: chop down if long
> * align when multiple: YES
> * take priority over call chain wrapping: YES
> * new line after '(': YES
> * place ')' on new line: YES
> * Throws list: chop down if long
> * align when multiline: YES
>
> As far as I know, no standard checks exist for the alignment of
> method parameters or arguments. It needs further investigation to see
> whether we can validate these styles via customized checks.
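A hypothetical illustration (the names are made up) of the "chop down if long" settings listed above: one parameter per line, closing parenthesis on its own line, for both declarations and call sites:

```java
public class Main {

    // Declaration: parameters chopped down, ')' on its own line.
    static String buildConnectionString(
            String host,
            int port,
            String database,
            String user
    ) {
        return host + ":" + port + "/" + database + "?user=" + user;
    }

    public static void main(String[] args) {
        // Call site: arguments chopped down the same way.
        System.out.println(
            buildConnectionString(
                "localhost",
                5432,
                "flink",
                "admin"
            )
        );
    }
}
```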
>
>
> Biao Liu  于2019年8月2日周五 下午4:00写道:
>
> > Hi Andrey,
> >
> > Thank you for bringing us this discussion.
> >
> > I would like to make some details clear. Correct me if I am wrong.
> >
> > The guide draft [1] says the line length is limited to 100 characters.
> > From my understanding, this discussion suggests that if there are more
> > than 100 characters in one line (both Scala and Java), we should start a
> > new line (or lines).
> >
> > *Question 1*: If a line does not exceed 100 characters, should we break
> > the chained calls into lines? Currently the chained calls are always
> > broken into lines even if the line is not too long. Is that just a
> > suggestion or a limitation?
> > I prefer it's a limitation which must be respected. And we should always
> > break the 

Re: Review of pull request

2019-08-19 Thread Rishindra Kumar
Thank you Gordon. I will have a look.

-- 
*Maddila Rishindra Kumar*
*Software Engineer*
*Walmartlabs India*
*Contact No: +919967379528 | Alternate E-mail
ID: rishindra.madd...@walmartlabs.com *


Re: Review of pull request

2019-08-19 Thread Tzu-Li (Gordon) Tai
Hi Rishindra,

Thanks for the pull request. I've left a comment on it.

Cheers,
Gordon

On Mon, Aug 19, 2019 at 10:16 AM Till Rohrmann  wrote:

> Hi Rishindra,
>
> I've pulled in Gordon who has worked on the Elasticsearch connector. He
> might be able to review the PR.
>
> Cheers,
> Till
>
> On Mon, Aug 19, 2019 at 8:10 AM Rishindra Kumar <
> rishindrakuma...@gmail.com> wrote:
>
>> Hi,
>>
>> I created pull request with the change I proposed in the comment. Could
>> someone please review it?
>> https://github.com/apache/flink/pull/9468
>>
>> --
>> *Maddila Rishindra Kumar*
>> *Software Engineer*
>> *Walmartlabs India*
>> *Contact No: +919967379528 | Alternate E-mail
>> ID: rishindra.madd...@walmartlabs.com > >*
>>
>


Re: [DISCUSS] Reducing build times

2019-08-19 Thread Aljoscha Krettek
I did a quick test: a normal "mvn clean install -DskipTests -Drat.skip=true 
-Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my machine takes about 14 
minutes. After removing all mentions of maven-shade-plugin the build time goes 
down to roughly 11.5 minutes. (Obviously the resulting Flink won't work, 
because some expected stuff is not packaged and most of the end-to-end tests 
use the shade plugin to package the jars for testing.)

Aljoscha

> On 18. Aug 2019, at 19:52, Robert Metzger  wrote:
> 
> Hi all,
> 
> I wanted to understand the impact of the hardware we are using for running
> our tests. Each travis worker has 2 virtual cores, and 7.5 gb memory [1].
> They are using Google Cloud Compute Engine *n1-standard-2* instances.
> Running a full "mvn clean verify" takes *03:32 h* on such a machine type.
> 
> Running the same workload on a 32 virtual cores, 64 gb machine, takes *1:21
> h*.
> 
> What is interesting are the per-module build time differences.
> Modules which are parallelizing tests well greatly benefit from the
> additional cores:
> "flink-tests" 36:51 min vs 4:33 min
> "flink-runtime" 23:41 min vs 3:47 min
> "flink-table-planner" 15:54 min vs 3:13 min
> 
> On the other hand, we have modules which are not parallel at all:
> "flink-connector-kafka": 16:32 min vs 15:19 min
> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> Also, the checkstyle plugin is not scaling at all.
> 
> Chesnay reported some significant speedups by reusing forks.
> I don't know how much effort it would be to make the Kafka tests
> parallelizable. In total, they currently use 30 minutes on the big machine
> (while 31 CPUs are idling :) )
> 
> Let me know what you think about these results. If the community is
> generally interested in further investigating into that direction, I could
> look into software to orchestrate this, as well as sponsors for such an
> infrastructure.
> 
> [1] https://docs.travis-ci.com/user/reference/overview/
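The fork-reuse speedups Chesnay reported would, in Maven terms, come from maven-surefire-plugin settings along these lines. This is only a hypothetical sketch; the values that actually suit Flink's test suite would need benchmarking:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- One forked JVM per available CPU core. -->
    <forkCount>1C</forkCount>
    <!-- Reuse forked JVMs across test classes instead of restarting them. -->
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```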
> 
> 
> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler  wrote:
> 
>> @Aljoscha Shading takes a few minutes for a full build; you can see this
>> quite easily by looking at the compile step in the misc profile; all
>> modules that take longer than a fraction of a second are usually caused
>> by shading lots of classes. Note that I cannot tell you how much of this
>> is spent on relocations, and how much on writing the jar.
>> 
>> Personally, I'd very much like us to move all shading to flink-shaded;
>> this would finally allow us to use newer maven versions without needing
>> cumbersome workarounds for flink-dist. However, this isn't a trivial
>> affair in some cases; IIRC calcite could be difficult to handle.
>> 
>> On another note, this would also simplify switching the main repo to
>> another build system, since you would no longer had to deal with
>> relocations, just packaging + merging NOTICE files.
>> 
>> @BowenLi I disagree, flink-shaded does not include any tests,  API
>> compatibility checks, checkstyle, layered shading (e.g., flink-runtime
>> and flink-dist, where both relocate dependencies and one is bundled by
>> the other), and, most importantly, CI (and really, without CI being
>> covered in a PoC there's nothing to discuss).
>> 
>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
>>> Speaking of flink-shaded, do we have any idea what the impact of shading
>> is on the build time? We could get rid of shading completely in the Flink
>> main repository by moving everything that we shade to flink-shaded.
>>> 
>>> Aljoscha
>>> 
 On 16. Aug 2019, at 14:58, Bowen Li  wrote:
 
 +1 to Till's points on #2 and #5, especially the potential
>> non-disruptive,
 gradual migration approach if we decide to go that route.
 
 To add on, I want to point out that we can actually start with the
 flink-shaded project [1], which is a perfect candidate for a PoC. It's of
 much smaller size, totally isolated from and not interfering with the flink
 project [2], and it actually covers most of our practical feature
 requirements for a build tool - all making it an ideal experimental field.
 
 [1] https://github.com/apache/flink-shaded
 [2] https://github.com/apache/flink
 
 
 On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann 
>> wrote:
 
> For the sake of keeping the discussion focused and not cluttering the
> discussion thread I would suggest to split the detailed reporting for
> reusing JVMs to a separate thread and cross linking it from here.
> 
> Cheers,
> Till
> 
> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
> wrote:
> 
>> Update:
>> 
>> TL;DR: table-planner is a good candidate for enabling fork reuse right
>> away, while flink-tests has the potential for huge savings, but we
>> have
>> to figure out some issues first.
>> 
>> 
>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
>> 

Re: Cwiki edit access

2019-08-19 Thread Till Rohrmann
Hi Thomas,

I've given you access. You should be able to access it now with your Apache
account. Please let me know if something is not working.

Cheers,
Till

On Mon, Aug 19, 2019 at 6:15 AM Thomas Weise  wrote:

> Hi,
>
> I would like to be able to edit pages in the Confluence Flink space. Can
> someone give me access please?
>
> Thanks
>


Re: Review of pull request

2019-08-19 Thread Till Rohrmann
Hi Rishindra,

I've pulled in Gordon who has worked on the Elasticsearch connector. He
might be able to review the PR.

Cheers,
Till

On Mon, Aug 19, 2019 at 8:10 AM Rishindra Kumar 
wrote:

> Hi,
>
> I created pull request with the change I proposed in the comment. Could
> someone please review it?
> https://github.com/apache/flink/pull/9468
>
> --
> *Maddila Rishindra Kumar*
> *Software Engineer*
> *Walmartlabs India*
> *Contact No: +919967379528 | Alternate E-mail
> ID: rishindra.madd...@walmartlabs.com *
>


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Stephan Ewen
+1 for Gordon's approach.

If we do that, we can probably skip re-testing everything and mainly need
to verify the release artifacts (signatures, build from source, etc.).

If we open the RC up for changes, I fear a lot of small issues will rush in
and destabilize the candidate again, meaning we have to do another larger
testing effort.



On Mon, Aug 19, 2019 at 9:48 AM Becket Qin  wrote:

> Hi Gordon,
>
> I remember we mentioned earlier that if there is an additional RC, we can
> piggyback the GCP PubSub API change (
> https://issues.apache.org/jira/browse/FLINK-13231). It is a small patch to
> avoid future API change. So should be able to merge it very shortly. Would
> it be possible to include that into RC3 as well?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai 
> wrote:
>
> > Hi,
> >
> > https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
> > actual
> > blocker, so we would have to close this RC now in favor of a new one.
> >
> > Since we are already quite past the planned release time for 1.9.0, I
> would
> > like to limit the new changes included in RC3 to only the following:
> > - https://issues.apache.org/jira/browse/FLINK-13752
> > - Fix license and notice file issues that Kurt had found with
> > flink-runtime-web and flink-state-processing-api
> >
> > This means that I will not be creating RC3 with the release-1.9 branch as
> > is, but essentially only cherry-picking the above mentioned changes on
> top
> > of RC2.
> > The minimal set of changes on top of RC2 should allow us to carry most if
> > not all of the already existing votes without another round of extensive
> > testing, and allow us to have a shortened voting time.
> >
> > I understand that there are other issues mentioned in this thread that
> are
> > already spotted and merged to release-1.9, especially for the Blink
> planner
> > and DDL, but I suggest not to include them in RC3.
> > I think it would be better to collect all the remaining issues for those
> > over a period of time, and include them as 1.9.1 which can ideally also
> > happen a few weeks soon after 1.9.0.
> >
> > What do you think? If there are not objections, I would proceed with this
> > plan and push out a new RC by the end of today (Aug. 19th CET).
> >
> > Regards,
> > Gordon
> >
> > On Mon, Aug 19, 2019 at 4:09 AM Zili Chen  wrote:
> >
> > > We should investigate the performance regression, but regardless of the
> > > regression I vote +1
> > >
> > > Have verified following things
> > >
> > > - Jobs running on YARN x (Session & Per Job) with high-availability
> > > enabled.
> > > - Simulate JM and TM failures.
> > > - Simulate temporary network partition.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Stephan Ewen  于2019年8月18日周日 下午10:12写道:
> > >
> > > > For reference, this is the JIRA issue about the regression in
> question:
> > > >
> > > > https://issues.apache.org/jira/browse/FLINK-13752
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
> > wrote:
> > > >
> > > > > Hi, till
> > > > > I can send the job to you offline.
> > > > > It is just a datastream job and does not use
> > > > TwoInputSelectableStreamTask.
> > > > > A->B
> > > > >  \
> > > > >C
> > > > >  /
> > > > > D->E
> > > > > Best,
> > > > > Guowei
> > > > >
> > > > >
> > > > > Till Rohrmann  于2019年8月16日周五 下午4:34写道:
> > > > >
> > > > > > Thanks for reporting this issue Guowei. Could you share a bit
> more
> > > > > details
> > > > > > what the job exactly does and which operators it uses? Does the
> job
> > > > uses
> > > > > > the new `TwoInputSelectableStreamTask` which might cause the
> > > > performance
> > > > > > regression?
> > > > > >
> > > > > > I think it is important to understand where the problem comes
> from
> > > > before
> > > > > > we proceed with the release.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma  >
> > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > -1
> > > > > > > We have a benchmark job, which includes a two-input operator.
> > > > > > > This job has a big performance regression using 1.9 compared to
> > > 1.8.
> > > > > > > It's still not very clear why this regression happens.
> > > > > > >
> > > > > > > Best,
> > > > > > > Guowei
> > > > > > >
> > > > > > >
> > > > > > > Yu Li  于2019年8月16日周五 下午3:27写道:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > - checked release notes: OK
> > > > > > > > - checked sums and signatures: OK
> > > > > > > > - source release
> > > > > > > >  - contains no binaries: OK
> > > > > > > >  - contains no 1.9-SNAPSHOT references: OK
> > > > > > > >  - build from source: OK (8u102)
> > > > > > > >  - mvn clean verify: OK (8u102)
> > > > > > > > - binary release
> > > > > > > >  - no examples appear to be missing
> > > > > > > >  - started a cluster; WebUI reachable, example ran
> > > successfully
> > > > > 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Till Rohrmann
+1 for only cherry picking FLINK-13752 and the LICENSE fixes into RC 3.

Cheers,
Till

On Mon, Aug 19, 2019 at 9:48 AM Becket Qin  wrote:

> Hi Gordon,
>
> I remember we mentioned earlier that if there is an additional RC, we can
> piggyback the GCP PubSub API change (
> https://issues.apache.org/jira/browse/FLINK-13231). It is a small patch to
> avoid future API change. So should be able to merge it very shortly. Would
> it be possible to include that into RC3 as well?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai 
> wrote:
>
> > Hi,
> >
> > https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
> > actual
> > blocker, so we would have to close this RC now in favor of a new one.
> >
> > Since we are already quite past the planned release time for 1.9.0, I
> would
> > like to limit the new changes included in RC3 to only the following:
> > - https://issues.apache.org/jira/browse/FLINK-13752
> > - Fix license and notice file issues that Kurt had found with
> > flink-runtime-web and flink-state-processing-api
> >
> > This means that I will not be creating RC3 with the release-1.9 branch as
> > is, but essentially only cherry-picking the above mentioned changes on
> top
> > of RC2.
> > The minimal set of changes on top of RC2 should allow us to carry most if
> > not all of the already existing votes without another round of extensive
> > testing, and allow us to have a shortened voting time.
> >
> > I understand that there are other issues mentioned in this thread that
> are
> > already spotted and merged to release-1.9, especially for the Blink
> planner
> > and DDL, but I suggest not to include them in RC3.
> > I think it would be better to collect all the remaining issues for those
> > over a period of time, and include them as 1.9.1 which can ideally also
> > happen a few weeks soon after 1.9.0.
> >
> > What do you think? If there are not objections, I would proceed with this
> > plan and push out a new RC by the end of today (Aug. 19th CET).
> >
> > Regards,
> > Gordon
> >
> > On Mon, Aug 19, 2019 at 4:09 AM Zili Chen  wrote:
> >
> > > We should investigate the performance regression, but regardless of the
> > > regression I vote +1
> > >
> > > Have verified following things
> > >
> > > - Jobs running on YARN x (Session & Per Job) with high-availability
> > > enabled.
> > > - Simulate JM and TM failures.
> > > - Simulate temporary network partition.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Stephan Ewen  于2019年8月18日周日 下午10:12写道:
> > >
> > > > For reference, this is the JIRA issue about the regression in
> question:
> > > >
> > > > https://issues.apache.org/jira/browse/FLINK-13752
> > > >
> > > >
> > > > On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
> > wrote:
> > > >
> > > > > Hi, till
> > > > > I can send the job to you offline.
> > > > > It is just a datastream job and does not use
> > > > TwoInputSelectableStreamTask.
> > > > > A->B
> > > > >  \
> > > > >C
> > > > >  /
> > > > > D->E
> > > > > Best,
> > > > > Guowei
> > > > >
> > > > >
> > > > > Till Rohrmann  于2019年8月16日周五 下午4:34写道:
> > > > >
> > > > > > Thanks for reporting this issue Guowei. Could you share a bit
> more
> > > > > details
> > > > > > what the job exactly does and which operators it uses? Does the
> job
> > > > uses
> > > > > > the new `TwoInputSelectableStreamTask` which might cause the
> > > > performance
> > > > > > regression?
> > > > > >
> > > > > > I think it is important to understand where the problem comes
> from
> > > > before
> > > > > > we proceed with the release.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma  >
> > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > -1
> > > > > > > We have a benchmark job, which includes a two-input operator.
> > > > > > > This job has a big performance regression using 1.9 compared to
> > > 1.8.
> > > > > > > It's still not very clear why this regression happens.
> > > > > > >
> > > > > > > Best,
> > > > > > > Guowei
> > > > > > >
> > > > > > >
> > > > > > > Yu Li  于2019年8月16日周五 下午3:27写道:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > - checked release notes: OK
> > > > > > > > - checked sums and signatures: OK
> > > > > > > > - source release
> > > > > > > >  - contains no binaries: OK
> > > > > > > >  - contains no 1.9-SNAPSHOT references: OK
> > > > > > > >  - build from source: OK (8u102)
> > > > > > > >  - mvn clean verify: OK (8u102)
> > > > > > > > - binary release
> > > > > > > >  - no examples appear to be missing
> > > > > > > >  - started a cluster; WebUI reachable, example ran
> > > successfully
> > > > > > > > - repository appears to contain all expected artifacts
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Yu
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, 16 Aug 2019 at 06:06, Bowen Li 
> > > > wrote:
> > > > > > > >
> > > > > 

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-19 Thread Ufuk Celebi
I'm late to the party... Welcome and congrats! :-)

– Ufuk


On Mon, Aug 19, 2019 at 9:26 AM Andrey Zagrebin 
wrote:

> Hi Everybody!
>
> Thanks a lot for the warm welcome!
> I am really happy about joining Flink committer team and hope to help the
> project to grow more.
>
> Cheers,
> Andrey
>
> On Fri, Aug 16, 2019 at 11:10 AM Terry Wang  wrote:
>
> > Congratulations Andrey!
> >
> > Best,
> > Terry Wang
> >
> >
> >
> > 在 2019年8月15日,下午9:27,Hequn Cheng  写道:
> >
> > Congratulations Andrey!
> >
> > On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske  wrote:
> >
> >> Congrats Andrey!
> >>
> >> Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao  >:
> >>
> >> > Congratulations Andrey, well deserved!
> >> >
> >> > Best,
> >> > Gary
> >> >
> >> > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li  wrote:
> >> >
> >> > > Congratulations Andrey!
> >> > >
> >> > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong 
> >> wrote:
> >> > >
> >> > >> Congratulations Andrey!
> >> > >>
> >> > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok 
> >> wrote:
> >> > >>
> >> > >> > Congratulations Andrey!
> >> > >> > At 2019-08-14 21:26:37, "Till Rohrmann" 
> >> wrote:
> >> > >> > >Hi everyone,
> >> > >> > >
> >> > >> > >I'm very happy to announce that Andrey Zagrebin accepted the
> >> offer of
> >> > >> the
> >> > >> > >Flink PMC to become a committer of the Flink project.
> >> > >> > >
> >> > >> > >Andrey has been an active community member for more than 15
> >> months.
> >> > He
> >> > >> has
> >> > >> > >helped shaping numerous features such as State TTL, FRocksDB
> >> release,
> >> > >> > >Shuffle service abstraction, FLIP-1, result partition management
> >> and
> >> > >> > >various fixes/improvements. He's also frequently helping out on
> >> the
> >> > >> > >user@f.a.o mailing lists.
> >> > >> > >
> >> > >> > >Congratulations Andrey!
> >> > >> > >
> >> > >> > >Best, Till
> >> > >> > >(on behalf of the Flink PMC)
> >> > >> >
> >> > >>
> >> > >
> >> >
> >>
> >
> >
>


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Becket Qin
Hi Gordon,

I remember we mentioned earlier that if there is an additional RC, we can
piggyback the GCP PubSub API change (
https://issues.apache.org/jira/browse/FLINK-13231). It is a small patch to
avoid a future API change, so it should be possible to merge it very
shortly. Would it be possible to include that in RC3 as well?

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 19, 2019 at 9:43 AM Tzu-Li (Gordon) Tai 
wrote:

> Hi,
>
> https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an
> actual
> blocker, so we would have to close this RC now in favor of a new one.
>
> Since we are already quite past the planned release time for 1.9.0, I would
> like to limit the new changes included in RC3 to only the following:
> - https://issues.apache.org/jira/browse/FLINK-13752
> - Fix license and notice file issues that Kurt had found with
> flink-runtime-web and flink-state-processing-api
>
> This means that I will not be creating RC3 with the release-1.9 branch as
> is, but essentially only cherry-picking the above mentioned changes on top
> of RC2.
> The minimal set of changes on top of RC2 should allow us to carry most if
> not all of the already existing votes without another round of extensive
> testing, and allow us to have a shortened voting time.
>
> I understand that there are other issues mentioned in this thread that have
> already been spotted and merged into release-1.9, especially for the Blink planner
> and DDL, but I suggest not to include them in RC3.
> I think it would be better to collect all the remaining issues for those
> over a period of time, and include them in 1.9.1, which can ideally also
> happen a few weeks after 1.9.0.
>
> What do you think? If there are no objections, I would proceed with this
> plan and push out a new RC by the end of today (Aug. 19th CET).
>
> Regards,
> Gordon
>
> On Mon, Aug 19, 2019 at 4:09 AM Zili Chen  wrote:
>
> > We should investigate the performance regression, but regardless of the
> > regression I vote +1.
> >
> > I have verified the following things:
> >
> > - Jobs running on YARN x (Session & Per Job) with high-availability
> > enabled.
> > - Simulate JM and TM failures.
> > - Simulate temporary network partition.
> >
> > Best,
> > tison.
> >
> >
> > Stephan Ewen  于2019年8月18日周日 下午10:12写道:
> >
> > > For reference, this is the JIRA issue about the regression in question:
> > >
> > > https://issues.apache.org/jira/browse/FLINK-13752
> > >
> > >
> > > On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma 
> wrote:
> > >
> > > > Hi Till,
> > > > I can send the job to you offline.
> > > > It is just a datastream job and does not use
> > > TwoInputSelectableStreamTask.
> > > > A->B
> > > >  \
> > > >C
> > > >  /
> > > > D->E
> > > > Best,
> > > > Guowei
> > > >
> > > >
> > > > Till Rohrmann  于2019年8月16日周五 下午4:34写道:
> > > >
> > > > > Thanks for reporting this issue Guowei. Could you share a bit more
> > > > details
> > > > > what the job exactly does and which operators it uses? Does the job
> > > use
> > > > > the new `TwoInputSelectableStreamTask` which might cause the
> > > performance
> > > > > regression?
> > > > >
> > > > > I think it is important to understand where the problem comes from
> > > before
> > > > > we proceed with the release.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma 
> > > wrote:
> > > > >
> > > > > > Hi,
> > > > > > -1
> > > > > > We have a benchmark job, which includes a two-input operator.
> > > > > > This job has a big performance regression using 1.9 compared to
> > 1.8.
> > > > > > It's still not very clear why this regression happens.
> > > > > >
> > > > > > Best,
> > > > > > Guowei
> > > > > >
> > > > > >
> > > > > > Yu Li  于2019年8月16日周五 下午3:27写道:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > - checked release notes: OK
> > > > > > > - checked sums and signatures: OK
> > > > > > > - source release
> > > > > > >  - contains no binaries: OK
> > > > > > >  - contains no 1.9-SNAPSHOT references: OK
> > > > > > >  - build from source: OK (8u102)
> > > > > > >  - mvn clean verify: OK (8u102)
> > > > > > > - binary release
> > > > > > >  - no examples appear to be missing
> > > > > > >  - started a cluster; WebUI reachable, example ran
> > successfully
> > > > > > > - repository appears to contain all expected artifacts
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Yu
> > > > > > >
> > > > > > >
> > > > > > > On Fri, 16 Aug 2019 at 06:06, Bowen Li 
> > > wrote:
> > > > > > >
> > > > > > > > Hi Jark,
> > > > > > > >
> > > > > > > > Thanks for letting me know that it's been like this in
> previous
> > > > > > releases.
> > > > > > > > Though I don't think that's the right behavior, it can be
> > > discussed
> > > > > for
> > > > > > > > later release. Thus I retract my -1 for RC2.
> > > > > > > >
> > > > > > > > Bowen
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu 
> > > 

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-19 Thread Tzu-Li (Gordon) Tai
Hi,

https://issues.apache.org/jira/browse/FLINK-13752 turns out to be an actual
blocker, so we would have to close this RC now in favor of a new one.

Since we are already quite past the planned release time for 1.9.0, I would
like to limit the new changes included in RC3 to only the following:
- https://issues.apache.org/jira/browse/FLINK-13752
- Fix license and notice file issues that Kurt had found with
flink-runtime-web and flink-state-processing-api

This means that I will not be creating RC3 with the release-1.9 branch as
is, but essentially only cherry-picking the above mentioned changes on top
of RC2.
The minimal set of changes on top of RC2 should allow us to carry most if
not all of the already existing votes without another round of extensive
testing, and allow us to have a shortened voting time.

I understand that there are other issues mentioned in this thread that have
already been spotted and merged into release-1.9, especially for the Blink planner
and DDL, but I suggest not to include them in RC3.
I think it would be better to collect all the remaining issues for those
over a period of time, and include them in 1.9.1, which can ideally also
happen a few weeks after 1.9.0.

What do you think? If there are no objections, I would proceed with this
plan and push out a new RC by the end of today (Aug. 19th CET).

Regards,
Gordon

On Mon, Aug 19, 2019 at 4:09 AM Zili Chen  wrote:

> We should investigate the performance regression, but regardless of the
> regression I vote +1.
>
> I have verified the following things:
>
> - Jobs running on YARN x (Session & Per Job) with high-availability
> enabled.
> - Simulate JM and TM failures.
> - Simulate temporary network partition.
>
> Best,
> tison.
>
>
> Stephan Ewen  于2019年8月18日周日 下午10:12写道:
>
> > For reference, this is the JIRA issue about the regression in question:
> >
> > https://issues.apache.org/jira/browse/FLINK-13752
> >
> >
> > On Fri, Aug 16, 2019 at 10:57 AM Guowei Ma  wrote:
> >
> > > Hi Till,
> > > I can send the job to you offline.
> > > It is just a datastream job and does not use
> > TwoInputSelectableStreamTask.
> > > A->B
> > >  \
> > >C
> > >  /
> > > D->E
> > > Best,
> > > Guowei
> > >
> > >
> > > Till Rohrmann  于2019年8月16日周五 下午4:34写道:
> > >
> > > > Thanks for reporting this issue Guowei. Could you share a bit more
> > > details
> > > > what the job exactly does and which operators it uses? Does the job
> > use
> > > > the new `TwoInputSelectableStreamTask` which might cause the
> > performance
> > > > regression?
> > > >
> > > > I think it is important to understand where the problem comes from
> > before
> > > > we proceed with the release.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 16, 2019 at 10:27 AM Guowei Ma 
> > wrote:
> > > >
> > > > > Hi,
> > > > > -1
> > > > > We have a benchmark job, which includes a two-input operator.
> > > > > This job has a big performance regression using 1.9 compared to
> 1.8.
> > > > > It's still not very clear why this regression happens.
> > > > >
> > > > > Best,
> > > > > Guowei
> > > > >
> > > > >
> > > > > Yu Li  于2019年8月16日周五 下午3:27写道:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - checked release notes: OK
> > > > > > - checked sums and signatures: OK
> > > > > > - source release
> > > > > >  - contains no binaries: OK
> > > > > >  - contains no 1.9-SNAPSHOT references: OK
> > > > > >  - build from source: OK (8u102)
> > > > > >  - mvn clean verify: OK (8u102)
> > > > > > - binary release
> > > > > >  - no examples appear to be missing
> > > > > >  - started a cluster; WebUI reachable, example ran
> successfully
> > > > > > - repository appears to contain all expected artifacts
> > > > > >
> > > > > > Best Regards,
> > > > > > Yu
> > > > > >
> > > > > >
> > > > > > On Fri, 16 Aug 2019 at 06:06, Bowen Li 
> > wrote:
> > > > > >
> > > > > > > Hi Jark,
> > > > > > >
> > > > > > > Thanks for letting me know that it's been like this in previous
> > > > > releases.
> > > > > > > Though I don't think that's the right behavior, it can be
> > discussed
> > > > for
> > > > > > > later release. Thus I retract my -1 for RC2.
> > > > > > >
> > > > > > > Bowen
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 7:49 PM Jark Wu 
> > wrote:
> > > > > > >
> > > > > > > > Hi Bowen,
> > > > > > > >
> > > > > > > > Thanks for reporting this.
> > > > > > > > However, I don't think this is an issue. IMO, it is by
> design.
> > > > > > > > The `tEnv.listUserDefinedFunctions()` in Table API and `show
> > > > > > functions;`
> > > > > > > in
> > > > > > > > SQL CLI are intended to return only the registered UDFs, not
> > > > > including
> > > > > > > > built-in functions.
> > > > > > > > This is also the behavior in previous versions.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jark
> > > > > > > >
> > > > > > > > On Fri, 16 Aug 2019 at 06:52, Bowen Li 
> > > > wrote:
> > > > > > > >
> > > > > > > > > -1 for 

Re: [VOTE] FLIP-51: Rework of the Expression Design

2019-08-19 Thread JingsongLee
Thanks for the votes!
We have 4 +1's from Timo, Jark, Dawid, and Aljoscha.

Since there are no disapproving votes and the voting time has passed, I
will consider this FLIP approved for adoption into Apache Flink.

Best,
Jingsong Lee


--
From:Aljoscha Krettek 
Send Time:2019年8月16日(星期五) 15:11
To:dev 
Subject:Re: [VOTE] FLIP-51: Rework of the Expression Design

+1

This seems to be a good refactoring/cleanup step to me!

> On 16. Aug 2019, at 10:59, Dawid Wysakowicz  wrote:
> 
> +1 from my side
> 
> Best,
> 
> Dawid
> 
> On 16/08/2019 10:31, Jark Wu wrote:
>> +1 from my side.
>> 
>> Thanks Jingsong for driving this.
>> 
>> Best,
>> Jark
>> 
>> On Thu, 15 Aug 2019 at 22:09, Timo Walther  wrote:
>> 
>>> +1 for this.
>>> 
>>> Thanks,
>>> Timo
>>> 
>>> Am 15.08.19 um 15:57 schrieb JingsongLee:
 Hi Flink devs,
 
 I would like to start the voting for FLIP-51 Rework of the Expression
  Design.
 
 FLIP wiki:
 
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
 Discussion thread:
 
>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html
 Google Doc:
 
>>> https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing
 Thanks,
 
 Best,
 Jingsong Lee
>>> 
>>> 
> 



Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-19 Thread Andrey Zagrebin
Hi Everybody!

Thanks a lot for the warm welcome!
I am really happy about joining the Flink committer team and I hope to help
the project grow more.

Cheers,
Andrey

On Fri, Aug 16, 2019 at 11:10 AM Terry Wang  wrote:

> Congratulations Andrey!
>
> Best,
> Terry Wang
>
>
>
> 在 2019年8月15日,下午9:27,Hequn Cheng  写道:
>
> Congratulations Andrey!
>
> On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske  wrote:
>
>> Congrats Andrey!
>>
>> Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao :
>>
>> > Congratulations Andrey, well deserved!
>> >
>> > Best,
>> > Gary
>> >
>> > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li  wrote:
>> >
>> > > Congratulations Andrey!
>> > >
>> > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong 
>> wrote:
>> > >
>> > >> Congratulations Andrey!
>> > >>
>> > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok 
>> wrote:
>> > >>
>> > >> > Congratulations Andrey!
>> > >> > At 2019-08-14 21:26:37, "Till Rohrmann" 
>> wrote:
>> > >> > >Hi everyone,
>> > >> > >
>> > >> > >I'm very happy to announce that Andrey Zagrebin accepted the
>> offer of
>> > >> the
>> > >> > >Flink PMC to become a committer of the Flink project.
>> > >> > >
>> > >> > >Andrey has been an active community member for more than 15
>> months.
>> > He
>> > >> has
>> > >> > >helped shape numerous features such as State TTL, FRocksDB
>> release,
>> > >> > >Shuffle service abstraction, FLIP-1, result partition management
>> and
>> > >> > >various fixes/improvements. He's also frequently helping out on
>> the
>> > >> > >user@f.a.o mailing lists.
>> > >> > >
>> > >> > >Congratulations Andrey!
>> > >> > >
>> > >> > >Best, Till
>> > >> > >(on behalf of the Flink PMC)
>> > >> >
>> > >>
>> > >
>> >
>>
>
>


Review of pull request

2019-08-19 Thread Rishindra Kumar
Hi,

I created a pull request with the change I proposed in the comment. Could
someone please review it?
https://github.com/apache/flink/pull/9468

-- 
*Maddila Rishindra Kumar*
*Software Engineer*
*Walmartlabs India*
*Contact No: +919967379528 | Alternate E-mail
ID: rishindra.madd...@walmartlabs.com *