[jira] [Created] (FLINK-13744) Improper error message when submit flink job in yarn-cluster mode without hadoop lib bundled
Jeff Zhang created FLINK-13744: -- Summary: Improper error message when submit flink job in yarn-cluster mode without hadoop lib bundled Key: FLINK-13744 URL: https://issues.apache.org/jira/browse/FLINK-13744 Project: Flink Issue Type: Improvement Components: Command Line Client Affects Versions: 1.9.0 Reporter: Jeff Zhang Here's the error message I see when I submit a Flink job in yarn-cluster mode. The root cause is that I didn't build Flink with the include-hadoop profile, but the error message doesn't tell the user what actually went wrong. We should improve the error message in this case.
{code:java}
➜ build-target git:(FLINK-13415) ✗ bin/flink run -m yarn-cluster examples/batch/WordCount.jar

The program finished with the following exception:

java.lang.RuntimeException: Could not identify hostname and port in 'yarn-cluster'.
    at org.apache.flink.client.ClientUtils.parseHostPortAddress(ClientUtils.java:47)
    at org.apache.flink.client.cli.AbstractCustomCommandLine.applyCommandLineOptionsToConfiguration(AbstractCustomCommandLine.java:83)
    at org.apache.flink.client.cli.DefaultCLI.createClusterDescriptor(DefaultCLI.java:60)
    at org.apache.flink.client.cli.DefaultCLI.createClusterDescriptor(DefaultCLI.java:35)
    at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:216)
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
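For illustration, a minimal sketch of the kind of check that could yield a clearer message. This is not existing Flink code; the class, method, and message text below are assumptions.
{code:java}
// Minimal sketch only (not actual Flink code): detect the missing-Hadoop case before
// falling back to host:port parsing, so the user gets an actionable hint instead of
// "Could not identify hostname and port in 'yarn-cluster'".
public class YarnTargetCheck {

    static void checkYarnClusterTarget(String jobManagerAddress) {
        if (!"yarn-cluster".equals(jobManagerAddress)) {
            return; // a plain host:port target, nothing to check here
        }
        try {
            // The YARN command line is only usable when Hadoop/YARN classes are on the classpath.
            Class.forName("org.apache.hadoop.yarn.conf.YarnConfiguration");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(
                "The deployment target 'yarn-cluster' was requested, but no Hadoop/YARN "
                    + "dependencies were found on the classpath. Build Flink with the "
                    + "'include-hadoop' profile or export HADOOP_CLASSPATH before submitting.", e);
        }
    }
}
{code}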
[jira] [Created] (FLINK-13743) Port PythonTableUtils to flink-python module
Dian Fu created FLINK-13743: --- Summary: Port PythonTableUtils to flink-python module Key: FLINK-13743 URL: https://issues.apache.org/jira/browse/FLINK-13743 Project: Flink Issue Type: Task Components: API / Python Reporter: Dian Fu Fix For: 1.10.0 Currently *PythonTableUtils* is located in the *flink-table-planner* module; however, it is shared by both the Flink planner and the Blink planner. It makes more sense to move it to the *flink-python* module. This also implies that we should port it from Scala to Java. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Jark, Thanks for letting me know that it's been like this in previous releases. Though I don't think that's the right behavior, it can be discussed for later release. Thus I retract my -1 for RC2. Bowen On Thu, Aug 15, 2019 at 7:49 PM Jark Wu wrote: > Hi Bowen, > > Thanks for reporting this. > However, I don't think this is an issue. IMO, it is by design. > The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;` in > SQL CLI are intended to return only the registered UDFs, not including > built-in functions. > This is also the behavior in previous versions. > > Best, > Jark > > On Fri, 16 Aug 2019 at 06:52, Bowen Li wrote: > > > -1 for RC2. > > > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I > > think it's a blocker. The bug means currently if users call > > `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru > > SQL would not be able to see Flink's built-in functions. > > > > I'm preparing a fix right now. > > > > Bowen > > > > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai > > > wrote: > > > > > Thanks for all the test efforts, verifications and votes so far. > > > > > > So far, things are looking good, but we still require one more PMC > > binding > > > vote for this RC to be the official release, so I would like to extend > > the > > > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > > > > > In the meantime, the release notes for 1.9.0 had only just been > finalized > > > [1], and could use a few more eyes before closing the vote. > > > Any help with checking if anything else should be mentioned there > > regarding > > > breaking changes / known shortcomings would be appreciated. > > > > > > Cheers, > > > Gordon > > > > > > [1] https://github.com/apache/flink/pull/9438 > > > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > > > > > Great, then I have no other comments on legal check. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > > > > wrote: > > > > > > > > > The licensing items aren't a problem; we don't care about Flink > > modules > > > > > in NOTICE files, and we don't have to update the source-release > > > > > licensing since we don't have a pre-built version of the WebUI in > the > > > > > source. > > > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > > > After going through the licenses, I found 2 suspicions but not > sure > > > if > > > > > they > > > > > > are > > > > > > valid or not. > > > > > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, > but > > > not > > > > > > included in > > > > > > NOTICE-binary file (the one under the root directory) like other > > > > modules. > > > > > > 2. flink-runtime-web distributed some JavaScript dependencies > > through > > > > > source > > > > > > codes, the licenses and NOTICE file were only updated inside the > > > module > > > > > of > > > > > > flink-runtime-web, but not the NOTICE file and licenses directory > > > which > > > > > > under > > > > > > the root directory. > > > > > > > > > > > > Another minor issue I just found is: > > > > > > FLINK-13558 tries to include table examples to flink-dist, but I > > > cannot > > > > > > find it in > > > > > > the binary distribution of RC2. 
> > > > > > > > > > > > Best, > > > > > > Kurt > > > > > > > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young > > wrote: > > > > > > > > > > > >> Hi Gordon & Timo, > > > > > >> > > > > > >> Thanks for the feedback, and I agree with it. I will document > this > > > in > > > > > the > > > > > >> release notes. > > > > > >> > > > > > >> Best, > > > > > >> Kurt > > > > > >> > > > > > >> > > > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > > > tzuli...@apache.org> > > > > > >> wrote: > > > > > >> > > > > > >>> Hi Kurt, > > > > > >>> > > > > > >>> With the same argument as before, given that it is mentioned in > > the > > > > > >>> release > > > > > >>> announcement that it is a preview feature, I would not block > this > > > > > release > > > > > >>> because of it. > > > > > >>> Nevertheless, it would be important to mention this explicitly > in > > > the > > > > > >>> release notes [1]. > > > > > >>> > > > > > >>> Regards, > > > > > >>> Gordon > > > > > >>> > > > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > > > >>> > > > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther < > > twal...@apache.org> > > > > > wrote: > > > > > >>> > > > > > Hi Kurt, > > > > > > > > > > I agree that this is a serious bug. However, I would not block > > the > > > > > release because of this. As you said, there is a workaround > and > > > the > > > > > `execute()` works in the most common case of a single > execution. > > > We > > > > > can > > > > > fix this in a minor release shortly after. > > > > > > > > > > What do others think? > > > > > > > > > > Regards, > > > > > Ti
[jira] [Created] (FLINK-13742) Fix code generation when aggregation contains both distinct aggregate with and without filter
Jark Wu created FLINK-13742: --- Summary: Fix code generation when aggregation contains both distinct aggregate with and without filter Key: FLINK-13742 URL: https://issues.apache.org/jira/browse/FLINK-13742 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: Jark Wu Fix For: 1.9.1 The following test will fail when the aggregation contains {{COUNT(DISTINCT c)}} and {{COUNT(DISTINCT c) filter ...}}.
{code:java}
@Test
def testDistinctWithMultiFilter(): Unit = {
  val sqlQuery = "SELECT b, " +
    " SUM(DISTINCT (a * 3)), " +
    " COUNT(DISTINCT SUBSTRING(c FROM 1 FOR 2))," +
    " COUNT(DISTINCT c)," +
    " COUNT(DISTINCT c) filter (where MOD(a, 3) = 0)," +
    " COUNT(DISTINCT c) filter (where MOD(a, 3) = 1) " +
    "FROM MyTable " +
    "GROUP BY b"

  val t = failingDataSource(StreamTestData.get3TupleData).toTable(tEnv).as('a, 'b, 'c)
  tEnv.registerTable("MyTable", t)

  val result = tEnv.sqlQuery(sqlQuery).toRetractStream[Row]
  val sink = new TestingRetractSink
  result.addSink(sink)
  env.execute()

  val expected = List(
    "1,3,1,1,0,1",
    "2,15,1,2,1,0",
    "3,45,3,3,1,1",
    "4,102,1,4,1,2",
    "5,195,1,5,2,1",
    "6,333,1,6,2,2")
  assertEquals(expected.sorted, sink.getRetractResults.sorted)
}
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Reducing build times
Thanks Chesnay for starting this discussion. +1 for #1, it might be the easiest way to get a significant speedup. If the only reason we don't reuse JVMs is isolation, I think we can fix the static fields or global state used in Flink where possible. +1 for #2, and thanks Aleksey for the prototype. I think it's a good approach that doesn't introduce too many things to maintain. +1 for #3 (run CRON or e2e tests on demand). We have this requirement when reviewing some pull requests, because we aren't sure whether they will break some specific e2e test. Currently, we have to run them locally by building the whole project, or enable CRON jobs for the pushed branch in the contributor's own Travis. Besides that, I think FLINK-11464 [1] is also a good way to cache distributions and save a lot of download time. Best, Jark [1]: https://issues.apache.org/jira/browse/FLINK-11464 On Thu, 15 Aug 2019 at 21:47, Aleksey Pak wrote: > Hi all! > > Thanks for starting this discussion. > > I'd like to also add my 2 cents: > > +1 for #2, differential build scripts. > I've worked on the approach. And with it, I think it's possible to reduce > total build time with relatively low effort, without enforcing any new > build tool and low maintenance cost. > > You can check a proposed change (for the old CI setup, when Flink PRs were > running in Apache common CI pool) here: > https://github.com/apache/flink/pull/9065 > In the proposed change, the dependency check is not heavily hardcoded and > just uses maven's results for dependency graph analysis. > > > This approach is conceptually quite straight-forward, but has limits > since it has to be pessimistic; > i.e. a change in flink-core _must_ result > in testing all modules. > > Agree, in Flink case, there are some core modules that would trigger whole > tests run with such approach. For developers who modify such components, > the build time would be the longest. But this approach should really help > for developers who touch more-or-less independent modules. > > Even for core modules, it's possible to create "abstraction" barriers by > changing dependency graph. For example, it can look like: flink-core-api > <-- flink-core, flink-core-api <-- flink-connectors. > In that case, only change in flink-core-api would trigger whole tests run. > > +1 for #3, separating PR CI runs to different stages. > Imo, it may require more change to current CI setup, compared to #2 and > better it should not be silly. Best, if it integrates with the Flink bot > and triggers some follow up build steps only when some prerequisites are > done. > > +1 for #4, to move some tests into cron runs. > But imo, this does not scale well, it applies only to a small subset of > tests. > > +1 for #6, to use other CI service(s). > More specifically, GitHub gives build actions for free that can be used to > offload some build steps/PR checks. It can help to move out some PR checks > from the main CI build (for example: documentation builds, license checks, > code formatting checks). > > Regards, > Aleksey > > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann > wrote: > > > Thanks for starting this discussion Chesnay. I think it has become > obvious > > to the Flink community that with the existing build setup we cannot > really > > deliver fast build times which are essential for fast iteration cycles > and > > high developer productivity. The reasons for this situation are manifold > > but it is definitely affected by Flink's project growth, not always > optimal > > tests and the inflexibility that everything needs to be built.
Hence, I > > consider the reduction of build times crucial for the project's health > and > > future growth. > > > > Without necessarily voicing a strong preference for any of the presented > > suggestions, I wanted to comment on each of them: > > > > 1. This sounds promising. Could the reason why we don't reuse JVMs date > > back to the time when we still had a lot of static fields in Flink which > > made it hard to reuse JVMs and the potentially mutated global state? > > > > 2. Building hand-crafted solutions around a build system in order to > > compensate for its limitations which other build systems support out of > the > > box sounds like the not invented here syndrome to me. Reinventing the > wheel > > has historically proven to be usually not the best solution and it often > > comes with a high maintenance price tag. Moreover, it would add just > > another layer of complexity around our existing build system. I think the > > current state where we have the maven setup in pom files and for Travis > > multiple bash scripts specializing the builds to make it fit the time > limit > > is already not very transparent/easy to understand. > > > > 3. I could see this work but it also requires a very good understanding > of > > Flink of every committer because the committer needs to know which tests > > would be good to run additionally. > > > > 4. I would be against this option solely to decrease our build time. My
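To make the pessimistic module-selection idea behind option #2 concrete, here is a small, self-contained Java sketch (the dependency map is made up, not Flink's real module graph): the set of modules to retest is the reverse transitive closure of the changed module, so a change in flink-core selects everything while a change in a leaf module selects only itself.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of pessimistic test selection for a differential build: every module that
// (transitively) depends on a changed module must be retested. The dependency map
// below is illustrative only, not Flink's real module graph.
public class AffectedModules {

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = Map.of(
            "flink-core", List.of(),
            "flink-runtime", List.of("flink-core"),
            "flink-connectors", List.of("flink-core"),
            "flink-table-planner", List.of("flink-core", "flink-runtime"));

        System.out.println(affected("flink-core", dependsOn));       // all four modules
        System.out.println(affected("flink-connectors", dependsOn)); // only flink-connectors
    }

    static Set<String> affected(String changed, Map<String, List<String>> dependsOn) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changed));
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (!result.add(current)) {
                continue; // already visited
            }
            // enqueue every module that declares a direct dependency on 'current'
            dependsOn.forEach((module, deps) -> {
                if (deps.contains(current)) {
                    queue.add(module);
                }
            });
        }
        return result;
    }
}
{code}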
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Bowen, Thanks for reporting this. However, I don't think this is an issue. IMO, it is by design. The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;` in SQL CLI are intended to return only the registered UDFs, not including built-in functions. This is also the behavior in previous versions. Best, Jark On Fri, 16 Aug 2019 at 06:52, Bowen Li wrote: > -1 for RC2. > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I > think it's a blocker. The bug means currently if users call > `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru > SQL would not be able to see Flink's built-in functions. > > I'm preparing a fix right now. > > Bowen > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai > wrote: > > > Thanks for all the test efforts, verifications and votes so far. > > > > So far, things are looking good, but we still require one more PMC > binding > > vote for this RC to be the official release, so I would like to extend > the > > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > > > In the meantime, the release notes for 1.9.0 had only just been finalized > > [1], and could use a few more eyes before closing the vote. > > Any help with checking if anything else should be mentioned there > regarding > > breaking changes / known shortcomings would be appreciated. > > > > Cheers, > > Gordon > > > > [1] https://github.com/apache/flink/pull/9438 > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > > > Great, then I have no other comments on legal check. > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > > wrote: > > > > > > > The licensing items aren't a problem; we don't care about Flink > modules > > > > in NOTICE files, and we don't have to update the source-release > > > > licensing since we don't have a pre-built version of the WebUI in the > > > > source. > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > > After going through the licenses, I found 2 suspicions but not sure > > if > > > > they > > > > > are > > > > > valid or not. > > > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but > > not > > > > > included in > > > > > NOTICE-binary file (the one under the root directory) like other > > > modules. > > > > > 2. flink-runtime-web distributed some JavaScript dependencies > through > > > > source > > > > > codes, the licenses and NOTICE file were only updated inside the > > module > > > > of > > > > > flink-runtime-web, but not the NOTICE file and licenses directory > > which > > > > > under > > > > > the root directory. > > > > > > > > > > Another minor issue I just found is: > > > > > FLINK-13558 tries to include table examples to flink-dist, but I > > cannot > > > > > find it in > > > > > the binary distribution of RC2. > > > > > > > > > > Best, > > > > > Kurt > > > > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young > wrote: > > > > > > > > > >> Hi Gordon & Timo, > > > > >> > > > > >> Thanks for the feedback, and I agree with it. I will document this > > in > > > > the > > > > >> release notes. 
> > > > >> > > > > >> Best, > > > > >> Kurt > > > > >> > > > > >> > > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > > tzuli...@apache.org> > > > > >> wrote: > > > > >> > > > > >>> Hi Kurt, > > > > >>> > > > > >>> With the same argument as before, given that it is mentioned in > the > > > > >>> release > > > > >>> announcement that it is a preview feature, I would not block this > > > > release > > > > >>> because of it. > > > > >>> Nevertheless, it would be important to mention this explicitly in > > the > > > > >>> release notes [1]. > > > > >>> > > > > >>> Regards, > > > > >>> Gordon > > > > >>> > > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > > >>> > > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther < > twal...@apache.org> > > > > wrote: > > > > >>> > > > > Hi Kurt, > > > > > > > > I agree that this is a serious bug. However, I would not block > the > > > > release because of this. As you said, there is a workaround and > > the > > > > `execute()` works in the most common case of a single execution. > > We > > > > can > > > > fix this in a minor release shortly after. > > > > > > > > What do others think? > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > > > HI, > > > > > > > > > > We just find a serious bug around blink planner: > > > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > > > When user reused the table environment instance, and call > > `execute` > > > > method > > > > > multiple times for > > > > > different sql, the later call will trigger the earlier ones to > be > > > > > re-executed. > > > > > > > > > > It's a serious bug but seems we also have a work around, which
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Big +1 for this feature. This FLIP can help improves at least the following two scenarios: - Temporary data peak when using Heap StateBackend - Heap State Backend has better performance than RocksDBStateBackend, especially on SATA disk. there are some guys ever told me that they increased the parallelism of operators(and use HeapStateBackend) other than use RocksDBStateBackend to get better performance. But increase parallelism will have some other problems, after this FLIP, we can run Flink Job with the same parallelism as RocksDBStateBackend and get better performance also. Best, Congxian Yu Li 于2019年8月16日周五 上午12:14写道: > Thanks all for the reviews and comments! > > bq. From the implementation plan, it looks like this exists purely in a new > module and does not require any changes in other parts of Flink's code. Can > you confirm that? > Confirmed, thanks! > > Best Regards, > Yu > > > On Thu, 15 Aug 2019 at 18:04, Tzu-Li (Gordon) Tai > wrote: > > > +1 to start a VOTE for this FLIP. > > > > Given the properties of this new state backend and that it will exist as > a > > new module without touching the original heap backend, I don't see a harm > > in including this. > > Regarding design of the feature, I've already mentioned my comments in > the > > original discussion thread. > > > > Cheers, > > Gordon > > > > On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > > > > > Big +1 for this feature. > > > > > > Our customers including me, have ever met dilemma where we have to use > > > window to aggregate events in applications like real-time monitoring. > The > > > larger of timer and window state, the poor performance of RocksDB. > > However, > > > switching to use FsStateBackend would always make me feel fear about > the > > > OOM errors. > > > > > > Look forward for more powerful enrichment to state-backend, and help > > Flink > > > to achieve better performance together. > > > > > > Best > > > Yun Tang > > > > > > From: Stephan Ewen > > > Sent: Thursday, August 15, 2019 23:07 > > > To: dev > > > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > > > > > +1 for this feature. I think this will be appreciated by users, as a > way > > to > > > use the HeapStateBackend with a safety-net against OOM errors. > > > And having had major production exposure is great. > > > > > > From the implementation plan, it looks like this exists purely in a new > > > module and does not require any changes in other parts of Flink's code. > > Can > > > you confirm that? > > > > > > Other that that, I have no further questions and we could proceed to > vote > > > on this FLIP, from my side. > > > > > > Best, > > > Stephan > > > > > > > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > > > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > > > > > Thanks! > > > > > > > > Best Regards, > > > > Yu > > > > > > > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > > > > > Hi All, > > > > > > > > > > We ever held a discussion about this feature before [1] but now > > opening > > > > > another thread because after a second thought introducing a new > > backend > > > > > instead of modifying the existing heap backend is a better option > to > > > > > prevent causing any regression or surprise to existing > in-production > > > > usage. 
> > > > > And since introducing a new backend is relatively big change, we > > regard > > > > it > > > > > as a FLIP and need another discussion and voting process according > to > > > our > > > > > newly drafted bylaw [2]. > > > > > > > > > > Please allow me to quote the brief description from the old thread > > [1] > > > > for > > > > > the convenience of those who noticed this feature for the first > time: > > > > > > > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in > Flink, > > > > > since state lives as Java objects on the heap in > > HeapKeyedStateBackend > > > > and > > > > > the de/serialization only happens during state snapshot and > restore, > > it > > > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > > > memory.**However, > > > > > along with the advantage, HeapKeyedStateBackend also has its > > > > shortcomings, > > > > > and the most painful one is the difficulty to estimate the maximum > > heap > > > > > size (Xmx) to set, and we will suffer from GC impact once the heap > > > memory > > > > > is not enough to hold all state data. There’re several (inevitable) > > > > causes > > > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of > > the > > > > > serialized data size).* Data flood caused by burst traffic.* Data > > > > > accumulation caused by source malfunction.**To resolve this > problem, > > we
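The spill mechanism described in the quoted proposal can be pictured with a toy sketch. This is purely illustrative and not the FLIP's actual design or API: a map that moves its least recently used entries to a serialized store once heap usage crosses a threshold; snapshotting, TTL handling, and reloading spilled data are omitted.
{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the spill idea only, not the FLIP's actual design: when heap
// usage crosses a threshold, move the least recently used ("coldest") entries to a
// serialized store and keep the hot entries on heap.
public class SpillingMap<K, V> {

    private final LinkedHashMap<K, V> onHeap = new LinkedHashMap<>(16, 0.75f, true); // access order = hotness
    private final Map<K, byte[]> spilled = new HashMap<>(); // stand-in for an on-disk store
    private final double spillThreshold;                    // e.g. 0.8 = spill at 80% heap usage

    public SpillingMap(double spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    public void put(K key, V value) {
        onHeap.put(key, value);
        maybeSpill();
    }

    private void maybeSpill() {
        Runtime rt = Runtime.getRuntime();
        double usedFraction = (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
        if (usedFraction < spillThreshold) {
            return;
        }
        // Spill the coldest half of the on-heap entries; a LinkedHashMap in access order
        // iterates least recently used entries first.
        int toSpill = onHeap.size() / 2;
        Iterator<Map.Entry<K, V>> coldest = onHeap.entrySet().iterator();
        for (int i = 0; i < toSpill && coldest.hasNext(); i++) {
            Map.Entry<K, V> entry = coldest.next();
            spilled.put(entry.getKey(), serialize(entry.getValue()));
            coldest.remove();
        }
    }

    private byte[] serialize(V value) {
        return String.valueOf(value).getBytes(); // placeholder for a real serializer
    }
}
{code}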
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
-1 for RC2. I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I think it's a blocker. The bug means currently if users call `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru SQL would not be able to see Flink's built-in functions. I'm preparing a fix right now. Bowen On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai wrote: > Thanks for all the test efforts, verifications and votes so far. > > So far, things are looking good, but we still require one more PMC binding > vote for this RC to be the official release, so I would like to extend the > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > In the meantime, the release notes for 1.9.0 had only just been finalized > [1], and could use a few more eyes before closing the vote. > Any help with checking if anything else should be mentioned there regarding > breaking changes / known shortcomings would be appreciated. > > Cheers, > Gordon > > [1] https://github.com/apache/flink/pull/9438 > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > Great, then I have no other comments on legal check. > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > wrote: > > > > > The licensing items aren't a problem; we don't care about Flink modules > > > in NOTICE files, and we don't have to update the source-release > > > licensing since we don't have a pre-built version of the WebUI in the > > > source. > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > After going through the licenses, I found 2 suspicions but not sure > if > > > they > > > > are > > > > valid or not. > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but > not > > > > included in > > > > NOTICE-binary file (the one under the root directory) like other > > modules. > > > > 2. flink-runtime-web distributed some JavaScript dependencies through > > > source > > > > codes, the licenses and NOTICE file were only updated inside the > module > > > of > > > > flink-runtime-web, but not the NOTICE file and licenses directory > which > > > > under > > > > the root directory. > > > > > > > > Another minor issue I just found is: > > > > FLINK-13558 tries to include table examples to flink-dist, but I > cannot > > > > find it in > > > > the binary distribution of RC2. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > > > > > >> Hi Gordon & Timo, > > > >> > > > >> Thanks for the feedback, and I agree with it. I will document this > in > > > the > > > >> release notes. > > > >> > > > >> Best, > > > >> Kurt > > > >> > > > >> > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > tzuli...@apache.org> > > > >> wrote: > > > >> > > > >>> Hi Kurt, > > > >>> > > > >>> With the same argument as before, given that it is mentioned in the > > > >>> release > > > >>> announcement that it is a preview feature, I would not block this > > > release > > > >>> because of it. > > > >>> Nevertheless, it would be important to mention this explicitly in > the > > > >>> release notes [1]. > > > >>> > > > >>> Regards, > > > >>> Gordon > > > >>> > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > >>> > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > > > wrote: > > > >>> > > > Hi Kurt, > > > > > > I agree that this is a serious bug. However, I would not block the > > > release because of this. As you said, there is a workaround and > the > > > `execute()` works in the most common case of a single execution. 
> We > > > can > > > fix this in a minor release shortly after. > > > > > > What do others think? > > > > > > Regards, > > > Timo > > > > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > > HI, > > > > > > > > We just find a serious bug around blink planner: > > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > > When user reused the table environment instance, and call > `execute` > > > method > > > > multiple times for > > > > different sql, the later call will trigger the earlier ones to be > > > > re-executed. > > > > > > > > It's a serious bug but seems we also have a work around, which is > > > >>> never > > > > reuse the table environment > > > > object. I'm not sure if we should treat this one as blocker issue > > of > > > 1.9.0. > > > > What's your opinion? > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao > > wrote: > > > > > > > >> +1 (non-binding) > > > >> > > > >> Jepsen test suite passed 10 times consecutively > > > >> > > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > > > >>> aljos...@apache.org> > > > >> wrote: > > > >> > > > >>> +1 > > > >>> > > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives > >
[jira] [Created] (FLINK-13741) FunctionCatalog.getUserDefinedFunctions() does not return Flink built-in functions' names
Bowen Li created FLINK-13741: Summary: FunctionCatalog.getUserDefinedFunctions() does not return Flink built-in functions' names Key: FLINK-13741 URL: https://issues.apache.org/jira/browse/FLINK-13741 Project: Flink Issue Type: Bug Components: Table SQL / API Affects Versions: 1.9.0 Reporter: Bowen Li Assignee: Bowen Li Fix For: 1.9.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
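For illustration only (the names below are assumptions, not the actual FunctionCatalog API): the listing users expect from `show functions;` would merge registered UDF names with the built-in function names.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only, not the actual FunctionCatalog code: a listing that merges
// registered UDF names with built-in function names, which is what `show functions;`
// is expected to return.
public class FunctionListing {

    private final Set<String> registeredUdfs = new HashSet<>();

    public void registerUdf(String name) {
        registeredUdfs.add(name.toLowerCase());
    }

    /** Returns registered UDF names plus built-in names, de-duplicated and sorted. */
    public List<String> listFunctions(Set<String> builtInFunctionNames) {
        return Stream.concat(registeredUdfs.stream(), builtInFunctionNames.stream())
            .map(String::toLowerCase)
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }
}
{code}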
[jira] [Created] (FLINK-13740) TableAggregateITCase.testNonkeyedFlatAggregate failed on Travis
Till Rohrmann created FLINK-13740: - Summary: TableAggregateITCase.testNonkeyedFlatAggregate failed on Travis Key: FLINK-13740 URL: https://issues.apache.org/jira/browse/FLINK-13740 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 The {{TableAggregateITCase.testNonkeyedFlatAggregate}} failed on Travis with {code} org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.table.planner.runtime.stream.table.TableAggregateITCase.testNonkeyedFlatAggregate(TableAggregateITCase.scala:93) Caused by: java.lang.Exception: Artificial Failure {code} https://api.travis-ci.com/v3/job/225551182/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Thanks all for the reviews and comments! bq. From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Confirmed, thanks! Best Regards, Yu On Thu, 15 Aug 2019 at 18:04, Tzu-Li (Gordon) Tai wrote: > +1 to start a VOTE for this FLIP. > > Given the properties of this new state backend and that it will exist as a > new module without touching the original heap backend, I don't see a harm > in including this. > Regarding design of the feature, I've already mentioned my comments in the > original discussion thread. > > Cheers, > Gordon > > On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > > > Big +1 for this feature. > > > > Our customers including me, have ever met dilemma where we have to use > > window to aggregate events in applications like real-time monitoring. The > > larger of timer and window state, the poor performance of RocksDB. > However, > > switching to use FsStateBackend would always make me feel fear about the > > OOM errors. > > > > Look forward for more powerful enrichment to state-backend, and help > Flink > > to achieve better performance together. > > > > Best > > Yun Tang > > > > From: Stephan Ewen > > Sent: Thursday, August 15, 2019 23:07 > > To: dev > > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > > > +1 for this feature. I think this will be appreciated by users, as a way > to > > use the HeapStateBackend with a safety-net against OOM errors. > > And having had major production exposure is great. > > > > From the implementation plan, it looks like this exists purely in a new > > module and does not require any changes in other parts of Flink's code. > Can > > you confirm that? > > > > Other that that, I have no further questions and we could proceed to vote > > on this FLIP, from my side. > > > > Best, > > Stephan > > > > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > > > Thanks! > > > > > > Best Regards, > > > Yu > > > > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > > > Hi All, > > > > > > > > We ever held a discussion about this feature before [1] but now > opening > > > > another thread because after a second thought introducing a new > backend > > > > instead of modifying the existing heap backend is a better option to > > > > prevent causing any regression or surprise to existing in-production > > > usage. > > > > And since introducing a new backend is relatively big change, we > regard > > > it > > > > as a FLIP and need another discussion and voting process according to > > our > > > > newly drafted bylaw [2]. 
> > > > > > > > Please allow me to quote the brief description from the old thread > [1] > > > for > > > > the convenience of those who noticed this feature for the first time: > > > > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > > > since state lives as Java objects on the heap in > HeapKeyedStateBackend > > > and > > > > the de/serialization only happens during state snapshot and restore, > it > > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > > memory.**However, > > > > along with the advantage, HeapKeyedStateBackend also has its > > > shortcomings, > > > > and the most painful one is the difficulty to estimate the maximum > heap > > > > size (Xmx) to set, and we will suffer from GC impact once the heap > > memory > > > > is not enough to hold all state data. There’re several (inevitable) > > > causes > > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of > the > > > > serialized data size).* Data flood caused by burst traffic.* Data > > > > accumulation caused by source malfunction.**To resolve this problem, > we > > > > proposed a solution to support spilling state data to disk before > heap > > > > memory is exhausted. We will monitor the heap usage and choose the > > > coldest > > > > data to spill, and reload them when heap memory is regained after > data > > > > removing or TTL expiration, automatically. Furthermore, *to prevent > > > > causing unexpected regression to existing usage of > > HeapKeyedStateBackend, > > > > we plan to introduce a new SpillableHeapKeyedStateBackend and change > it > > > to > > > > default in future if proven to be stable. > > > > > > > > Please let us know your point of the feature and any comment is > > > > welcomed/appreciated. Thanks. > > > > > > > > [1] https://s.apache.org/pxeif > > > > [2] > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > > > > > Best Regards, > > > > Yu > > > > > > > > > >
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
+1 to start a VOTE for this FLIP. Given the properties of this new state backend and that it will exist as a new module without touching the original heap backend, I don't see a harm in including this. Regarding design of the feature, I've already mentioned my comments in the original discussion thread. Cheers, Gordon On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > Big +1 for this feature. > > Our customers including me, have ever met dilemma where we have to use > window to aggregate events in applications like real-time monitoring. The > larger of timer and window state, the poor performance of RocksDB. However, > switching to use FsStateBackend would always make me feel fear about the > OOM errors. > > Look forward for more powerful enrichment to state-backend, and help Flink > to achieve better performance together. > > Best > Yun Tang > > From: Stephan Ewen > Sent: Thursday, August 15, 2019 23:07 > To: dev > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > +1 for this feature. I think this will be appreciated by users, as a way to > use the HeapStateBackend with a safety-net against OOM errors. > And having had major production exposure is great. > > From the implementation plan, it looks like this exists purely in a new > module and does not require any changes in other parts of Flink's code. Can > you confirm that? > > Other that that, I have no further questions and we could proceed to vote > on this FLIP, from my side. > > Best, > Stephan > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > Thanks! > > > > Best Regards, > > Yu > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > Hi All, > > > > > > We ever held a discussion about this feature before [1] but now opening > > > another thread because after a second thought introducing a new backend > > > instead of modifying the existing heap backend is a better option to > > > prevent causing any regression or surprise to existing in-production > > usage. > > > And since introducing a new backend is relatively big change, we regard > > it > > > as a FLIP and need another discussion and voting process according to > our > > > newly drafted bylaw [2]. > > > > > > Please allow me to quote the brief description from the old thread [1] > > for > > > the convenience of those who noticed this feature for the first time: > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > > since state lives as Java objects on the heap in HeapKeyedStateBackend > > and > > > the de/serialization only happens during state snapshot and restore, it > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > memory.**However, > > > along with the advantage, HeapKeyedStateBackend also has its > > shortcomings, > > > and the most painful one is the difficulty to estimate the maximum heap > > > size (Xmx) to set, and we will suffer from GC impact once the heap > memory > > > is not enough to hold all state data. 
There’re several (inevitable) > > causes > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > > serialized data size).* Data flood caused by burst traffic.* Data > > > accumulation caused by source malfunction.**To resolve this problem, we > > > proposed a solution to support spilling state data to disk before heap > > > memory is exhausted. We will monitor the heap usage and choose the > > coldest > > > data to spill, and reload them when heap memory is regained after data > > > removing or TTL expiration, automatically. Furthermore, *to prevent > > > causing unexpected regression to existing usage of > HeapKeyedStateBackend, > > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > > to > > > default in future if proven to be stable. > > > > > > Please let us know your point of the feature and any comment is > > > welcomed/appreciated. Thanks. > > > > > > [1] https://s.apache.org/pxeif > > > [2] > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > > > Best Regards, > > > Yu > > > > > >
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Thanks for all the test efforts, verifications and votes so far. So far, things are looking good, but we still require one more PMC binding vote for this RC to be the official release, so I would like to extend the vote time for 1 more day, until *Aug. 16th 17:00 CET*. In the meantime, the release notes for 1.9.0 had only just been finalized [1], and could use a few more eyes before closing the vote. Any help with checking if anything else should be mentioned there regarding breaking changes / known shortcomings would be appreciated. Cheers, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > Great, then I have no other comments on legal check. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > wrote: > > > The licensing items aren't a problem; we don't care about Flink modules > > in NOTICE files, and we don't have to update the source-release > > licensing since we don't have a pre-built version of the WebUI in the > > source. > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > After going through the licenses, I found 2 suspicions but not sure if > > they > > > are > > > valid or not. > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > > > included in > > > NOTICE-binary file (the one under the root directory) like other > modules. > > > 2. flink-runtime-web distributed some JavaScript dependencies through > > source > > > codes, the licenses and NOTICE file were only updated inside the module > > of > > > flink-runtime-web, but not the NOTICE file and licenses directory which > > > under > > > the root directory. > > > > > > Another minor issue I just found is: > > > FLINK-13558 tries to include table examples to flink-dist, but I cannot > > > find it in > > > the binary distribution of RC2. > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > > > >> Hi Gordon & Timo, > > >> > > >> Thanks for the feedback, and I agree with it. I will document this in > > the > > >> release notes. > > >> > > >> Best, > > >> Kurt > > >> > > >> > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > tzuli...@apache.org> > > >> wrote: > > >> > > >>> Hi Kurt, > > >>> > > >>> With the same argument as before, given that it is mentioned in the > > >>> release > > >>> announcement that it is a preview feature, I would not block this > > release > > >>> because of it. > > >>> Nevertheless, it would be important to mention this explicitly in the > > >>> release notes [1]. > > >>> > > >>> Regards, > > >>> Gordon > > >>> > > >>> [1] https://github.com/apache/flink/pull/9438 > > >>> > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > > wrote: > > >>> > > Hi Kurt, > > > > I agree that this is a serious bug. However, I would not block the > > release because of this. As you said, there is a workaround and the > > `execute()` works in the most common case of a single execution. We > > can > > fix this in a minor release shortly after. > > > > What do others think? > > > > Regards, > > Timo > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > HI, > > > > > > We just find a serious bug around blink planner: > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > When user reused the table environment instance, and call `execute` > > method > > > multiple times for > > > different sql, the later call will trigger the earlier ones to be > > > re-executed. 
> > > > > > It's a serious bug but seems we also have a work around, which is > > >>> never > > > reuse the table environment > > > object. I'm not sure if we should treat this one as blocker issue > of > > 1.9.0. > > > What's your opinion? > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao > wrote: > > > > > >> +1 (non-binding) > > >> > > >> Jepsen test suite passed 10 times consecutively > > >> > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > > >>> aljos...@apache.org> > > >> wrote: > > >> > > >>> +1 > > >>> > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives > you > > >>> a > > >>> managed YARN and Google Cloud Storage (GCS)): > > >>> - tried both YARN session mode and YARN per-job mode, also > > using > > >>> bin/flink list/cancel/etc. against a YARN session cluster > > >>> - ran examples that write to GCS, both with the native Hadoop > > >> FileSystem > > >>> and a custom “plugin” FileSystem > > >>> - ran stateful streaming jobs that use GCS as a checkpoint > > >>> backend > > >>> - tried running SQL programs on YARN using the SQL Cli: this > > >>> worked > > for > > >>> YARN session mode but not for YARN per-job mode. Looking at the > > >>> code I > > >>> don’t thi
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Big +1 for this feature. Our customers including me, have ever met dilemma where we have to use window to aggregate events in applications like real-time monitoring. The larger of timer and window state, the poor performance of RocksDB. However, switching to use FsStateBackend would always make me feel fear about the OOM errors. Look forward for more powerful enrichment to state-backend, and help Flink to achieve better performance together. Best Yun Tang From: Stephan Ewen Sent: Thursday, August 15, 2019 23:07 To: dev Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend +1 for this feature. I think this will be appreciated by users, as a way to use the HeapStateBackend with a safety-net against OOM errors. And having had major production exposure is great. >From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Other that that, I have no further questions and we could proceed to vote on this FLIP, from my side. Best, Stephan On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > Sorry for forgetting to give the link of the FLIP, here it is: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > Thanks! > > Best Regards, > Yu > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > Hi All, > > > > We ever held a discussion about this feature before [1] but now opening > > another thread because after a second thought introducing a new backend > > instead of modifying the existing heap backend is a better option to > > prevent causing any regression or surprise to existing in-production > usage. > > And since introducing a new backend is relatively big change, we regard > it > > as a FLIP and need another discussion and voting process according to our > > newly drafted bylaw [2]. > > > > Please allow me to quote the brief description from the old thread [1] > for > > the convenience of those who noticed this feature for the first time: > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > since state lives as Java objects on the heap in HeapKeyedStateBackend > and > > the de/serialization only happens during state snapshot and restore, it > > outperforms RocksDBKeyeStateBackend when all data could reside in > memory.**However, > > along with the advantage, HeapKeyedStateBackend also has its > shortcomings, > > and the most painful one is the difficulty to estimate the maximum heap > > size (Xmx) to set, and we will suffer from GC impact once the heap memory > > is not enough to hold all state data. There’re several (inevitable) > causes > > for such scenario, including (but not limited to):* > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > serialized data size).* Data flood caused by burst traffic.* Data > > accumulation caused by source malfunction.**To resolve this problem, we > > proposed a solution to support spilling state data to disk before heap > > memory is exhausted. We will monitor the heap usage and choose the > coldest > > data to spill, and reload them when heap memory is regained after data > > removing or TTL expiration, automatically. Furthermore, *to prevent > > causing unexpected regression to existing usage of HeapKeyedStateBackend, > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > to > > default in future if proven to be stable. 
> > > > Please let us know your point of the feature and any comment is > > welcomed/appreciated. Thanks. > > > > [1] https://s.apache.org/pxeif > > [2] > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > Best Regards, > > Yu > > >
Re: [DISCUSS] FLIP-48: Pluggable Intermediate Result Storage
Sorry for the late response. So many FLIPs these days. I am a bit unsure about the motivation here, and that this need to be a part of Flink. It sounds like this can be perfectly built around Flink as a minimal library on top of it, without any change in the core APIs or runtime. The proposal to handle "caching intermediate results" (to make them reusable across jobs in a session), and "writing them in different formats / indexing them" doesn't sound like it should be the same mechanism. - The caching part is a transparent low-level primitive. It avoid re-executing a part of the job graph, but otherwise is completely transparent to the consumer job. - Writing data out in a sink, compressing/indexing it and then reading it in another job is also a way of reusing a previous result, but on a completely different abstraction level. It is not the same intermediate result any more. When the consumer reads from it and applies predicate pushdown, etc. then the consumer job looks completely different from a job that consumed the original result. It hence needs to be solved on the API level via a sink and a source. I would suggest to keep these concepts separate: Caching (possibly automatically) for jobs in a session, and long term writing/sharing of data sets. Solving the "long term writing/sharing" in a library rather than in the runtime also has the advantage of not pushing yet more stuff into Flink's core, which I believe is also an important criterion. Best, Stephan On Thu, Jul 25, 2019 at 4:53 AM Xuannan Su wrote: > Hi folks, > > I would like to start the FLIP discussion thread about the pluggable > intermediate result storage. > > This is phase 2 of FLIP-36: Support Interactive Programming in Flink Skip > to end of metadata. While the FLIP-36 provides a default implementation of > the intermediate result storage using the shuffle service, we would like to > make the intermediate result storage pluggable so that the user can easily > swap the storage. > > We are looking forward to your thought! > > The FLIP link is the following: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-48%3A+Pluggable+Intermediate+Result+Storage > < > https://cwiki.apache.org/confluence/display/FLINK/FLIP-48:+Pluggable+Intermediate+Result+Storage > > > > Best, > Xuannan >
[jira] [Created] (FLINK-13739) BinaryRowTest.testWriteString() fails in some environments
Robert Metzger created FLINK-13739: -- Summary: BinaryRowTest.testWriteString() fails in some environments Key: FLINK-13739 URL: https://issues.apache.org/jira/browse/FLINK-13739 Project: Flink Issue Type: Task Components: Table SQL / Runtime Affects Versions: 1.9.0 Environment: Reporter: Robert Metzger
{code:java}
Test set: org.apache.flink.table.dataformat.BinaryRowTest
---
Tests run: 26, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.328 s <<< FAILURE! - in org.apache.flink.table.dataformat.BinaryRowTest
testWriteString(org.apache.flink.table.dataformat.BinaryRowTest) Time elapsed: 0.05 s <<< FAILURE!
org.junit.ComparisonFailure: expected:<[<95><95><95><95><95><88><91><98> <90><9A><84><89><88><8C>]> but was:<[?]>
    at org.apache.flink.table.dataformat.BinaryRowTest.testWriteString(BinaryRowTest.java:189)
{code}
This error happens on a Google Cloud n2-standard-16 (16 vCPUs, 64 GB memory) machine.
{code}
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
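A plausible, but unconfirmed, mechanism for the `?` in the assertion above is platform default charset handling. The snippet below only demonstrates that general Java behavior and makes no claim about BinaryRow's actual code path; the sample string is hypothetical.
{code:java}
import java.nio.charset.StandardCharsets;

// Demonstrates a general Java pitfall that matches the symptom above (not a confirmed
// root cause): encoding a non-ASCII string with the platform default charset
// (e.g. US-ASCII on a minimal image) replaces unmappable characters with '?', while an
// explicit UTF-8 round trip preserves them. The sample string is hypothetical.
public class DefaultCharsetDemo {

    public static void main(String[] args) {
        String s = "啊啊啊啊";

        byte[] platformDefault = s.getBytes();             // depends on file.encoding / locale
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // always well-defined

        System.out.println(new String(platformDefault));               // may print "????" on ASCII locales
        System.out.println(new String(utf8, StandardCharsets.UTF_8));  // prints the original string
    }
}
{code}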
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
+1 for this feature. I think this will be appreciated by users, as a way to use the HeapStateBackend with a safety-net against OOM errors. And having had major production exposure is great. >From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Other that that, I have no further questions and we could proceed to vote on this FLIP, from my side. Best, Stephan On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > Sorry for forgetting to give the link of the FLIP, here it is: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > Thanks! > > Best Regards, > Yu > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > Hi All, > > > > We ever held a discussion about this feature before [1] but now opening > > another thread because after a second thought introducing a new backend > > instead of modifying the existing heap backend is a better option to > > prevent causing any regression or surprise to existing in-production > usage. > > And since introducing a new backend is relatively big change, we regard > it > > as a FLIP and need another discussion and voting process according to our > > newly drafted bylaw [2]. > > > > Please allow me to quote the brief description from the old thread [1] > for > > the convenience of those who noticed this feature for the first time: > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > since state lives as Java objects on the heap in HeapKeyedStateBackend > and > > the de/serialization only happens during state snapshot and restore, it > > outperforms RocksDBKeyeStateBackend when all data could reside in > memory.**However, > > along with the advantage, HeapKeyedStateBackend also has its > shortcomings, > > and the most painful one is the difficulty to estimate the maximum heap > > size (Xmx) to set, and we will suffer from GC impact once the heap memory > > is not enough to hold all state data. There’re several (inevitable) > causes > > for such scenario, including (but not limited to):* > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > serialized data size).* Data flood caused by burst traffic.* Data > > accumulation caused by source malfunction.**To resolve this problem, we > > proposed a solution to support spilling state data to disk before heap > > memory is exhausted. We will monitor the heap usage and choose the > coldest > > data to spill, and reload them when heap memory is regained after data > > removing or TTL expiration, automatically. Furthermore, *to prevent > > causing unexpected regression to existing usage of HeapKeyedStateBackend, > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > to > > default in future if proven to be stable. > > > > Please let us know your point of the feature and any comment is > > welcomed/appreciated. Thanks. > > > > [1] https://s.apache.org/pxeif > > [2] > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > Best Regards, > > Yu > > >
[jira] [Created] (FLINK-13738) NegativeArraySizeException in LongHybridHashTable
Robert Metzger created FLINK-13738: -- Summary: NegativeArraySizeException in LongHybridHashTable Key: FLINK-13738 URL: https://issues.apache.org/jira/browse/FLINK-13738 Project: Flink Issue Type: Task Components: Table SQL / Runtime Affects Versions: 1.9.0 Reporter: Robert Metzger Executing this (meaningless) query: {code:java} INSERT INTO sinkTable ( SELECT CONCAT( CAST( id AS VARCHAR), CAST( COUNT(*) AS VARCHAR)) as something, 'const' FROM CsvTable, table1 WHERE sometxt LIKE 'a%' AND id = key GROUP BY id ) {code} leads to the following exception: {code:java} Caused by: java.lang.NegativeArraySizeException at org.apache.flink.table.runtime.hashtable.LongHybridHashTable.tryDenseMode(LongHybridHashTable.java:216) at org.apache.flink.table.runtime.hashtable.LongHybridHashTable.endBuild(LongHybridHashTable.java:105) at LongHashJoinOperator$36.endInput1$(Unknown Source) at LongHashJoinOperator$36.endInput(Unknown Source) at org.apache.flink.streaming.runtime.tasks.OperatorChain.endInput(OperatorChain.java:256) at org.apache.flink.streaming.runtime.io.StreamTwoInputSelectableProcessor.checkFinished(StreamTwoInputSelectableProcessor.java:359) at org.apache.flink.streaming.runtime.io.StreamTwoInputSelectableProcessor.processInput(StreamTwoInputSelectableProcessor.java:193) at org.apache.flink.streaming.runtime.tasks.StreamTask.performDefaultAction(StreamTask.java:276) at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:298) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:403) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:687) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:517) at java.lang.Thread.run(Thread.java:748){code} This is the plan: {code:java} == Abstract Syntax Tree == LogicalSink(name=[sinkTable], fields=[f0, f1]) +- LogicalProject(something=[CONCAT(CAST($0):VARCHAR(2147483647) CHARACTER SET "UTF-16LE", CAST($1):VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL)], EXPR$1=[_UTF-16LE'const']) +- LogicalAggregate(group=[ {0} ], agg#0=[COUNT()]) +- LogicalProject(id=[$1]) +- LogicalFilter(condition=[AND(LIKE($0, _UTF-16LE'a%'), =($1, CAST($2):BIGINT))]) +- LogicalJoin(condition=[true], joinType=[inner]) :- LogicalTableScan(table=[[default_catalog, default_database, CsvTable, source: [CsvTableSource(read fields: sometxt, id)]]]) +- LogicalTableScan(table=[[default_catalog, default_database, table1, source: [GeneratorTableSource(key, rowtime, payload)]]]) == Optimized Logical Plan == Sink(name=[sinkTable], fields=[f0, f1]): rowcount = 1498810.6659336376, cumulative cost = {4.459964319978008E8 rows, 1.879799762133187E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- Calc(select=[CONCAT(CAST(id), CAST($f1)) AS something, _UTF-16LE'const' AS EXPR$1]): rowcount = 1498810.6659336376, cumulative cost = {4.444976213318672E8 rows, 1.8796498810665936E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- HashAggregate(isMerge=[false], groupBy=[id], select=[id, COUNT(*) AS $f1]): rowcount = 1498810.6659336376, cumulative cost = {4.429988106659336E8 rows, 1.8795E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- Calc(select=[id]): rowcount = 1.575E7, cumulative cost = {4.415E8 rows, 1.848E10 cpu, 4.8E9 io, 8.4E8 network, 1.2E8 memory} +- HashJoin(joinType=[InnerJoin], where=[=(id, key0)], select=[id, key0], build=[left]): rowcount = 1.575E7, cumulative cost = {4.2575E8 rows, 1.848E10 cpu, 4.8E9 io, 8.4E8 network, 1.2E8 memory} :- Exchange(distribution=[hash[id]]): rowcount = 500.0, 
cumulative cost = {1.1E8 rows, 8.4E8 cpu, 2.0E9 io, 4.0E7 network, 0.0 memory} : +- Calc(select=[id], where=[LIKE(sometxt, _UTF-16LE'a%')]): rowcount = 500.0, cumulative cost = {1.05E8 rows, 0.0 cpu, 2.0E9 io, 0.0 network, 0.0 memory} : +- TableSourceScan(table=[[default_catalog, default_database, CsvTable, source: [CsvTableSource(read fields: sometxt, id)]]], fields=[sometxt, id]): rowcount = 1.0E8, cumulative cost = {1.0E8 rows, 0.0 cpu, 2.0E9 io, 0.0 network, 0.0 memory} +- Exchange(distribution=[hash[key0]]): rowcount = 1.0E8, cumulative cost = {3.0E8 rows, 1.68E10 cpu, 2.8E9 io, 8.0E8 network, 0.0 memory} +- Calc(select=[CAST(key) AS key0]): rowcount = 1.0E8, cumulative cost = {2.0E8 rows, 0.0 cpu, 2.8E9 io, 0.0 network, 0.0 memory} +- TableSourceScan(table=[[default_catalog, default_database, table1, source: [GeneratorTableSource(key, rowtime, payload)]]], fields=[key, rowtime, payload]): rowcount = 1.0E8, cumulative cost = {1.0E8 rows, 0.0 cpu, 2.8E9 io, 0.0 network, 0.0 memory} == Physical Execution Plan == Stage 1 : Data Source content : collect elements with CollectionInputFormat Stage 2 : Operator content : CsvTableSource(read fields: sometxt, id) ship_strategy : REBALANCE Stage 3 : Operator content : SourceConversi
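The report above does not identify the root cause. One plausible failure mode, offered here only as a hypothesis and not taken from the issue, is that the dense-mode lookup array is sized directly from the build-side key range and that computation overflows int for large ranges. A minimal, self-contained sketch of that effect (not Flink code):
{code:java}
// Hypothetical illustration: sizing a dense lookup array directly from the
// build-side key range goes negative once the range no longer fits in an int.
public class DenseModeSizingSketch {

    // Suppose the build side produced long keys in [minKey, maxKey].
    static int denseArraySize(long minKey, long maxKey) {
        long range = maxKey - minKey + 1;
        // Casting a range larger than Integer.MAX_VALUE truncates and can yield
        // a negative value, which then fails with NegativeArraySizeException
        // as soon as it is used in an allocation such as `new long[size]`.
        return (int) range;
    }

    public static void main(String[] args) {
        int size = denseArraySize(0L, 3_000_000_000L);
        System.out.println(size); // negative

        // A guarded version would check the range as a long first and fall back
        // to the regular hash mode instead of dense mode.
        long range = 3_000_000_000L - 0L + 1;
        System.out.println("dense mode applicable: "
                + (range > 0 && range <= Integer.MAX_VALUE)); // false
    }
}
{code}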
[DISCUSS] FLIP-53: Fine Grained Resource Management
Hi everyone, We would like to start a discussion thread on "FLIP-53: Fine Grained Resource Management"[1], where we propose how to improve Flink resource management and scheduling. This FLIP mainly discusses the following issues. - How to support tasks with fine grained resource requirements. - How to unify resource management for jobs with / without fine grained resource requirements. - How to unify resource management for streaming / batch jobs. Key changes proposed in the FLIP are as follows. - Unify memory management for operators with / without fine grained resource requirements by applying a fraction based quota mechanism. - Unify resource scheduling for streaming and batch jobs by setting slot sharing groups for pipelined regions during the compilation stage. - Dynamically allocate slots from task executors' available resources. Please find more details in the FLIP wiki document [1]. Looking forward to your feedback. Thank you~ Xintong Song [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
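The fraction based quota mechanism mentioned above can be illustrated with a small sketch. The class, operator names and numbers below are assumptions made purely for illustration and are not the FLIP's actual API; the sketch only shows how relative operator weights could be turned into absolute managed-memory budgets within one slot.
{code:java}
// Illustrative sketch only (names and numbers are assumptions, not the FLIP's
// actual classes): turning relative operator weights into absolute
// managed-memory budgets within a single slot.
import java.util.LinkedHashMap;
import java.util.Map;

public class FractionQuotaSketch {

    // Each operator in a slot declares a relative weight; operators without a
    // fine grained requirement could share a default weight.
    static Map<String, Long> splitManagedMemory(long slotManagedBytes, Map<String, Double> weights) {
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Long> budgets = new LinkedHashMap<>();
        weights.forEach((op, w) -> budgets.put(op, (long) (slotManagedBytes * (w / total))));
        return budgets;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("HashJoin", 2.0);
        weights.put("HashAggregate", 1.0);
        weights.put("Calc", 0.0); // stateless operator, no managed memory
        System.out.println(splitManagedMemory(512L << 20, weights));
        // e.g. {HashJoin=357913941, HashAggregate=178956970, Calc=0}
    }
}
{code}
With such a scheme, operators that declare no fine grained requirement could simply be assigned a default weight, which is one way to keep jobs with and without explicit resource requirements under the same mechanism.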
Re: [VOTE] FLIP-51: Rework of the Expression Design
+1 for this. Thanks, Timo On 15.08.19 at 15:57, JingsongLee wrote: Hi Flink devs, I would like to start the voting for FLIP-51 Rework of the Expression Design. FLIP wiki: https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html Google Doc: https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing Thanks, Best, Jingsong Lee
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Great, then I have no other comments on legal check. Best, Kurt On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler wrote: > The licensing items aren't a problem; we don't care about Flink modules > in NOTICE files, and we don't have to update the source-release > licensing since we don't have a pre-built version of the WebUI in the > source. > > On 15/08/2019 15:22, Kurt Young wrote: > > After going through the licenses, I found 2 suspicions but not sure if > they > > are > > valid or not. > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > > included in > > NOTICE-binary file (the one under the root directory) like other modules. > > 2. flink-runtime-web distributed some JavaScript dependencies through > source > > codes, the licenses and NOTICE file were only updated inside the module > of > > flink-runtime-web, but not the NOTICE file and licenses directory which > > under > > the root directory. > > > > Another minor issue I just found is: > > FLINK-13558 tries to include table examples to flink-dist, but I cannot > > find it in > > the binary distribution of RC2. > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > >> Hi Gordon & Timo, > >> > >> Thanks for the feedback, and I agree with it. I will document this in > the > >> release notes. > >> > >> Best, > >> Kurt > >> > >> > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > tzuli...@apache.org> > >> wrote: > >> > >>> Hi Kurt, > >>> > >>> With the same argument as before, given that it is mentioned in the > >>> release > >>> announcement that it is a preview feature, I would not block this > release > >>> because of it. > >>> Nevertheless, it would be important to mention this explicitly in the > >>> release notes [1]. > >>> > >>> Regards, > >>> Gordon > >>> > >>> [1] https://github.com/apache/flink/pull/9438 > >>> > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > wrote: > >>> > Hi Kurt, > > I agree that this is a serious bug. However, I would not block the > release because of this. As you said, there is a workaround and the > `execute()` works in the most common case of a single execution. We > can > fix this in a minor release shortly after. > > What do others think? > > Regards, > Timo > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > HI, > > > > We just find a serious bug around blink planner: > > https://issues.apache.org/jira/browse/FLINK-13708 > > When user reused the table environment instance, and call `execute` > method > > multiple times for > > different sql, the later call will trigger the earlier ones to be > > re-executed. > > > > It's a serious bug but seems we also have a work around, which is > >>> never > > reuse the table environment > > object. I'm not sure if we should treat this one as blocker issue of > 1.9.0. > > What's your opinion? > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > >> +1 (non-binding) > >> > >> Jepsen test suite passed 10 times consecutively > >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > >>> aljos...@apache.org> > >> wrote: > >> > >>> +1 > >>> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you > >>> a > >>> managed YARN and Google Cloud Storage (GCS)): > >>> - tried both YARN session mode and YARN per-job mode, also > using > >>> bin/flink list/cancel/etc. 
against a YARN session cluster > >>> - ran examples that write to GCS, both with the native Hadoop > >> FileSystem > >>> and a custom “plugin” FileSystem > >>> - ran stateful streaming jobs that use GCS as a checkpoint > >>> backend > >>> - tried running SQL programs on YARN using the SQL Cli: this > >>> worked > for > >>> YARN session mode but not for YARN per-job mode. Looking at the > >>> code I > >>> don’t think per-job mode would work from seeing how it is > >>> implemented. > >> But > >>> I think it’s an OK restriction to have for now > >>> - in all the testing I had fine-grained recovery (region > >>> failover) > >>> enabled but I didn’t simulate any failures > >>> > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > Hi, > > Thanks for preparing this release candidate. I have verified the > >>> following: > - verified the checksums and GPG files match the corresponding > >>> release > >>> files > - verified that the source archives do not contains any binaries > - build the source release with Scala 2.11 successfully. > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > >>> [FLINK-13688], > but > both are not release blockers. Other than that, all tests are > >>> passed. > - ra
[VOTE] FLIP-51: Rework of the Expression Design
Hi Flink devs, I would like to start the voting for FLIP-51 Rework of the Expression Design. FLIP wiki: https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html Google Doc: https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing Thanks, Best, Jingsong Lee
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
The licensing items aren't a problem; we don't care about Flink modules in NOTICE files, and we don't have to update the source-release licensing since we don't have a pre-built version of the WebUI in the source. On 15/08/2019 15:22, Kurt Young wrote: After going through the licenses, I found 2 suspicions but not sure if they are valid or not. 1. flink-state-processing-api is packaged in to flink-dist jar, but not included in NOTICE-binary file (the one under the root directory) like other modules. 2. flink-runtime-web distributed some JavaScript dependencies through source codes, the licenses and NOTICE file were only updated inside the module of flink-runtime-web, but not the NOTICE file and licenses directory which under the root directory. Another minor issue I just found is: FLINK-13558 tries to include table examples to flink-dist, but I cannot find it in the binary distribution of RC2. Best, Kurt On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: Hi Gordon & Timo, Thanks for the feedback, and I agree with it. I will document this in the release notes. Best, Kurt On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai wrote: Hi Kurt, With the same argument as before, given that it is mentioned in the release announcement that it is a preview feature, I would not block this release because of it. Nevertheless, it would be important to mention this explicitly in the release notes [1]. Regards, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: +1 (non-binding) Jepsen test suite passed 10 times consecutively On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < aljos...@apache.org> wrote: +1 I did some testing on a Google Cloud Dataproc cluster (it gives you a managed YARN and Google Cloud Storage (GCS)): - tried both YARN session mode and YARN per-job mode, also using bin/flink list/cancel/etc. against a YARN session cluster - ran examples that write to GCS, both with the native Hadoop FileSystem and a custom “plugin” FileSystem - ran stateful streaming jobs that use GCS as a checkpoint backend - tried running SQL programs on YARN using the SQL Cli: this worked for YARN session mode but not for YARN per-job mode. Looking at the code I don’t think per-job mode would work from seeing how it is implemented. But I think it’s an OK restriction to have for now - in all the testing I had fine-grained recovery (region failover) enabled but I didn’t simulate any failures On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. 
I have verified the following: - verified the checksums and GPG files match the corresponding release files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and [FLINK-13688], but both are not release blockers. Other than that, all tests are passed. - ran all e2e tests which don't need download external packages (it's very unstable in China and almost impossible to download them), all passed. - started local cluster, ran some examples. Met a small website display issue [FLINK-13591], which is also not a release blocker. Although we have pushed some fixes around blink planner and hive integration after RC2, but consider these are both preview features, I'm lean to be ok to release without these fixes. +1 from my side. (binding) Best, Kurt On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: Hi Gordon, I have verified the following things: - build the source release with Scala 2.12 and Scala 2.11 successfully - checked/verified signatures and hashes - checked that all POM files point to the same version - ran some flink table related end-to-end tests locally and succeeded (except TPC-H e2e failed which is reported in FLINK-13704) - started cluster for both Scala 2.11 and 2.12, ran examples, verified web ui and log output, nothing unexpected - started cluster, ran a SQL query to temporal join with kafka so
Re: [DISCUSS] Reducing build times
Hi all! Thanks for starting this discussion. I'd like to also add my 2 cents: +1 for #2, differential build scripts. I've worked on this approach, and with it I think it's possible to reduce the total build time with relatively low effort, without enforcing any new build tool and with low maintenance cost. You can check a proposed change (for the old CI setup, when Flink PRs were running in the Apache common CI pool) here: https://github.com/apache/flink/pull/9065 In the proposed change, the dependency check is not heavily hardcoded and just uses maven's results for dependency graph analysis. > This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; > i.e. a change in flink-core _must_ result in testing all modules. Agreed, in Flink's case there are some core modules that would trigger a full test run with such an approach. For developers who modify such components, the build time would be the longest. But this approach should really help developers who touch more-or-less independent modules. Even for core modules, it's possible to create "abstraction" barriers by changing the dependency graph. For example, it could look like: flink-core-api <-- flink-core, flink-core-api <-- flink-connectors. In that case, only a change in flink-core-api would trigger a full test run. +1 for #3, separating PR CI runs into different stages. Imo, it may require more changes to the current CI setup compared to #2, and ideally it should not be simplistic. Best would be if it integrates with the Flink bot and triggers some follow-up build steps only when certain prerequisites are done. +1 for #4, to move some tests into cron runs. But imo this does not scale well; it applies only to a small subset of tests. +1 for #6, to use other CI service(s). More specifically, GitHub provides free build actions that can be used to offload some build steps/PR checks. This can help move some PR checks out of the main CI build (for example: documentation builds, license checks, code formatting checks). Regards, Aleksey On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann wrote: > Thanks for starting this discussion Chesnay. I think it has become obvious > to the Flink community that with the existing build setup we cannot really > deliver fast build times which are essential for fast iteration cycles and > high developer productivity. The reasons for this situation are manifold > but it is definitely affected by Flink's project growth, not always optimal > tests and the inflexibility that everything needs to be built. Hence, I > consider the reduction of build times crucial for the project's health and > future growth. > > Without necessarily voicing a strong preference for any of the presented > suggestions, I wanted to comment on each of them: > > 1. This sounds promising. Could the reason why we don't reuse JVMs date > back to the time when we still had a lot of static fields in Flink which > made it hard to reuse JVMs and the potentially mutated global state? > > 2. Building hand-crafted solutions around a build system in order to > compensate for its limitations which other build systems support out of the > box sounds like the not invented here syndrome to me. Reinventing the wheel > has historically proven to be usually not the best solution and it often > comes with a high maintenance price tag. Moreover, it would add just > another layer of complexity around our existing build system.
I think the > current state where we have the maven setup in pom files and for Travis > multiple bash scripts specializing the builds to make it fit the time limit > is already not very transparent/easy to understand. > > 3. I could see this work but it also requires a very good understanding of > Flink of every committer because the committer needs to know which tests > would be good to run additionally. > > 4. I would be against this option solely to decrease our build time. My > observation is that the community does not monitor the health of the cron > jobs well enough. In the past the cron jobs have been unstable for as long > as a complete release cycle. Moreover, I've seen that PRs were merged which > passed Travis but broke the cron jobs. Consequently, I fear that this > option would deteriorate Flink's stability. > > 5. I would rephrase this point into changing the build system. Gradle could > be one candidate but there are also other build systems out there like > Bazel. Changing the build system would indeed be a major endeavour but I > could see the long term benefits of such a change (similar to having a > consistent and enforced code style) in particular if the build system > supports the functionality which we would otherwise build & maintain on our > own. I think there would be ways to make the transition not as disruptive > as described. For example, one could keep the Maven build and the new build > side by side until one is confident enough that the new build produces the > same output as the Maven build. Maybe it would al
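To make the differential-build idea from the message above concrete, here is a minimal sketch. It is not the script from the linked PR and the module names are only examples; it shows how, given each module's direct dependencies, one can compute the transitive set of dependents of the changed modules, i.e. the set of modules that would need to be rebuilt and retested.
{code:java}
// Sketch of the differential-build idea (example module names, not the PR's
// actual script): given module -> direct dependencies, find every module that
// transitively depends on a changed module.
import java.util.*;

public class AffectedModulesSketch {

    static Set<String> affected(Map<String, Set<String>> deps, Set<String> changed) {
        // Invert the graph: module -> modules that directly depend on it.
        Map<String, Set<String>> dependents = new HashMap<>();
        deps.forEach((module, ds) -> ds.forEach(d ->
                dependents.computeIfAbsent(d, k -> new HashSet<>()).add(module)));

        // BFS from the changed modules over the inverted edges.
        Set<String> result = new HashSet<>(changed);
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            String m = queue.poll();
            for (String dependent : dependents.getOrDefault(m, Collections.emptySet())) {
                if (result.add(dependent)) {
                    queue.add(dependent);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = new HashMap<>();
        deps.put("flink-core-api", Collections.emptySet());
        deps.put("flink-core", Collections.singleton("flink-core-api"));
        deps.put("flink-connectors", Collections.singleton("flink-core-api"));
        deps.put("flink-runtime", Collections.singleton("flink-core"));

        // A change in flink-connectors only retests itself ...
        System.out.println(affected(deps, Collections.singleton("flink-connectors")));
        // ... a change in flink-core does not touch the connectors ...
        System.out.println(affected(deps, Collections.singleton("flink-core")));
        // ... while a change in flink-core-api still fans out to everything.
        System.out.println(affected(deps, Collections.singleton("flink-core-api")));
    }
}
{code}
The second example also illustrates the "abstraction barrier" point: as long as flink-connectors only depends on a thin flink-core-api, a change in flink-core itself no longer forces the connectors to be retested.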
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Thanks Kurt for checking that. The mentioned problem with table-examples is that, when working on FLINK-13558, I forgot to add dependency on flink-examples-table to flink-dist. So this module is not built if only the flink-dist with its dependencies is built (this happens in the release scripts: -pl flink-dist -am) I created FLINK-13737 to fix that. As those are only examples I wouldn't block the release on them. We might need to change the fixVersion of the mentioned FLINK-13558 not to confuse users. The proper fix we could include in 1.9.1. WDYT? Best, Dawid [1] https://issues.apache.org/jira/browse/FLINK-13737 On 15/08/2019 15:22, Kurt Young wrote: > After going through the licenses, I found 2 suspicions but not sure if they > are > valid or not. > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > included in > NOTICE-binary file (the one under the root directory) like other modules. > 2. flink-runtime-web distributed some JavaScript dependencies through source > codes, the licenses and NOTICE file were only updated inside the module of > flink-runtime-web, but not the NOTICE file and licenses directory which > under > the root directory. > > Another minor issue I just found is: > FLINK-13558 tries to include table examples to flink-dist, but I cannot > find it in > the binary distribution of RC2. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > >> Hi Gordon & Timo, >> >> Thanks for the feedback, and I agree with it. I will document this in the >> release notes. >> >> Best, >> Kurt >> >> >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai >> wrote: >> >>> Hi Kurt, >>> >>> With the same argument as before, given that it is mentioned in the >>> release >>> announcement that it is a preview feature, I would not block this release >>> because of it. >>> Nevertheless, it would be important to mention this explicitly in the >>> release notes [1]. >>> >>> Regards, >>> Gordon >>> >>> [1] https://github.com/apache/flink/pull/9438 >>> >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: >>> Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: > HI, > > We just find a serious bug around blink planner: > https://issues.apache.org/jira/browse/FLINK-13708 > When user reused the table environment instance, and call `execute` method > multiple times for > different sql, the later call will trigger the earlier ones to be > re-executed. > > It's a serious bug but seems we also have a work around, which is >>> never > reuse the table environment > object. I'm not sure if we should treat this one as blocker issue of 1.9.0. > What's your opinion? > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > >> +1 (non-binding) >> >> Jepsen test suite passed 10 times consecutively >> >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < >>> aljos...@apache.org> >> wrote: >> >>> +1 >>> >>> I did some testing on a Google Cloud Dataproc cluster (it gives you >>> a >>> managed YARN and Google Cloud Storage (GCS)): >>>- tried both YARN session mode and YARN per-job mode, also using >>> bin/flink list/cancel/etc. 
against a YARN session cluster >>>- ran examples that write to GCS, both with the native Hadoop >> FileSystem >>> and a custom “plugin” FileSystem >>>- ran stateful streaming jobs that use GCS as a checkpoint >>> backend >>>- tried running SQL programs on YARN using the SQL Cli: this >>> worked for >>> YARN session mode but not for YARN per-job mode. Looking at the >>> code I >>> don’t think per-job mode would work from seeing how it is >>> implemented. >> But >>> I think it’s an OK restriction to have for now >>>- in all the testing I had fine-grained recovery (region >>> failover) >>> enabled but I didn’t simulate any failures >>> On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. I have verified the >>> following: - verified the checksums and GPG files match the corresponding >>> release >>> files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and >>> [FLINK-13688], but both are not release blockers. Other than that, all tests are >>> passed. - ran all e2e tests which don't need download exte
Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer
Congratulations Andrey! On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske wrote: > Congrats Andrey! > > Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao : > > > Congratulations Andrey, well deserved! > > > > Best, > > Gary > > > > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > > > > > Congratulations Andrey! > > > > > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong > wrote: > > > > > >> Congratulations Andrey! > > >> > > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok > wrote: > > >> > > >> > Congratulations Andrey! > > >> > At 2019-08-14 21:26:37, "Till Rohrmann" > wrote: > > >> > >Hi everyone, > > >> > > > > >> > >I'm very happy to announce that Andrey Zagrebin accepted the offer > of > > >> the > > >> > >Flink PMC to become a committer of the Flink project. > > >> > > > > >> > >Andrey has been an active community member for more than 15 months. > > He > > >> has > > >> > >helped shaping numerous features such as State TTL, FRocksDB > release, > > >> > >Shuffle service abstraction, FLIP-1, result partition management > and > > >> > >various fixes/improvements. He's also frequently helping out on the > > >> > >user@f.a.o mailing lists. > > >> > > > > >> > >Congratulations Andrey! > > >> > > > > >> > >Best, Till > > >> > >(on behalf of the Flink PMC) > > >> > > > >> > > > > > >
[jira] [Created] (FLINK-13737) flink-dist should add provided dependency on flink-examples-table
Dawid Wysakowicz created FLINK-13737: Summary: flink-dist should add provided dependency on flink-examples-table Key: FLINK-13737 URL: https://issues.apache.org/jira/browse/FLINK-13737 Project: Flink Issue Type: Improvement Components: Examples Affects Versions: 1.9.0 Reporter: Dawid Wysakowicz Assignee: Dawid Wysakowicz Fix For: 1.10.0, 1.9.1 In FLINK-13558 we changed the `flink-dist/bin.xml` to also include flink-examples-table in the binary distribution. The flink-dist module, though, does not depend on flink-examples-table. If only the flink-dist module is built with its dependencies (this happens in the release scripts), the table examples are not built and thus not included in the distribution -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
After going through the licenses, I found 2 suspicions but not sure if they are valid or not. 1. flink-state-processing-api is packaged in to flink-dist jar, but not included in NOTICE-binary file (the one under the root directory) like other modules. 2. flink-runtime-web distributed some JavaScript dependencies through source codes, the licenses and NOTICE file were only updated inside the module of flink-runtime-web, but not the NOTICE file and licenses directory which under the root directory. Another minor issue I just found is: FLINK-13558 tries to include table examples to flink-dist, but I cannot find it in the binary distribution of RC2. Best, Kurt On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > Hi Gordon & Timo, > > Thanks for the feedback, and I agree with it. I will document this in the > release notes. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai > wrote: > >> Hi Kurt, >> >> With the same argument as before, given that it is mentioned in the >> release >> announcement that it is a preview feature, I would not block this release >> because of it. >> Nevertheless, it would be important to mention this explicitly in the >> release notes [1]. >> >> Regards, >> Gordon >> >> [1] https://github.com/apache/flink/pull/9438 >> >> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: >> >> > Hi Kurt, >> > >> > I agree that this is a serious bug. However, I would not block the >> > release because of this. As you said, there is a workaround and the >> > `execute()` works in the most common case of a single execution. We can >> > fix this in a minor release shortly after. >> > >> > What do others think? >> > >> > Regards, >> > Timo >> > >> > >> > Am 15.08.19 um 11:23 schrieb Kurt Young: >> > > HI, >> > > >> > > We just find a serious bug around blink planner: >> > > https://issues.apache.org/jira/browse/FLINK-13708 >> > > When user reused the table environment instance, and call `execute` >> > method >> > > multiple times for >> > > different sql, the later call will trigger the earlier ones to be >> > > re-executed. >> > > >> > > It's a serious bug but seems we also have a work around, which is >> never >> > > reuse the table environment >> > > object. I'm not sure if we should treat this one as blocker issue of >> > 1.9.0. >> > > >> > > What's your opinion? >> > > >> > > Best, >> > > Kurt >> > > >> > > >> > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: >> > > >> > >> +1 (non-binding) >> > >> >> > >> Jepsen test suite passed 10 times consecutively >> > >> >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < >> aljos...@apache.org> >> > >> wrote: >> > >> >> > >>> +1 >> > >>> >> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you >> a >> > >>> managed YARN and Google Cloud Storage (GCS)): >> > >>>- tried both YARN session mode and YARN per-job mode, also using >> > >>> bin/flink list/cancel/etc. against a YARN session cluster >> > >>>- ran examples that write to GCS, both with the native Hadoop >> > >> FileSystem >> > >>> and a custom “plugin” FileSystem >> > >>>- ran stateful streaming jobs that use GCS as a checkpoint >> backend >> > >>>- tried running SQL programs on YARN using the SQL Cli: this >> worked >> > for >> > >>> YARN session mode but not for YARN per-job mode. Looking at the >> code I >> > >>> don’t think per-job mode would work from seeing how it is >> implemented. 
>> > >> But >> > >>> I think it’s an OK restriction to have for now >> > >>>- in all the testing I had fine-grained recovery (region >> failover) >> > >>> enabled but I didn’t simulate any failures >> > >>> >> > On 14. Aug 2019, at 15:20, Kurt Young wrote: >> > >> > Hi, >> > >> > Thanks for preparing this release candidate. I have verified the >> > >>> following: >> > - verified the checksums and GPG files match the corresponding >> release >> > >>> files >> > - verified that the source archives do not contains any binaries >> > - build the source release with Scala 2.11 successfully. >> > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and >> > >>> [FLINK-13688], >> > but >> > both are not release blockers. Other than that, all tests are >> passed. >> > - ran all e2e tests which don't need download external packages >> (it's >> > >>> very >> > unstable >> > in China and almost impossible to download them), all passed. >> > - started local cluster, ran some examples. Met a small website >> > display >> > issue >> > [FLINK-13591], which is also not a release blocker. >> > >> > Although we have pushed some fixes around blink planner and hive >> > integration >> > after RC2, but consider these are both preview features, I'm lean >> to >> > be >> > >>> ok >> > to release >> > without these fixes. >> > >> > +1 from my side. (binding) >> > >> > Best, >> > Kurt >> > >> > >> > On Wed,
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi, regarding the LegacyTypeInformation, esp. for decimals: I don't have a clear answer yet, but I think it should not limit us. If possible, it should travel through the type inference and we only need some special cases at some locations, e.g. when computing the leastRestrictive. E.g. the logical type root is set correctly, which is required for family checking. I'm wondering when a decimal type with precision can occur. Usually, it should come from literals or a defined column. But I think it might also be ok that the flink-planner just receives a decimal with precision and treats it with Java semantics. Doing it on the planner side is the easiest option but also the one that could cause side-effects if the back-and-forth conversion of LegacyTypeConverter is not a 1:1 mapping anymore. I guess we will see the implications during implementation. All old Flink tests should still pass. Regards, Timo On 15.08.19 at 10:43, JingsongLee wrote: Hi @Timo Walther @Dawid Wysakowicz: Now, flink-planner has some legacy DataTypes, like: legacy decimal, legacy basic array type info... And if the new type inference infers a Decimal/VarChar with precision, the conversion will fail in TypeConversions. The better we do on DataType, the more problems we will have with the TypeInformation conversion, and the new TypeInference is largely precision related. What do you think? 1. Should TypeConversions support all data types and flink-planner support new types? 2. Or do a special conversion between flink-planner and type inference. (There are many problems with the conversion between TypeInformation and DataType, and I think we should solve them completely in 1.10.) Best, Jingsong Lee -- From: JingsongLee Send Time: Thursday, August 15, 2019 10:31 To: dev Subject: Re: [DISCUSS] FLIP-51: Rework of the Expression Design Hi Jark: I'll add a chapter to list the blink planner extended functions. Best, Jingsong Lee -- From: Jark Wu Send Time: Thursday, August 15, 2019 05:12 To: dev Subject: Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding the function behavior, shall we also include new functions from the blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc.)? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: Hi Jingsong, thanks for writing down this FLIP. Big +1 from my side to finally get rid of PlannerExpressions and have consistent and well-defined behavior for Table API and SQL updated to FLIP-37. We might need to discuss some of the behavior of particular functions but this should not affect the actual FLIP-51. Regards, Timo On 13.08.19 at 12:55, JingsongLee wrote: Hi everyone, We would like to start a discussion thread on "FLIP-51: Rework of the Expression Design" (Design doc: [1], FLIP: [2]), where we describe how to improve the new Java Expressions to work with type inference and convert expressions to the Calcite RexNode. This is a follow-up plan for FLIP-32 [3] and FLIP-37 [4]. This FLIP is mostly based on FLIP-37. This FLIP addresses several shortcomings of the current design: - New Expressions still use PlannerExpressions for type inference and conversion to RexNode. Flink-planner and blink-planner have a lot of repetitive code and logic. - Make the Table API and Calcite definitions consistent. - Reduce the complexity of function development. - More powerful functions for users. Key changes can be summarized as follows: - Improve the interface of FunctionDefinition.
- Introduce type inference for built-in functions. - Introduce ExpressionConverter to convert Expressions to Calcite RexNodes. - Remove repetitive code and logic in the planners. I also listed the type inference and behavior of all built-in functions [5], to verify that the interface is satisfied. After introducing type inference to the table-common module, the planners should have unified function behavior. This also gives the community the chance to quickly discuss the types and behavior of functions one last time before they are declared stable. Looking forward to your feedback. Thank you. [1] https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions [4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System [5] https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing Best, Jingsong Lee
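Since the discussion above touches on computing the leastRestrictive type for decimals with precision, here is a small worked sketch of the common SQL-style rule for combining two DECIMAL types. It is an illustration under assumed rules, not necessarily what the FLIP's type inference will implement, and the precision cap is an assumption.
{code:java}
// Illustration only (standard SQL-style rule, not necessarily the FLIP's
// actual inference): the least restrictive of two DECIMAL types keeps enough
// integer digits of both inputs and the larger of the two scales.
public class DecimalLeastRestrictiveSketch {

    static final int MAX_PRECISION = 38; // assumed upper bound

    static int[] leastRestrictive(int p1, int s1, int p2, int s2) {
        int scale = Math.max(s1, s2);
        int integerDigits = Math.max(p1 - s1, p2 - s2);
        int precision = Math.min(integerDigits + scale, MAX_PRECISION);
        return new int[] {precision, scale};
    }

    public static void main(String[] args) {
        // DECIMAL(10, 2) combined with DECIMAL(5, 4) -> DECIMAL(12, 4)
        int[] r = leastRestrictive(10, 2, 5, 4);
        System.out.println("DECIMAL(" + r[0] + ", " + r[1] + ")");
    }
}
{code}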
Re: Watermarks not propagated to WebUI?
I remember an issue regarding the watermark fetch request from the WebUI exceeding some HTTP size limit, since it tries to fetch all watermarks at once, and the format of this request isn't exactly efficient. Querying metrics for individual operators still works since the request is small enough. Not sure whether we ever fixed that. On 15/08/2019 12:01, Jan Lukavský wrote: Hi, Thomas, thanks for confirming this. I have noticed, that in 1.9 the WebUI has been reworked a lot, does anyone know if this is still an issue? I currently cannot easily try 1.9, so I cannot confirm or disprove that. Jan On 8/14/19 6:25 PM, Thomas Weise wrote: I have also noticed this issue (Flink 1.5, Flink 1.8), and it appears with higher parallelism. This can be confusing to the user when watermarks actually work and can be observed using the metrics. On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský wrote: Hi, is it possible, that watermarks are sometimes not propagated to WebUI, although they are internally moving as normal? I see in WebUI every operator showing "No Watermark", but outputs seem to be propagated to sink (and there are watermark sensitive operations involved - e.g. reductions on fixed windows without early emitting). More strangely, this happens when I increase parallelism above some threshold. If I use parallelism of N, watermarks are shown, when I increase it above some number (seems not to be exactly deterministic), watermarks seems to disappear. I'm using Flink 1.8.1. Did anyone experience something like this before? Jan
Re: flink 1.9 DDL nested json derived
Hi Shengnan, Yes, Flink 1.9 supports deriving the schema of nested JSON. You should declare the ROW type with the nested schema explicitly. I tested a similar DDL against 1.9.0 RC2 and it worked well. CREATE TABLE kafka_json_source ( rowtime VARCHAR, user_name VARCHAR, event ROW<message_type VARCHAR, message VARCHAR> ) WITH ( 'connector.type' = 'kafka', 'connector.version' = 'universal', 'connector.topic' = 'test-json', 'connector.startup-mode' = 'earliest-offset', 'connector.properties.0.key' = 'zookeeper.connect', 'connector.properties.0.value' = 'localhost:2181', 'connector.properties.1.key' = 'bootstrap.servers', 'connector.properties.1.value' = 'localhost:9092', 'update-mode' = 'append', 'format.type' = 'json', 'format.derive-schema' = 'true' ); The kafka message is {"rowtime": "2018-03-12T08:00:00Z", "user_name": "Alice", "event": { "message_type": "WARNING", "message": "This is a warning."}} Thanks, Jark On Thu, 15 Aug 2019 at 14:12, Shengnan YU wrote: > > Hi guys > I am trying the DDL feature in branch 1.9-release. I am stuck on > creating a table from kafka with nested json format. Is it possible to > specify a "Row" type of columns to derive the nested json schema? > > String sql = "create table kafka_stream(\n" + > " a varchar, \n" + > " b varchar,\n" + > " c int,\n" + > " inner_json row\n" + > ") with (\n" + > " 'connector.type' ='kafka',\n" + > " 'connector.version' = '0.11',\n" + > " 'update-mode' = 'append', \n" + > " 'connector.topic' = 'test',\n" + > " 'connector.properties.0.key' = 'bootstrap.servers',\n" + > " 'connector.properties.0.value' = 'localhost:9092',\n" + > " 'format.type' = 'json', \n" + > " 'format.derive-schema' = 'true'\n" + > ")\n"; > > Thank you very much! >
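As a follow-up illustration, here is a sketch of how the table declared by the DDL above could be consumed from the Java Table API in 1.9, with the nested fields of the ROW type addressed via dot syntax. The DDL string is elided (it is the CREATE TABLE statement shown above), and running it of course requires a Kafka broker with the referenced topic.
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class NestedJsonDdlExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

        // The CREATE TABLE kafka_json_source statement shown above.
        String ddl = "...";
        tEnv.sqlUpdate(ddl);

        // Nested fields of the derived ROW type are addressed with dot syntax.
        Table result = tEnv.sqlQuery(
                "SELECT user_name, event.message_type, event.message FROM kafka_json_source");
        tEnv.toAppendStream(result, Row.class).print();

        env.execute("nested json example");
    }
}
{code}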
[jira] [Created] (FLINK-13736) Support count window with blink planner in batch mode
Kurt Young created FLINK-13736: -- Summary: Support count window with blink planner in batch mode Key: FLINK-13736 URL: https://issues.apache.org/jira/browse/FLINK-13736 Project: Flink Issue Type: New Feature Components: Table SQL / Planner, Table SQL / Runtime Reporter: Kurt Young -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13735) Support session window with blink planner in batch mode
Kurt Young created FLINK-13735: -- Summary: Support session window with blink planner in batch mode Key: FLINK-13735 URL: https://issues.apache.org/jira/browse/FLINK-13735 Project: Flink Issue Type: New Feature Components: Table SQL / Planner Reporter: Kurt Young -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Gordon & Timo, Thanks for the feedback, and I agree with it. I will document this in the release notes. Best, Kurt On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai wrote: > Hi Kurt, > > With the same argument as before, given that it is mentioned in the release > announcement that it is a preview feature, I would not block this release > because of it. > Nevertheless, it would be important to mention this explicitly in the > release notes [1]. > > Regards, > Gordon > > [1] https://github.com/apache/flink/pull/9438 > > On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: > > > Hi Kurt, > > > > I agree that this is a serious bug. However, I would not block the > > release because of this. As you said, there is a workaround and the > > `execute()` works in the most common case of a single execution. We can > > fix this in a minor release shortly after. > > > > What do others think? > > > > Regards, > > Timo > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > HI, > > > > > > We just find a serious bug around blink planner: > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > When user reused the table environment instance, and call `execute` > > method > > > multiple times for > > > different sql, the later call will trigger the earlier ones to be > > > re-executed. > > > > > > It's a serious bug but seems we also have a work around, which is never > > > reuse the table environment > > > object. I'm not sure if we should treat this one as blocker issue of > > 1.9.0. > > > > > > What's your opinion? > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > > > >> +1 (non-binding) > > >> > > >> Jepsen test suite passed 10 times consecutively > > >> > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > > > >> wrote: > > >> > > >>> +1 > > >>> > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you a > > >>> managed YARN and Google Cloud Storage (GCS)): > > >>>- tried both YARN session mode and YARN per-job mode, also using > > >>> bin/flink list/cancel/etc. against a YARN session cluster > > >>>- ran examples that write to GCS, both with the native Hadoop > > >> FileSystem > > >>> and a custom “plugin” FileSystem > > >>>- ran stateful streaming jobs that use GCS as a checkpoint backend > > >>>- tried running SQL programs on YARN using the SQL Cli: this > worked > > for > > >>> YARN session mode but not for YARN per-job mode. Looking at the code > I > > >>> don’t think per-job mode would work from seeing how it is > implemented. > > >> But > > >>> I think it’s an OK restriction to have for now > > >>>- in all the testing I had fine-grained recovery (region failover) > > >>> enabled but I didn’t simulate any failures > > >>> > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > Hi, > > > > Thanks for preparing this release candidate. I have verified the > > >>> following: > > - verified the checksums and GPG files match the corresponding > release > > >>> files > > - verified that the source archives do not contains any binaries > > - build the source release with Scala 2.11 successfully. > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > >>> [FLINK-13688], > > but > > both are not release blockers. Other than that, all tests are > passed. > > - ran all e2e tests which don't need download external packages > (it's > > >>> very > > unstable > > in China and almost impossible to download them), all passed. > > - started local cluster, ran some examples. 
Met a small website > > display > > issue > > [FLINK-13591], which is also not a release blocker. > > > > Although we have pushed some fixes around blink planner and hive > > integration > > after RC2, but consider these are both preview features, I'm lean to > > be > > >>> ok > > to release > > without these fixes. > > > > +1 from my side. (binding) > > > > Best, > > Kurt > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > Hi Gordon, > > > > > > I have verified the following things: > > > > > > - build the source release with Scala 2.12 and Scala 2.11 > > successfully > > > - checked/verified signatures and hashes > > > - checked that all POM files point to the same version > > > - ran some flink table related end-to-end tests locally and > succeeded > > > (except TPC-H e2e failed which is reported in FLINK-13704) > > > - started cluster for both Scala 2.11 and 2.12, ran examples, > > verified > > >>> web > > > ui and log output, nothing unexpected > > > - started cluster, ran a SQL query to temporal join with kafka > source > > >>> and > > > mysql jdbc table, and write results to kafka again. Using DDL to > > >> create > > >>> the > > > source and sinks. looks good. > > > - reviewed the release PR
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Kurt, With the same argument as before, given that it is mentioned in the release announcement that it is a preview feature, I would not block this release because of it. Nevertheless, it would be important to mention this explicitly in the release notes [1]. Regards, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: > Hi Kurt, > > I agree that this is a serious bug. However, I would not block the > release because of this. As you said, there is a workaround and the > `execute()` works in the most common case of a single execution. We can > fix this in a minor release shortly after. > > What do others think? > > Regards, > Timo > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > HI, > > > > We just find a serious bug around blink planner: > > https://issues.apache.org/jira/browse/FLINK-13708 > > When user reused the table environment instance, and call `execute` > method > > multiple times for > > different sql, the later call will trigger the earlier ones to be > > re-executed. > > > > It's a serious bug but seems we also have a work around, which is never > > reuse the table environment > > object. I'm not sure if we should treat this one as blocker issue of > 1.9.0. > > > > What's your opinion? > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > >> +1 (non-binding) > >> > >> Jepsen test suite passed 10 times consecutively > >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > >> wrote: > >> > >>> +1 > >>> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you a > >>> managed YARN and Google Cloud Storage (GCS)): > >>>- tried both YARN session mode and YARN per-job mode, also using > >>> bin/flink list/cancel/etc. against a YARN session cluster > >>>- ran examples that write to GCS, both with the native Hadoop > >> FileSystem > >>> and a custom “plugin” FileSystem > >>>- ran stateful streaming jobs that use GCS as a checkpoint backend > >>>- tried running SQL programs on YARN using the SQL Cli: this worked > for > >>> YARN session mode but not for YARN per-job mode. Looking at the code I > >>> don’t think per-job mode would work from seeing how it is implemented. > >> But > >>> I think it’s an OK restriction to have for now > >>>- in all the testing I had fine-grained recovery (region failover) > >>> enabled but I didn’t simulate any failures > >>> > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > Hi, > > Thanks for preparing this release candidate. I have verified the > >>> following: > - verified the checksums and GPG files match the corresponding release > >>> files > - verified that the source archives do not contains any binaries > - build the source release with Scala 2.11 successfully. > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > >>> [FLINK-13688], > but > both are not release blockers. Other than that, all tests are passed. > - ran all e2e tests which don't need download external packages (it's > >>> very > unstable > in China and almost impossible to download them), all passed. > - started local cluster, ran some examples. Met a small website > display > issue > [FLINK-13591], which is also not a release blocker. > > Although we have pushed some fixes around blink planner and hive > integration > after RC2, but consider these are both preview features, I'm lean to > be > >>> ok > to release > without these fixes. > > +1 from my side. 
(binding) > > Best, > Kurt > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > Hi Gordon, > > > > I have verified the following things: > > > > - build the source release with Scala 2.12 and Scala 2.11 > successfully > > - checked/verified signatures and hashes > > - checked that all POM files point to the same version > > - ran some flink table related end-to-end tests locally and succeeded > > (except TPC-H e2e failed which is reported in FLINK-13704) > > - started cluster for both Scala 2.11 and 2.12, ran examples, > verified > >>> web > > ui and log output, nothing unexpected > > - started cluster, ran a SQL query to temporal join with kafka source > >>> and > > mysql jdbc table, and write results to kafka again. Using DDL to > >> create > >>> the > > source and sinks. looks good. > > - reviewed the release PR > > > > As FLINK-13704 is not recognized as blocker issue, so +1 from my side > > (non-binding). > > > > On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > >>> wrote: > >> Hi Richard, > >> > >> although I can see that it would be handy for users who have PubSub > >> set > > up, > >> I would rather not include examples which require an external > >>> dependency > >> into the Flink distribution. I think examples should be > >> se
Re: Watermarks not propagated to WebUI?
Hi, Thomas, thanks for confirming this. I have noticed, that in 1.9 the WebUI has been reworked a lot, does anyone know if this is still an issue? I currently cannot easily try 1.9, so I cannot confirm or disprove that. Jan On 8/14/19 6:25 PM, Thomas Weise wrote: I have also noticed this issue (Flink 1.5, Flink 1.8), and it appears with higher parallelism. This can be confusing to the user when watermarks actually work and can be observed using the metrics. On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský wrote: Hi, is it possible, that watermarks are sometimes not propagated to WebUI, although they are internally moving as normal? I see in WebUI every operator showing "No Watermark", but outputs seem to be propagated to sink (and there are watermark sensitive operations involved - e.g. reductions on fixed windows without early emitting). More strangely, this happens when I increase parallelism above some threshold. If I use parallelism of N, watermarks are shown, when I increase it above some number (seems not to be exactly deterministic), watermarks seems to disappear. I'm using Flink 1.8.1. Did anyone experience something like this before? Jan
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
+1 (non-binding) Tested in AWS EMR Yarn: 1 master and 4 worker nodes (m5.xlarge: 4 vCore, 16 GiB). EMR runs only on Java 8. Fine-grained recovery is enabled by default. Modified E2E test scripts can be found here (asserting output): https://github.com/azagrebin/flink/commits/FLINK-13597 Batch SQL: - S3(a) filesystem over HADOOP works out-of-the-box (already on AWS class path) and also if put in plugins Streaming SQL: - Hadoop output (s3 does not support recoverable writers) On Thu, Aug 15, 2019 at 11:24 AM Kurt Young wrote: > HI, > > We just find a serious bug around blink planner: > https://issues.apache.org/jira/browse/FLINK-13708 > When user reused the table environment instance, and call `execute` method > multiple times for > different sql, the later call will trigger the earlier ones to be > re-executed. > > It's a serious bug but seems we also have a work around, which is never > reuse the table environment > object. I'm not sure if we should treat this one as blocker issue of 1.9.0. > > What's your opinion? > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > +1 (non-binding) > > > > Jepsen test suite passed 10 times consecutively > > > > On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > > wrote: > > > > > +1 > > > > > > I did some testing on a Google Cloud Dataproc cluster (it gives you a > > > managed YARN and Google Cloud Storage (GCS)): > > > - tried both YARN session mode and YARN per-job mode, also using > > > bin/flink list/cancel/etc. against a YARN session cluster > > > - ran examples that write to GCS, both with the native Hadoop > > FileSystem > > > and a custom “plugin” FileSystem > > > - ran stateful streaming jobs that use GCS as a checkpoint backend > > > - tried running SQL programs on YARN using the SQL Cli: this worked > for > > > YARN session mode but not for YARN per-job mode. Looking at the code I > > > don’t think per-job mode would work from seeing how it is implemented. > > But > > > I think it’s an OK restriction to have for now > > > - in all the testing I had fine-grained recovery (region failover) > > > enabled but I didn’t simulate any failures > > > > > > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > > > > > Hi, > > > > > > > > Thanks for preparing this release candidate. I have verified the > > > following: > > > > > > > > - verified the checksums and GPG files match the corresponding > release > > > files > > > > - verified that the source archives do not contains any binaries > > > > - build the source release with Scala 2.11 successfully. > > > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > > [FLINK-13688], > > > > but > > > > both are not release blockers. Other than that, all tests are passed. > > > > - ran all e2e tests which don't need download external packages (it's > > > very > > > > unstable > > > > in China and almost impossible to download them), all passed. > > > > - started local cluster, ran some examples. Met a small website > display > > > > issue > > > > [FLINK-13591], which is also not a release blocker. > > > > > > > > Although we have pushed some fixes around blink planner and hive > > > > integration > > > > after RC2, but consider these are both preview features, I'm lean to > be > > > ok > > > > to release > > > > without these fixes. > > > > > > > > +1 from my side. 
(binding) > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > > > >> Hi Gordon, > > > >> > > > >> I have verified the following things: > > > >> > > > >> - build the source release with Scala 2.12 and Scala 2.11 > successfully > > > >> - checked/verified signatures and hashes > > > >> - checked that all POM files point to the same version > > > >> - ran some flink table related end-to-end tests locally and > succeeded > > > >> (except TPC-H e2e failed which is reported in FLINK-13704) > > > >> - started cluster for both Scala 2.11 and 2.12, ran examples, > verified > > > web > > > >> ui and log output, nothing unexpected > > > >> - started cluster, ran a SQL query to temporal join with kafka > source > > > and > > > >> mysql jdbc table, and write results to kafka again. Using DDL to > > create > > > the > > > >> source and sinks. looks good. > > > >> - reviewed the release PR > > > >> > > > >> As FLINK-13704 is not recognized as blocker issue, so +1 from my > side > > > >> (non-binding). > > > >> > > > >> On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > > > wrote: > > > >> > > > >>> Hi Richard, > > > >>> > > > >>> although I can see that it would be handy for users who have PubSub > > set > > > >> up, > > > >>> I would rather not include examples which require an external > > > dependency > > > >>> into the Flink distribution. I think examples should be > > self-contained. > > > >> My > > > >>> concern is that we would bloat the distribution for many users at > the > > > >>> benefit of a few. Instead, I think it would be better to mak
Re: How to load udf jars in flink program
Hi Jiangang, Does "flink run -j jarpath ..." work for you? If that jar id deployed to the same path on each worker machine, you can try "flink run -C classpath ..." as well. Thanks, Zhu Zhu 刘建刚 于2019年8月15日周四 下午5:31写道: > We are using per-job to load udf jar when start job. Our jar file is > in another path but not flink's lib path. In the main function, we use > classLoader to load the jar file by the jar path. But it reports the > following error when job starts running. > If the jar file is in lib, everything is ok. But our udf jar file is > managed in a special path. How can I load udf jars in flink program with > only giving the jar path? > > org.apache.flink.api.common.InvalidProgramException: Table program cannot be > compiled. This is a bug. Please file an issue. > at > org.apache.flink.table.codegen.Compiler$class.compile(Compiler.scala:36) > at > org.apache.flink.table.runtime.CRowProcessRunner.compile(CRowProcessRunner.scala:35) > at > org.apache.flink.table.runtime.CRowProcessRunner.open(CRowProcessRunner.scala:49) > at > org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) > at > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102) > at > org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:723) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.codehaus.commons.compiler.CompileException: Line 5, Column 1: > Cannot determine simple type name "com" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11877) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6758) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6519) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6498) > at org.codehaus.janino.UnitCompiler.access$14000(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6405) > at > org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6400) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3983) > at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6400) > at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6393) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3982) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) > at org.codehaus.janino.UnitCompiler.access$1300(UnitCompiler.java:218) > at org.codehaus.janino.UnitCompiler$25.getType(UnitCompiler.java:8206) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6798) > at org.codehaus.janino.UnitCompiler.access$14500(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6423) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6418) > at org.codehaus.janino.Java$FieldAccess.accept(Java.java:4365) > at > 
org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) > at > org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) > at > org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) > at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6780) > at org.codehaus.janino.UnitCompiler.access$14300(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6421) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6418) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4279) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) >
How to load udf jars in flink program
We are using per-job to load udf jar when start job. Our jar file is in another path but not flink's lib path. In the main function, we use classLoader to load the jar file by the jar path. But it reports the following error when job starts running. If the jar file is in lib, everything is ok. But our udf jar file is managed in a special path. How can I load udf jars in flink program with only giving the jar path? org.apache.flink.api.common.InvalidProgramException: Table program cannot be compiled. This is a bug. Please file an issue. at org.apache.flink.table.codegen.Compiler$class.compile(Compiler.scala:36) at org.apache.flink.table.runtime.CRowProcessRunner.compile(CRowProcessRunner.scala:35) at org.apache.flink.table.runtime.CRowProcessRunner.open(CRowProcessRunner.scala:49) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102) at org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56) at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:723) at java.lang.Thread.run(Thread.java:745) Caused by: org.codehaus.commons.compiler.CompileException: Line 5, Column 1: Cannot determine simple type name "com" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11877) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6758) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6519) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6498) at org.codehaus.janino.UnitCompiler.access$14000(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6405) at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6400) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3983) at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6400) at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6393) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3982) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) at org.codehaus.janino.UnitCompiler.access$1300(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$25.getType(UnitCompiler.java:8206) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6798) at org.codehaus.janino.UnitCompiler.access$14500(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6423) at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6418) at org.codehaus.janino.Java$FieldAccess.accept(Java.java:4365) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) at 
org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6780) at org.codehaus.janino.UnitCompiler.access$14300(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6421) at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6418) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4279) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393
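Below is a minimal, hypothetical sketch of the pattern described in this question (loading the UDF jar from a custom path inside main()) together with the classpath fix suggested in Zhu Zhu's reply; the jar path, UDF class name, and registered function name are made up for illustration, and the Table API calls follow the 1.9 interfaces.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class UdfJarLoadingSketch {

    public static void main(String[] args) throws Exception {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Load the UDF jar from a custom path in main(), as described in the question above.
        URL udfJarUrl = new File("/data/udfs/my-udfs.jar").toURI().toURL();
        URLClassLoader udfClassLoader = new URLClassLoader(
                new URL[] { udfJarUrl }, Thread.currentThread().getContextClassLoader());
        Class<?> udfClass = Class.forName("com.example.MyUpperUdf", true, udfClassLoader);
        ScalarFunction udf = (ScalarFunction) udfClass.getDeclaredConstructor().newInstance();

        // Registration succeeds on the client, but the generated code is compiled on the
        // TaskManagers against *their* user-code class loader. If the jar only exists on the
        // client, Janino cannot resolve "com.example..." there, which is the
        // "Cannot determine simple type name" error in the stack trace above. Shipping the jar
        // to the user classpath, e.g. "flink run -C file:///same/path/on/every/node/my-udfs.jar ...",
        // or dropping it into lib/, makes it visible to that class loader.
        tableEnv.registerFunction("my_upper", udf);
    }
}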
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: +1 (non-binding) Jepsen test suite passed 10 times consecutively On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek wrote: +1 I did some testing on a Google Cloud Dataproc cluster (it gives you a managed YARN and Google Cloud Storage (GCS)): - tried both YARN session mode and YARN per-job mode, also using bin/flink list/cancel/etc. against a YARN session cluster - ran examples that write to GCS, both with the native Hadoop FileSystem and a custom “plugin” FileSystem - ran stateful streaming jobs that use GCS as a checkpoint backend - tried running SQL programs on YARN using the SQL Cli: this worked for YARN session mode but not for YARN per-job mode. Looking at the code I don’t think per-job mode would work from seeing how it is implemented. But I think it’s an OK restriction to have for now - in all the testing I had fine-grained recovery (region failover) enabled but I didn’t simulate any failures On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. I have verified the following: - verified the checksums and GPG files match the corresponding release files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and [FLINK-13688], but both are not release blockers. Other than that, all tests are passed. - ran all e2e tests which don't need download external packages (it's very unstable in China and almost impossible to download them), all passed. - started local cluster, ran some examples. Met a small website display issue [FLINK-13591], which is also not a release blocker. Although we have pushed some fixes around blink planner and hive integration after RC2, but consider these are both preview features, I'm lean to be ok to release without these fixes. +1 from my side. (binding) Best, Kurt On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: Hi Gordon, I have verified the following things: - build the source release with Scala 2.12 and Scala 2.11 successfully - checked/verified signatures and hashes - checked that all POM files point to the same version - ran some flink table related end-to-end tests locally and succeeded (except TPC-H e2e failed which is reported in FLINK-13704) - started cluster for both Scala 2.11 and 2.12, ran examples, verified web ui and log output, nothing unexpected - started cluster, ran a SQL query to temporal join with kafka source and mysql jdbc table, and write results to kafka again. Using DDL to create the source and sinks. looks good. - reviewed the release PR As FLINK-13704 is not recognized as blocker issue, so +1 from my side (non-binding). 
On Tue, 13 Aug 2019 at 17:07, Till Rohrmann wrote: Hi Richard, although I can see that it would be handy for users who have PubSub set up, I would rather not include examples which require an external dependency into the Flink distribution. I think examples should be self-contained. My concern is that we would bloat the distribution for many users at the benefit of a few. Instead, I think it would be better to make these examples available differently, maybe through Flink's ecosystem website or maybe a new examples section in Flink's documentation. Cheers, Till On Tue, Aug 13, 2019 at 9:43 AM Jark Wu wrote: Hi Till, After thinking about we can use VARCHAR as an alternative of timestamp/time/date. I'm fine with not recognize it as a blocker issue. We can fix it into 1.9.1. Thanks, Jark On Tue, 13 Aug 2019 at 15:10, Richard Deurwaarder wrote: Hello all, I noticed the PubSub example jar is not included in the examples/ dir of flink-dist. I've created https://issues.apache.org/jira/browse/FLINK-13700 + https://github.com/apache/flink/pull/9424/files to fix this. I will leave it up to you to decide if we want to add this to 1.9.0. Regards, Richard On Tue, Aug 13, 2019 at 9:04 AM Till Rohrmann < trohrm...@apache.org> wrote: Hi Jark, thanks for reporting this issue. Could this be a documented limitation of Blink'
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > +1 (non-binding) > > Jepsen test suite passed 10 times consecutively > > On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > wrote: > > > +1 > > > > I did some testing on a Google Cloud Dataproc cluster (it gives you a > > managed YARN and Google Cloud Storage (GCS)): > > - tried both YARN session mode and YARN per-job mode, also using > > bin/flink list/cancel/etc. against a YARN session cluster > > - ran examples that write to GCS, both with the native Hadoop > FileSystem > > and a custom “plugin” FileSystem > > - ran stateful streaming jobs that use GCS as a checkpoint backend > > - tried running SQL programs on YARN using the SQL Cli: this worked for > > YARN session mode but not for YARN per-job mode. Looking at the code I > > don’t think per-job mode would work from seeing how it is implemented. > But > > I think it’s an OK restriction to have for now > > - in all the testing I had fine-grained recovery (region failover) > > enabled but I didn’t simulate any failures > > > > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > > > Hi, > > > > > > Thanks for preparing this release candidate. I have verified the > > following: > > > > > > - verified the checksums and GPG files match the corresponding release > > files > > > - verified that the source archives do not contains any binaries > > > - build the source release with Scala 2.11 successfully. > > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > [FLINK-13688], > > > but > > > both are not release blockers. Other than that, all tests are passed. > > > - ran all e2e tests which don't need download external packages (it's > > very > > > unstable > > > in China and almost impossible to download them), all passed. > > > - started local cluster, ran some examples. Met a small website display > > > issue > > > [FLINK-13591], which is also not a release blocker. > > > > > > Although we have pushed some fixes around blink planner and hive > > > integration > > > after RC2, but consider these are both preview features, I'm lean to be > > ok > > > to release > > > without these fixes. > > > > > > +1 from my side. (binding) > > > > > > Best, > > > Kurt > > > > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > >> Hi Gordon, > > >> > > >> I have verified the following things: > > >> > > >> - build the source release with Scala 2.12 and Scala 2.11 successfully > > >> - checked/verified signatures and hashes > > >> - checked that all POM files point to the same version > > >> - ran some flink table related end-to-end tests locally and succeeded > > >> (except TPC-H e2e failed which is reported in FLINK-13704) > > >> - started cluster for both Scala 2.11 and 2.12, ran examples, verified > > web > > >> ui and log output, nothing unexpected > > >> - started cluster, ran a SQL query to temporal join with kafka source > > and > > >> mysql jdbc table, and write results to kafka again. Using DDL to > create > > the > > >> source and sinks. looks good. 
> > >> - reviewed the release PR > > >> > > >> As FLINK-13704 is not recognized as blocker issue, so +1 from my side > > >> (non-binding). > > >> > > >> On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > > wrote: > > >> > > >>> Hi Richard, > > >>> > > >>> although I can see that it would be handy for users who have PubSub > set > > >> up, > > >>> I would rather not include examples which require an external > > dependency > > >>> into the Flink distribution. I think examples should be > self-contained. > > >> My > > >>> concern is that we would bloat the distribution for many users at the > > >>> benefit of a few. Instead, I think it would be better to make these > > >>> examples available differently, maybe through Flink's ecosystem > website > > >> or > > >>> maybe a new examples section in Flink's documentation. > > >>> > > >>> Cheers, > > >>> Till > > >>> > > >>> On Tue, Aug 13, 2019 at 9:43 AM Jark Wu wrote: > > >>> > > Hi Till, > > > > After thinking about we can use VARCHAR as an alternative of > > timestamp/time/date. > > I'm fine with not recognize it as a blocker issue. > > We can fix it into 1.9.1. > > > > > > Thanks, > > Jark > > > > > > On Tue, 13 Aug 2019 at 15:10, Richard Deurwaarder > > >>> wrote: > > > > > Hello all, > > > > > > I noticed the PubSub example jar is not included in the examples/ > dir > > >>> of > > > flink-dist. I've created > >
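For readers hitting FLINK-13708, a hedged sketch of the workaround described above (never reuse the TableEnvironment instance); the source/sink tables, SQL statements, and job names are invented for illustration and would need to be registered first.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class Flink13708WorkaroundSketch {

    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();

        // Affected pattern: reusing one TableEnvironment and calling execute() several times
        // causes the later call to re-trigger the earlier INSERTs as well.
        // Workaround: create a fresh TableEnvironment for each execution.
        TableEnvironment tEnvA = TableEnvironment.create(settings);
        tEnvA.sqlUpdate("INSERT INTO sink_a SELECT * FROM source_a");
        tEnvA.execute("job-a");

        TableEnvironment tEnvB = TableEnvironment.create(settings); // new instance, not reused
        tEnvB.sqlUpdate("INSERT INTO sink_b SELECT * FROM source_b");
        tEnvB.execute("job-b"); // runs only the second INSERT
    }
}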
Re: [DISCUSS] Reducing build times
Thanks for starting this discussion Chesnay. I think it has become obvious to the Flink community that with the existing build setup we cannot really deliver fast build times which are essential for fast iteration cycles and high developer productivity. The reasons for this situation are manifold but it is definitely affected by Flink's project growth, not always optimal tests and the inflexibility that everything needs to be built. Hence, I consider the reduction of build times crucial for the project's health and future growth. Without necessarily voicing a strong preference for any of the presented suggestions, I wanted to comment on each of them: 1. This sounds promising. Could the reason why we don't reuse JVMs date back to the time when we still had a lot of static fields in Flink which made it hard to reuse JVMs and the potentially mutated global state? 2. Building hand-crafted solutions around a build system in order to compensate for its limitations which other build systems support out of the box sounds like the not invented here syndrome to me. Reinventing the wheel has historically proven to be usually not the best solution and it often comes with a high maintenance price tag. Moreover, it would add just another layer of complexity around our existing build system. I think the current state where we have the maven setup in pom files and for Travis multiple bash scripts specializing the builds to make it fit the time limit is already not very transparent/easy to understand. 3. I could see this work but it also requires a very good understanding of Flink of every committer because the committer needs to know which tests would be good to run additionally. 4. I would be against this option solely to decrease our build time. My observation is that the community does not monitor the health of the cron jobs well enough. In the past the cron jobs have been unstable for as long as a complete release cycle. Moreover, I've seen that PRs were merged which passed Travis but broke the cron jobs. Consequently, I fear that this option would deteriorate Flink's stability. 5. I would rephrase this point into changing the build system. Gradle could be one candidate but there are also other build systems out there like Bazel. Changing the build system would indeed be a major endeavour but I could see the long term benefits of such a change (similar to having a consistent and enforced code style) in particular if the build system supports the functionality which we would otherwise build & maintain on our own. I think there would be ways to make the transition not as disruptive as described. For example, one could keep the Maven build and the new build side by side until one is confident enough that the new build produces the same output as the Maven build. Maybe it would also be possible to migrate individual modules starting from the leaves. However, I admit that changing the build system will affect every Flink developer because she needs to learn & understand it. 6. I would like to learn about other people's experience with different CI systems. Travis worked okish for Flink so far but we see sometimes problems with its caching mechanism as Chesnay stated. I think that this topic is actually orthogonal to the other suggestions. My gut feeling is that not a single suggestion will be our solution but a combination of them. Cheers, Till On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu wrote: > Thanks Chesnay for bringing up this discussion and sharing those thoughts > to speed up the building process. 
> > I'd +1 for option 2 and 3. > > We can benefits a lot from Option 2. Developing table, connectors, > libraries, docs modules would result in much fewer tests(1/3 to 1/tens) to > run. > PRs for those modules take up more than half of all the PRs in my > observation. > > Option 3 can be a supplementary to option 2 that if the PR is modifying > fundamental modules like flink-core or flink-runtime. > It can even be a switch of the tests scope(basic/full) of a PR, so that > committers do not need to trigger it multiple times. > With it we can postpone the testing of IT cases or connectors before the PR > reaches a stable state. > > Thanks, > Zhu Zhu > > Chesnay Schepler 于2019年8月15日周四 下午3:38写道: > > > Hello everyone, > > > > improving our build times is a hot topic at the moment so let's discuss > > the different ways how they could be reduced. > > > > > > Current state: > > > > First up, let's look at some numbers: > > > > 1 full build currently consumes 5h of build time total ("total time"), > > and in the ideal case takes about 1h20m ("run time") to complete from > > start to finish. The run time may fluctuate of course depending on the > > current Travis load. This applies both to builds on the Apache and > > flink-ci Travis. > > > > At the time of writing, the current queue time for PR jobs (reminder: > > running on flink-ci) is about 30 minutes (which basically means that we > > are processing builds at
Re: [DISCUSS] Reducing build times
Thanks Chesnay for bringing up this discussion and sharing those thoughts to speed up the building process. I'd +1 for option 2 and 3. We can benefits a lot from Option 2. Developing table, connectors, libraries, docs modules would result in much fewer tests(1/3 to 1/tens) to run. PRs for those modules take up more than half of all the PRs in my observation. Option 3 can be a supplementary to option 2 that if the PR is modifying fundamental modules like flink-core or flink-runtime. It can even be a switch of the tests scope(basic/full) of a PR, so that committers do not need to trigger it multiple times. With it we can postpone the testing of IT cases or connectors before the PR reaches a stable state. Thanks, Zhu Zhu Chesnay Schepler 于2019年8月15日周四 下午3:38写道: > Hello everyone, > > improving our build times is a hot topic at the moment so let's discuss > the different ways how they could be reduced. > > > Current state: > > First up, let's look at some numbers: > > 1 full build currently consumes 5h of build time total ("total time"), > and in the ideal case takes about 1h20m ("run time") to complete from > start to finish. The run time may fluctuate of course depending on the > current Travis load. This applies both to builds on the Apache and > flink-ci Travis. > > At the time of writing, the current queue time for PR jobs (reminder: > running on flink-ci) is about 30 minutes (which basically means that we > are processing builds at the rate that they come in), however we are in > an admittedly quiet period right now. > 2 weeks ago the queue times on flink-ci peaked at around 5-6h as > everyone was scrambling to get their changes merged in time for the > feature freeze. > > (Note: Recently optimizations where added to ci-bot where pending builds > are canceled if a new commit was pushed to the PR or the PR was closed, > which should prove especially useful during the rush hours we see before > feature-freezes.) > > > Past approaches > > Over the years we have done rather few things to improve this situation > (hence our current predicament). > > Beyond the sporadic speedup of some tests, the only notable reduction in > total build times was the introduction of cron jobs, which consolidated > the per-commit matrix from 4 configurations (different scala/hadoop > versions) to 1. > > The separation into multiple build profiles was only a work-around for > the 50m limit on Travis. Running tests in parallel has the obvious > potential of reducing run time, but we're currently hitting a hard limit > since a few modules (flink-tests, flink-runtime, > flink-table-planner-blink) are so loaded with tests that they nearly > consume an entire profile by themselves (and thus no further splitting > is possible). > > The rework that introduced stages, at the time of introduction, did also > not provide a speed up, although this changed slightly once more > profiles were added and some optimizations to the caching have been made. > > Very recently we modified the surefire-plugin configuration for > flink-table-planner-blink to reuse JVM forks for IT cases, providing a > significant speedup (18 minutes!). So far we have not seen any negative > consequences. > > > Suggestions > > This is a list of /all /suggestions for reducing run/total times that I > have seen recently (in other words, they aren't necessarily mine nor may > I agree with all of them). > > 1. Enable JVM reuse for IT cases in more modules. 
> * We've seen significant speedups in the blink planner, and this > should be applicable for all modules. However, I presume there's > a reason why we disabled JVM reuse (information on this would be > appreciated) > 2. Custom differential build scripts > * Setup custom scripts for determining which modules might be > affected by change, and manipulate the splits accordingly. This > approach is conceptually quite straight-forward, but has limits > since it has to be pessimistic; i.e. a change in flink-core > _must_ result in testing all modules. > 3. Only run smoke tests when PR is opened, run heavy tests on demand. > * With the introduction of the ci-bot we now have significantly > more options on how to handle PR builds. One option could be to > only run basic tests when the PR is created (which may be only > modified modules, or all unit tests, or another low-cost > scheme), and then have a committer trigger other builds (full > test run, e2e tests, etc...) on demand. > 4. Move more tests into cron builds > * The budget version of 3); move certain tests that are either > expensive (like some runtime tests that take minutes) or in > rarely modified modules (like gelly) into cron jobs. > 5. Gradle > * Gradle was brought up a few times for it's built-in support for > differential builds; basically providing 2) without the overhead >
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi @Timo Walther @Dawid Wysakowicz: Now, flink-planner have some legacy DataTypes: like: legacy decimal, legacy basic array type info... And If the new type inference infer a Decimal/VarChar with precision, there should will fail in TypeConversions. The better we do on DataType, the more problems we will have with TypeInformation conversion, and the new TypeInference is a lot of precision related. What do you think? 1. Should TypeConversions support all data types and flink-planner support new types?2. Or do a special conversion between flink-planner and type inference. (There are many problems with the conversion between TypeInformation and DataType, and I think we should solve them completely in 1.10.) Best, Jingsong Lee -- From:JingsongLee Send Time:2019年8月15日(星期四) 10:31 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Hi jark: I'll add a chapter to list blink planner extended functions. Best, Jingsong Lee -- From:Jark Wu Send Time:2019年8月15日(星期四) 05:12 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding to the function behavior, shall we also include new functions from blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc..) ? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: > Hi Jingsong, > > thanks for writing down this FLIP. Big +1 from my side to finally get > rid of PlannerExpressions and have consistent and well-defined behavior > for Table API and SQL updated to FLIP-37. > > We might need to discuss some of the behavior of particular functions > but this should not affect the actual FLIP-51. > > Regards, > Timo > > > Am 13.08.19 um 12:55 schrieb JingsongLee: > > Hi everyone, > > > > We would like to start a discussion thread on "FLIP-51: Rework of the > > Expression Design"(Design doc: [1], FLIP: [2]), where we describe how > > to improve the new java Expressions to work with type inference and > > convert expression to the calcite RexNode. This is a follow-up plan > > for FLIP-32[3] and FLIP-37[4]. This FLIP is mostly based on FLIP-37. > > > > This FLIP addresses several shortcomings of current: > > - New Expressions still use PlannerExpressions to type inference and > > to RexNode. Flnk-planner and blink-planner have a lot of repetitive > code > > and logic. > > - Let TableApi and Cacite definitions consistent. > > - Reduce the complexity of Function development. > > - Powerful Function for user. > > > > Key changes can be summarized as follows: > > - Improve the interface of FunctionDefinition. > > - Introduce type inference for built-in functions. > > - Introduce ExpressionConverter to convert Expression to calcite > > RexNode. > > - Remove repetitive code and logic in planners. > > > > I also listed type inference and behavior of all built-in functions [5], > to > > verify that the interface is satisfied. After introduce type inference to > > table-common module, planners should have a unified function behavior. > > And this gives the community also the chance to quickly discuss types > > and behavior of functions a last time before they are declared stable. > > > > Looking forward to your feedbacks. Thank you. 
> > > > [1] > https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing > > [2] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design > > [3] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions > > [4] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System > > [5] > https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing > > > > Best, > > Jingsong Lee > > >
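To make the TypeConversions problem above concrete, here is a small, hedged illustration based on the 1.9 type utilities (not code from this thread); the exact failure mode may differ per type, but the point is that a DataType carrying precision has no faithful legacy TypeInformation counterpart.

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.utils.TypeConversions;

public class PrecisionConversionSketch {

    public static void main(String[] args) {
        // A DECIMAL with an explicit precision/scale, as the new type inference would produce it.
        DataType preciseDecimal = DataTypes.DECIMAL(10, 2);

        // Converting back to a legacy TypeInformation for the old flink-planner is where the
        // trouble starts: the legacy type system has nowhere to keep (10, 2), so this call is
        // expected to fail (or to drop the precision, depending on the type).
        TypeInformation<?> legacy = TypeConversions.fromDataTypeToLegacyInfo(preciseDecimal);
        System.out.println(legacy);
    }
}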
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi jark: I'll add a chapter to list blink planner extended functions. Best, Jingsong Lee -- From:Jark Wu Send Time:2019年8月15日(星期四) 05:12 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding to the function behavior, shall we also include new functions from blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc..) ? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: > Hi Jingsong, > > thanks for writing down this FLIP. Big +1 from my side to finally get > rid of PlannerExpressions and have consistent and well-defined behavior > for Table API and SQL updated to FLIP-37. > > We might need to discuss some of the behavior of particular functions > but this should not affect the actual FLIP-51. > > Regards, > Timo > > > Am 13.08.19 um 12:55 schrieb JingsongLee: > > Hi everyone, > > > > We would like to start a discussion thread on "FLIP-51: Rework of the > > Expression Design"(Design doc: [1], FLIP: [2]), where we describe how > > to improve the new java Expressions to work with type inference and > > convert expression to the calcite RexNode. This is a follow-up plan > > for FLIP-32[3] and FLIP-37[4]. This FLIP is mostly based on FLIP-37. > > > > This FLIP addresses several shortcomings of current: > > - New Expressions still use PlannerExpressions to type inference and > > to RexNode. Flnk-planner and blink-planner have a lot of repetitive > code > > and logic. > > - Let TableApi and Cacite definitions consistent. > > - Reduce the complexity of Function development. > > - Powerful Function for user. > > > > Key changes can be summarized as follows: > > - Improve the interface of FunctionDefinition. > > - Introduce type inference for built-in functions. > > - Introduce ExpressionConverter to convert Expression to calcite > > RexNode. > > - Remove repetitive code and logic in planners. > > > > I also listed type inference and behavior of all built-in functions [5], > to > > verify that the interface is satisfied. After introduce type inference to > > table-common module, planners should have a unified function behavior. > > And this gives the community also the chance to quickly discuss types > > and behavior of functions a last time before they are declared stable. > > > > Looking forward to your feedbacks. Thank you. > > > > [1] > https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing > > [2] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design > > [3] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions > > [4] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System > > [5] > https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing > > > > Best, > > Jingsong Lee > > >
[jira] [Created] (FLINK-13734) Support DDL in SQL CLI
Jark Wu created FLINK-13734: --- Summary: Support DDL in SQL CLI Key: FLINK-13734 URL: https://issues.apache.org/jira/browse/FLINK-13734 Project: Flink Issue Type: New Feature Components: Table SQL / Client Reporter: Jark Wu We already support DDL in TableEnvironment. We should also support executing DDL in the SQL Client so that the feature is easier to use. However, this may require changes to the current architecture of the SQL Client. A more detailed design should be attached and discussed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
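For context, a hedged sketch of the DDL support that already exists in TableEnvironment (via sqlUpdate) and that this ticket proposes to expose in the SQL Client as well; the table definition and connector properties below are illustrative only.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DdlInTableEnvironmentSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Today a DDL statement like this is executed through sqlUpdate();
        // FLINK-13734 is about being able to type the same statement into the SQL CLI.
        tEnv.sqlUpdate(
                "CREATE TABLE orders (" +
                "  order_id BIGINT," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector.type' = 'filesystem'," +
                "  'connector.path' = 'file:///tmp/orders.csv'," +
                "  'format.type' = 'csv'" +
                ")");
    }
}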
Re: [DISCUSS] FLIP-52: Remove legacy Program interface.
+1 Thanks Kostas for pushing this. Thanks, Biao /'bɪ.aʊ/ On Thu, 15 Aug 2019 at 16:03, Kostas Kloudas wrote: > Thanks a lot for the quick response! > I will consider the Flink Accepted and will start working on it. > > Cheers, > Kostas > > On Thu, Aug 15, 2019 at 5:29 AM SHI Xiaogang > wrote: > > > > +1 > > > > Glad that programming with flink becomes simpler and easier. > > > > Regards, > > Xiaogang > > > > Aljoscha Krettek 于2019年8月14日周三 下午11:31写道: > > > > > +1 (for the same reasons I posted on the other thread) > > > > > > > On 14. Aug 2019, at 15:03, Zili Chen wrote: > > > > > > > > +1 > > > > > > > > It could be regarded as part of Flink client api refactor. > > > > Removal of stale code paths helps reason refactor. > > > > > > > > There is one thing worth attention that in this thread[1] Thomas > > > > suggests an interface with a method return JobGraph based on the > > > > fact that REST API and in per job mode actually extracts the JobGraph > > > > from user program and submit it instead of running user program and > > > > submission happens inside the program in session scenario. > > > > > > > > Such an interface would be like > > > > > > > > interface Program { > > > > JobGraph getJobGraph(args); > > > > } > > > > > > > > Anyway, the discussion above could be continued in that thread. > > > > Current Program is a legacy class that quite less useful than it > should > > > be. > > > > > > > > Best, > > > > tison. > > > > > > > > [1] > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/REST-API-JarRunHandler-More-flexibility-for-launching-jobs-td31026.html#a31168 > > > > > > > > > > > > Stephan Ewen 于2019年8月14日周三 下午7:50写道: > > > > > > > >> +1 > > > >> > > > >> the "main" method is the overwhelming default. getting rid of "two > ways > > > to > > > >> do things" is a good idea. > > > >> > > > >> On Wed, Aug 14, 2019 at 1:42 PM Kostas Kloudas > > > wrote: > > > >> > > > >>> Hi all, > > > >>> > > > >>> As discussed in [1] , the Program interface seems to be outdated > and > > > >>> there seems to be > > > >>> no objection to remove it. > > > >>> > > > >>> Given that this interface is PublicEvolving, its removal should > pass > > > >>> through a FLIP and > > > >>> this discussion and the associated FLIP are exactly for that > purpose. > > > >>> > > > >>> Please let me know what you think and if it is ok to proceed with > its > > > >>> removal. > > > >>> > > > >>> Cheers, > > > >>> Kostas > > > >>> > > > >>> link to FLIP-52 : > > > >>> > > > >> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125308637 > > > >>> > > > >>> [1] > > > >>> > > > >> > > > > https://lists.apache.org/x/thread.html/7ffc9936a384b891dbcf0a481d26c6d13b2125607c200577780d1e18@%3Cdev.flink.apache.org%3E > > > >>> > > > >> > > > > > > >
Re: [DISCUSS] FLIP-52: Remove legacy Program interface.
Thanks a lot for the quick response! I will consider the FLIP accepted and will start working on it. Cheers, Kostas On Thu, Aug 15, 2019 at 5:29 AM SHI Xiaogang wrote: > > +1 > > Glad that programming with flink becomes simpler and easier. > > Regards, > Xiaogang > > Aljoscha Krettek wrote on Wed, Aug 14, 2019 at 11:31 PM: > > +1 (for the same reasons I posted on the other thread) > > > > > On 14. Aug 2019, at 15:03, Zili Chen wrote: > > > > > > +1 > > > > > > It could be regarded as part of Flink client api refactor. > > > Removal of stale code paths helps reason refactor. > > > > > > There is one thing worth attention that in this thread[1] Thomas > > > suggests an interface with a method return JobGraph based on the > > > fact that REST API and in per job mode actually extracts the JobGraph > > > from user program and submit it instead of running user program and > > > submission happens inside the program in session scenario. > > > > > > Such an interface would be like > > > > > > interface Program { > > > JobGraph getJobGraph(args); > > > } > > > > > > Anyway, the discussion above could be continued in that thread. > > > Current Program is a legacy class that quite less useful than it should > > be. > > > > > > Best, > > > tison. > > > > > > [1] > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/REST-API-JarRunHandler-More-flexibility-for-launching-jobs-td31026.html#a31168 > > > > > > > > > Stephan Ewen wrote on Wed, Aug 14, 2019 at 7:50 PM: > > > > > >> +1 > > >> > > >> the "main" method is the overwhelming default. getting rid of "two ways > > to > > >> do things" is a good idea. > > >> > > >> On Wed, Aug 14, 2019 at 1:42 PM Kostas Kloudas > > wrote: > > >> > > >>> Hi all, > > >>> > > >>> As discussed in [1] , the Program interface seems to be outdated and > > >>> there seems to be > > >>> no objection to remove it. > > >>> > > >>> Given that this interface is PublicEvolving, its removal should pass > > >>> through a FLIP and > > >>> this discussion and the associated FLIP are exactly for that purpose. > > >>> > > >>> Please let me know what you think and if it is ok to proceed with its > > >>> removal. > > >>> > > >>> Cheers, > > >>> Kostas > > >>> > > >>> link to FLIP-52 : > > >>> > > >> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125308637 > > >>> > > >>> [1] > > >>> > > >> > > https://lists.apache.org/x/thread.html/7ffc9936a384b891dbcf0a481d26c6d13b2125607c200577780d1e18@%3Cdev.flink.apache.org%3E > > >>> > > >> > > > >
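As a concrete (made-up) illustration of the two entry points discussed in this thread: the legacy org.apache.flink.api.common.Program interface that FLIP-52 removes, versus the plain main() method that remains the default way to define a job.

import org.apache.flink.api.java.ExecutionEnvironment;

// The conventional entry point that stays after FLIP-52: the client simply invokes main(),
// the program builds its plan there and calls execute()/print(); no Program interface needed.
public class MainMethodJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("to", "be", "or", "not", "to", "be").print();
    }
}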
Re: Checkpointing under backpressure
Hi, Very thanks for the great points! For the prioritizing inputs, from another point of view, I think it might not cause other bad effects, since we do not need to totally block the channels that have seen barriers after the operator has taking snapshot. After the snapshotting, if the channels that has not seen barriers have buffers, we could first logging and processing these buffers and if they do not have buffers, we can still processing the buffers from the channels that has seen barriers. Therefore, It seems prioritizing inputs should be able to accelerate the checkpoint without other bad effects. and @zhijiangFor making the unaligned checkpoint the only mechanism for all cases, I still think we should allow a configurable timeout after receiving the first barrier so that the channels may get "drained" during the timeout, as pointed out by Stephan. With such a timeout, we are very likely not need to snapshot the input buffers, which would be very similar to the current aligned checkpoint mechanism. Best, Yun -- From:zhijiang Send Time:2019 Aug. 15 (Thu.) 02:22 To:dev Subject:Re: Checkpointing under backpressure > For the checkpoint to complete, any buffer that > arrived prior to the barrier would be to be part of the checkpointed state. Yes, I agree. > So wouldn't it be important to finish persisting these buffers as fast as > possible by prioritizing respective inputs? The task won't be able to > process records from the inputs that have seen the barrier fast when it is > already backpressured (or causing the backpressure). My previous understanding of prioritizing inputs is from task processing aspect after snapshot state. If from the persisting buffers aspect, I think it might be up to how we implement it. If we only tag/reference which buffers in inputs be the part of state, and make the real persisting work is done in async way. That means the already tagged buffers could be processed by task w/o priority. And only after all the persisting work done, the task would report to coordinator of finished checkpoint on its side. The key point is how we implement to make task could continue processing buffers as soon as possible. Thanks for the further explannation of requirements for speeding up checkpoints in backpressure scenario. To make the savepoint finish quickly and then tune the setting to avoid backpressure is really a pratical case. I think this solution could cover this concern. Best, Zhijiang -- From:Thomas Weise Send Time:2019年8月14日(星期三) 19:48 To:dev ; zhijiang Subject:Re: Checkpointing under backpressure --> On Wed, Aug 14, 2019 at 10:23 AM zhijiang wrote: > Thanks for these great points and disccusions! > > 1. Considering the way of triggering checkpoint RPC calls to all the tasks > from Chandy Lamport, it combines two different mechanisms together to make > sure that the trigger could be fast in different scenarios. > But in flink world it might be not very worth trying that way, just as > Stephan's analysis for it. Another concern is that it might bring more > heavy loads for JobMaster broadcasting this checkpoint RPC to all the tasks > in large scale job, especially for the very short checkpoint interval. > Furthermore it would also cause other important RPC to be executed delay to > bring potentail timeout risks. > > 2. I agree with the idea of drawing on the way "take state snapshot on > first barrier" from Chandy Lamport instead of barrier alignment combining > with unaligned checkpoints in flink. 
> > > The benefit would be less latency increase in the channels which > already have received barriers. > > However, as mentioned before, not prioritizing the inputs from > which barriers are still missing can also have an adverse effect. > > I think we will not have an adverse effect if not prioritizing the inputs > w/o barriers in this case. After sync snapshot, the task could actually > process any input channels. For the input channel receiving the first > barrier, we already have the obvious boundary for persisting buffers. For > other channels w/o barriers we could persist the following buffers for > these channels until barrier arrives in network. Because based on the > credit based flow control, the barrier does not need credit to transport, > then as long as the sender overtakes the barrier accross the output queue, > the network stack would transport this barrier immediately no matter with > the inputs condition on receiver side. So there is no requirements to > consume accumulated buffers in these channels for higher priority. If so it > seems that we will not waste any CPU cycles as Piotr concerns before. > I'm not sure I follow this. For the checkpoint to complete, any buffer that arrived prior to the barrier would be to be part of the checkpointed state. So wouldn't it be important to
[DISCUSS] Reducing build times
Hello everyone, improving our build times is a hot topic at the moment so let's discuss the different ways how they could be reduced. Current state: First up, let's look at some numbers: 1 full build currently consumes 5h of build time total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may fluctuate of course depending on the current Travis load. This applies both to builds on the Apache and flink-ci Travis. At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in), however we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze. (Note: Recently optimizations where added to ci-bot where pending builds are canceled if a new commit was pushed to the PR or the PR was closed, which should prove especially useful during the rush hours we see before feature-freezes.) Past approaches Over the years we have done rather few things to improve this situation (hence our current predicament). Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1. The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible). The rework that introduced stages, at the time of introduction, did also not provide a speed up, although this changed slightly once more profiles were added and some optimizations to the caching have been made. Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences. Suggestions This is a list of /all /suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine nor may I agree with all of them). 1. Enable JVM reuse for IT cases in more modules. * We've seen significant speedups in the blink planner, and this should be applicable for all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated) 2. Custom differential build scripts * Setup custom scripts for determining which modules might be affected by change, and manipulate the splits accordingly. This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules. 3. Only run smoke tests when PR is opened, run heavy tests on demand. * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc...) on demand. 4. 
Move more tests into cron builds * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs. 5. Gradle * Gradle was brought up a few times for it's built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts. * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc). * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (f they build from source). 6. CI service * Our current artifact caching setup on Travis is basically a hack; we're basically abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be. * There are CI services that provide build artifact caching out of the box, which could be useful for us. * To date, no PoC for using another CI service has been pr
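For reference, a hedged sketch of what the fork-reuse change from suggestion 1 looks like in a module's pom.xml; the values actually used for flink-table-planner-blink may differ.

<!-- Illustrative only: let surefire reuse one forked JVM for a module's (IT) tests
     instead of paying the JVM start-up cost for every test class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>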
Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer
Congrats Andrey! Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao : > Congratulations Andrey, well deserved! > > Best, > Gary > > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > > > Congratulations Andrey! > > > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong wrote: > > > >> Congratulations Andrey! > >> > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok wrote: > >> > >> > Congratulations Andrey! > >> > At 2019-08-14 21:26:37, "Till Rohrmann" wrote: > >> > >Hi everyone, > >> > > > >> > >I'm very happy to announce that Andrey Zagrebin accepted the offer of > >> the > >> > >Flink PMC to become a committer of the Flink project. > >> > > > >> > >Andrey has been an active community member for more than 15 months. > He > >> has > >> > >helped shaping numerous features such as State TTL, FRocksDB release, > >> > >Shuffle service abstraction, FLIP-1, result partition management and > >> > >various fixes/improvements. He's also frequently helping out on the > >> > >user@f.a.o mailing lists. > >> > > > >> > >Congratulations Andrey! > >> > > > >> > >Best, Till > >> > >(on behalf of the Flink PMC) > >> > > >> > > >
Re: Checkpointing under backpressure
@Thomas just to double check: - parallelism and configuration changes should be well possible on unaligned checkpoints - changes in state types and JobGraph structure would be tricky, and changing the on-the-wire types would not be possible. On Wed, Aug 14, 2019 at 7:48 PM Thomas Weise wrote: > --> > > On Wed, Aug 14, 2019 at 10:23 AM zhijiang > wrote: > > > Thanks for these great points and disccusions! > > > > 1. Considering the way of triggering checkpoint RPC calls to all the > tasks > > from Chandy Lamport, it combines two different mechanisms together to > make > > sure that the trigger could be fast in different scenarios. > > But in flink world it might be not very worth trying that way, just as > > Stephan's analysis for it. Another concern is that it might bring more > > heavy loads for JobMaster broadcasting this checkpoint RPC to all the > tasks > > in large scale job, especially for the very short checkpoint interval. > > Furthermore it would also cause other important RPC to be executed delay > to > > bring potentail timeout risks. > > > > 2. I agree with the idea of drawing on the way "take state snapshot on > > first barrier" from Chandy Lamport instead of barrier alignment combining > > with unaligned checkpoints in flink. > > > > > The benefit would be less latency increase in the channels which > > already have received barriers. > > > However, as mentioned before, not prioritizing the inputs from > > which barriers are still missing can also have an adverse effect. > > > > I think we will not have an adverse effect if not prioritizing the inputs > > w/o barriers in this case. After sync snapshot, the task could actually > > process any input channels. For the input channel receiving the first > > barrier, we already have the obvious boundary for persisting buffers. For > > other channels w/o barriers we could persist the following buffers for > > these channels until barrier arrives in network. Because based on the > > credit based flow control, the barrier does not need credit to transport, > > then as long as the sender overtakes the barrier accross the output > queue, > > the network stack would transport this barrier immediately no matter with > > the inputs condition on receiver side. So there is no requirements to > > consume accumulated buffers in these channels for higher priority. If so > it > > seems that we will not waste any CPU cycles as Piotr concerns before. > > > > I'm not sure I follow this. For the checkpoint to complete, any buffer that > arrived prior to the barrier would be to be part of the checkpointed state. > So wouldn't it be important to finish persisting these buffers as fast as > possible by prioritizing respective inputs? The task won't be able to > process records from the inputs that have seen the barrier fast when it is > already backpressured (or causing the backpressure). > > > > > > 3. Suppose the unaligned checkpoints performing well in practice, is it > > possible to make it as the only mechanism for handling all the cases? I > > mean for the non-backpressure scenario, there are less buffers even empty > > in input/output queue, then the "overtaking barrier--> trigger snapshot > on > > first barrier--> persist buffers" might still work well. So we do not > need > > to maintain two suits of mechanisms finally. > > > > 4. The initial motivation of this dicussion is for checkpoint timeout in > > backpressure scenario. 
If we adjust the default timeout to a very big > > value, that means the checkpoint would never timeout and we only need to > > wait it finish. Then are there still any other problems/concerns if > > checkpoint takes long time to finish? Althougn we already knew some > issues > > before, it is better to gather more user feedbacks to confirm which > aspects > > could be solved in this feature design. E.g. the sink commit delay might > > not be coverd by unaligned solution. > > > > Checkpoints taking too long is the concern that sparks this discussion > (timeout is just a symptom). The slowness issue also applies to the > savepoint use case. We would need to be able to take a savepoint fast in > order to roll forward a fix that can alleviate the backpressure (like > changing parallelism or making a different configuration change). > > > > > > Best, > > Zhijiang > > -- > > From:Stephan Ewen > > Send Time:2019年8月14日(星期三) 17:43 > > To:dev > > Subject:Re: Checkpointing under backpressure > > > > Quick note: The current implementation is > > > > Align -> Forward -> Sync Snapshot Part (-> Async Snapshot Part) > > > > On Wed, Aug 14, 2019 at 5:21 PM Piotr Nowojski > > wrote: > > > > > > Thanks for the great ideas so far. > > > > > > +1 > > > > > > Regarding other things raised, I mostly agree with Stephan. > > > > > > I like the idea of simultaneously starting the checkpoint everywhere > via > > > RPC call (especially in cases where Tasks are busy doing some > s
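One recurring point above is raising the checkpoint timeout instead of letting checkpoints expire under backpressure; a hedged sketch of where that knob lives in the DataStream API follows (the interval, timeout, and job are illustrative only).

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every minute...
        env.enableCheckpointing(60_000L);
        // ...and give it much longer than the 10-minute default to complete, so that a
        // backpressured job does not keep failing checkpoints with "expired before completing".
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60_000L);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-timeout-sketch");
    }
}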