[jira] [Created] (FLINK-13744) Improper error message when submit flink job in yarn-cluster mode without hadoop lib bundled
Jeff Zhang created FLINK-13744: -- Summary: Improper error message when submit flink job in yarn-cluster mode without hadoop lib bundled Key: FLINK-13744 URL: https://issues.apache.org/jira/browse/FLINK-13744 Project: Flink Issue Type: Improvement Components: Command Line Client Affects Versions: 1.9.0 Reporter: Jeff Zhang Here's the error message I see when I submit a Flink job in yarn-cluster mode. The root cause is that I didn't build Flink with the include-hadoop profile, but the error message doesn't tell the user what actually went wrong. We should improve the error message in this case.
{code:java}
➜ build-target git:(FLINK-13415) ✗ bin/flink run -m yarn-cluster examples/batch/WordCount.jar

The program finished with the following exception:

java.lang.RuntimeException: Could not identify hostname and port in 'yarn-cluster'.
    at org.apache.flink.client.ClientUtils.parseHostPortAddress(ClientUtils.java:47)
    at org.apache.flink.client.cli.AbstractCustomCommandLine.applyCommandLineOptionsToConfiguration(AbstractCustomCommandLine.java:83)
    at org.apache.flink.client.cli.DefaultCLI.createClusterDescriptor(DefaultCLI.java:60)
    at org.apache.flink.client.cli.DefaultCLI.createClusterDescriptor(DefaultCLI.java:35)
    at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:216)
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
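For illustration, a minimal sketch of the kind of check that could yield a clearer message. This is not existing Flink code; the class, method, and message text below are assumptions.
{code:java}
// Minimal sketch only (not actual Flink code): detect the missing-Hadoop case before
// falling back to host:port parsing, so the user gets an actionable hint instead of
// "Could not identify hostname and port in 'yarn-cluster'".
public class YarnTargetCheck {

    static void checkYarnClusterTarget(String jobManagerAddress) {
        if (!"yarn-cluster".equals(jobManagerAddress)) {
            return; // a plain host:port target, nothing to check here
        }
        try {
            // The YARN command line is only usable when Hadoop/YARN classes are on the classpath.
            Class.forName("org.apache.hadoop.yarn.conf.YarnConfiguration");
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(
                "The deployment target 'yarn-cluster' was requested, but no Hadoop/YARN "
                    + "dependencies were found on the classpath. Build Flink with the "
                    + "'include-hadoop' profile or export HADOOP_CLASSPATH before submitting.", e);
        }
    }
}
{code}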
[jira] [Created] (FLINK-13743) Port PythonTableUtils to flink-python module
Dian Fu created FLINK-13743: --- Summary: Port PythonTableUtils to flink-python module Key: FLINK-13743 URL: https://issues.apache.org/jira/browse/FLINK-13743 Project: Flink Issue Type: Task Components: API / Python Reporter: Dian Fu Fix For: 1.10.0 Currently *PythonTableUtils* is located in the *flink-table-planner* module; however, it is shared by both the Flink planner and the Blink planner. It makes more sense to move it to the *flink-python* module. This also implies that we should port it from Scala to Java. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Jark, Thanks for letting me know that it's been like this in previous releases. Though I don't think that's the right behavior, it can be discussed for later release. Thus I retract my -1 for RC2. Bowen On Thu, Aug 15, 2019 at 7:49 PM Jark Wu wrote: > Hi Bowen, > > Thanks for reporting this. > However, I don't think this is an issue. IMO, it is by design. > The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;` in > SQL CLI are intended to return only the registered UDFs, not including > built-in functions. > This is also the behavior in previous versions. > > Best, > Jark > > On Fri, 16 Aug 2019 at 06:52, Bowen Li wrote: > > > -1 for RC2. > > > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I > > think it's a blocker. The bug means currently if users call > > `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru > > SQL would not be able to see Flink's built-in functions. > > > > I'm preparing a fix right now. > > > > Bowen > > > > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai > > > wrote: > > > > > Thanks for all the test efforts, verifications and votes so far. > > > > > > So far, things are looking good, but we still require one more PMC > > binding > > > vote for this RC to be the official release, so I would like to extend > > the > > > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > > > > > In the meantime, the release notes for 1.9.0 had only just been > finalized > > > [1], and could use a few more eyes before closing the vote. > > > Any help with checking if anything else should be mentioned there > > regarding > > > breaking changes / known shortcomings would be appreciated. > > > > > > Cheers, > > > Gordon > > > > > > [1] https://github.com/apache/flink/pull/9438 > > > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > > > > > Great, then I have no other comments on legal check. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > > > > wrote: > > > > > > > > > The licensing items aren't a problem; we don't care about Flink > > modules > > > > > in NOTICE files, and we don't have to update the source-release > > > > > licensing since we don't have a pre-built version of the WebUI in > the > > > > > source. > > > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > > > After going through the licenses, I found 2 suspicions but not > sure > > > if > > > > > they > > > > > > are > > > > > > valid or not. > > > > > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, > but > > > not > > > > > > included in > > > > > > NOTICE-binary file (the one under the root directory) like other > > > > modules. > > > > > > 2. flink-runtime-web distributed some JavaScript dependencies > > through > > > > > source > > > > > > codes, the licenses and NOTICE file were only updated inside the > > > module > > > > > of > > > > > > flink-runtime-web, but not the NOTICE file and licenses directory > > > which > > > > > > under > > > > > > the root directory. > > > > > > > > > > > > Another minor issue I just found is: > > > > > > FLINK-13558 tries to include table examples to flink-dist, but I > > > cannot > > > > > > find it in > > > > > > the binary distribution of RC2. 
> > > > > > > > > > > > Best, > > > > > > Kurt > > > > > > > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young > > wrote: > > > > > > > > > > > >> Hi Gordon & Timo, > > > > > >> > > > > > >> Thanks for the feedback, and I agree with it. I will document > this > > > in > > > > > the > > > > > >> release notes. > > > > > >> > > > > > >> Best, > > > > > >> Kurt > > > > > >> > > > > > >> > > > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > > > tzuli...@apache.org> > > > > > >> wrote: > > > > > >> > > > > > >>> Hi Kurt, > > > > > >>> > > > > > >>> With the same argument as before, given that it is mentioned in > > the > > > > > >>> release > > > > > >>> announcement that it is a preview feature, I would not block > this > > > > > release > > > > > >>> because of it. > > > > > >>> Nevertheless, it would be important to mention this explicitly > in > > > the > > > > > >>> release notes [1]. > > > > > >>> > > > > > >>> Regards, > > > > > >>> Gordon > > > > > >>> > > > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > > > >>> > > > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther < > > twal...@apache.org> > > > > > wrote: > > > > > >>> > > > > > Hi Kurt, > > > > > > > > > > I agree that this is a serious bug. However, I would not block > > the > > > > > release because of this. As you said, there is a workaround > and > > > the > > > > > `execute()` works in the most common case of a single > execution. > > > We > > > > > can > > > > > fix this in a minor release shortly after. > > > > > > > > > > What do others think? > > > > > > > > > > Regards, > > > > > Ti
[jira] [Created] (FLINK-13742) Fix code generation when aggregation contains both distinct aggregate with and without filter
Jark Wu created FLINK-13742: --- Summary: Fix code generation when aggregation contains both distinct aggregate with and without filter Key: FLINK-13742 URL: https://issues.apache.org/jira/browse/FLINK-13742 Project: Flink Issue Type: Bug Components: Table SQL / Planner Reporter: Jark Wu Fix For: 1.9.1 The following test will fail when the aggregation contains {{COUNT(DISTINCT c)}} and {{COUNT(DISTINCT c) filter ...}}.
{code:java}
@Test
def testDistinctWithMultiFilter(): Unit = {
  val sqlQuery = "SELECT b, " +
    " SUM(DISTINCT (a * 3)), " +
    " COUNT(DISTINCT SUBSTRING(c FROM 1 FOR 2))," +
    " COUNT(DISTINCT c)," +
    " COUNT(DISTINCT c) filter (where MOD(a, 3) = 0)," +
    " COUNT(DISTINCT c) filter (where MOD(a, 3) = 1) " +
    "FROM MyTable " +
    "GROUP BY b"

  val t = failingDataSource(StreamTestData.get3TupleData).toTable(tEnv).as('a, 'b, 'c)
  tEnv.registerTable("MyTable", t)

  val result = tEnv.sqlQuery(sqlQuery).toRetractStream[Row]
  val sink = new TestingRetractSink
  result.addSink(sink)
  env.execute()

  val expected = List(
    "1,3,1,1,0,1",
    "2,15,1,2,1,0",
    "3,45,3,3,1,1",
    "4,102,1,4,1,2",
    "5,195,1,5,2,1",
    "6,333,1,6,2,2")
  assertEquals(expected.sorted, sink.getRetractResults.sorted)
}
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] Reducing build times
Thanks Chesnay for starting this discussion. +1 for #1, it might be the easiest way to get a significant speedup. If the only reason we don't reuse JVMs is isolation, I think we can fix the static fields or global state used in Flink where possible. +1 for #2, and thanks Aleksey for the prototype. I think it's a good approach that doesn't introduce too many things to maintain. +1 for #3 (run CRON or e2e tests on demand). We have this requirement when reviewing some pull requests, because we aren't sure whether they will break some specific e2e test. Currently, we have to run them locally by building the whole project, or enable CRON jobs for the pushed branch in the contributor's own Travis. Besides that, I think FLINK-11464 [1] is also a good way to cache distributions and save a lot of download time. Best, Jark [1]: https://issues.apache.org/jira/browse/FLINK-11464 On Thu, 15 Aug 2019 at 21:47, Aleksey Pak wrote: > Hi all! > > Thanks for starting this discussion. > > I'd like to also add my 2 cents: > > +1 for #2, differential build scripts. > I've worked on the approach. And with it, I think it's possible to reduce > total build time with relatively low effort, without enforcing any new > build tool and low maintenance cost. > > You can check a proposed change (for the old CI setup, when Flink PRs were > running in Apache common CI pool) here: > https://github.com/apache/flink/pull/9065 > In the proposed change, the dependency check is not heavily hardcoded and > just uses maven's results for dependency graph analysis. > > > This approach is conceptually quite straight-forward, but has limits > since it has to be pessimistic; > i.e. a change in flink-core _must_ result > in testing all modules. > > Agree, in Flink case, there are some core modules that would trigger whole > tests run with such approach. For developers who modify such components, > the build time would be the longest. But this approach should really help > for developers who touch more-or-less independent modules. > > Even for core modules, it's possible to create "abstraction" barriers by > changing dependency graph. For example, it can look like: flink-core-api > <-- flink-core, flink-core-api <-- flink-connectors. > In that case, only change in flink-core-api would trigger whole tests run. > > +1 for #3, separating PR CI runs to different stages. > Imo, it may require more change to current CI setup, compared to #2 and > better it should not be silly. Best, if it integrates with the Flink bot > and triggers some follow up build steps only when some prerequisites are > done. > > +1 for #4, to move some tests into cron runs. > But imo, this does not scale well, it applies only to a small subset of > tests. > > +1 for #6, to use other CI service(s). > More specifically, GitHub gives build actions for free that can be used to > offload some build steps/PR checks. It can help to move out some PR checks > from the main CI build (for example: documentation builds, license checks, > code formatting checks). > > Regards, > Aleksey > > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann > wrote: > > > Thanks for starting this discussion Chesnay. I think it has become > obvious > > to the Flink community that with the existing build setup we cannot > really > > deliver fast build times which are essential for fast iteration cycles > and > > high developer productivity. The reasons for this situation are manifold > > but it is definitely affected by Flink's project growth, not always > optimal > > tests and the inflexibility that everything needs to be built.
Hence, I > > consider the reduction of build times crucial for the project's health > and > > future growth. > > > > Without necessarily voicing a strong preference for any of the presented > > suggestions, I wanted to comment on each of them: > > > > 1. This sounds promising. Could the reason why we don't reuse JVMs date > > back to the time when we still had a lot of static fields in Flink which > > made it hard to reuse JVMs and the potentially mutated global state? > > > > 2. Building hand-crafted solutions around a build system in order to > > compensate for its limitations which other build systems support out of > the > > box sounds like the not invented here syndrome to me. Reinventing the > wheel > > has historically proven to be usually not the best solution and it often > > comes with a high maintenance price tag. Moreover, it would add just > > another layer of complexity around our existing build system. I think the > > current state where we have the maven setup in pom files and for Travis > > multiple bash scripts specializing the builds to make it fit the time > limit > > is already not very transparent/easy to understand. > > > > 3. I could see this work but it also requires a very good understanding > of > > Flink of every committer because the committer needs to know which tests > > would be good to run additionally. > > > > 4. I would be against this option solely to decrease our build time. My
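To make the pessimistic module-selection idea behind option #2 concrete, here is a small, self-contained Java sketch (the dependency map is made up, not Flink's real module graph): the set of modules to retest is the reverse transitive closure of the changed module, so a change in flink-core selects everything while a change in a leaf module selects only itself.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of pessimistic test selection for a differential build: every module that
// (transitively) depends on a changed module must be retested. The dependency map
// below is illustrative only, not Flink's real module graph.
public class AffectedModules {

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = Map.of(
            "flink-core", List.of(),
            "flink-runtime", List.of("flink-core"),
            "flink-connectors", List.of("flink-core"),
            "flink-table-planner", List.of("flink-core", "flink-runtime"));

        System.out.println(affected("flink-core", dependsOn));       // all four modules
        System.out.println(affected("flink-connectors", dependsOn)); // only flink-connectors
    }

    static Set<String> affected(String changed, Map<String, List<String>> dependsOn) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changed));
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (!result.add(current)) {
                continue; // already visited
            }
            // enqueue every module that declares a direct dependency on 'current'
            dependsOn.forEach((module, deps) -> {
                if (deps.contains(current)) {
                    queue.add(module);
                }
            });
        }
        return result;
    }
}
{code}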
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Bowen, Thanks for reporting this. However, I don't think this is an issue. IMO, it is by design. The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;` in SQL CLI are intended to return only the registered UDFs, not including built-in functions. This is also the behavior in previous versions. Best, Jark On Fri, 16 Aug 2019 at 06:52, Bowen Li wrote: > -1 for RC2. > > I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I > think it's a blocker. The bug means currently if users call > `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru > SQL would not be able to see Flink's built-in functions. > > I'm preparing a fix right now. > > Bowen > > > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai > wrote: > > > Thanks for all the test efforts, verifications and votes so far. > > > > So far, things are looking good, but we still require one more PMC > binding > > vote for this RC to be the official release, so I would like to extend > the > > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > > > In the meantime, the release notes for 1.9.0 had only just been finalized > > [1], and could use a few more eyes before closing the vote. > > Any help with checking if anything else should be mentioned there > regarding > > breaking changes / known shortcomings would be appreciated. > > > > Cheers, > > Gordon > > > > [1] https://github.com/apache/flink/pull/9438 > > > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > > > Great, then I have no other comments on legal check. > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > > wrote: > > > > > > > The licensing items aren't a problem; we don't care about Flink > modules > > > > in NOTICE files, and we don't have to update the source-release > > > > licensing since we don't have a pre-built version of the WebUI in the > > > > source. > > > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > > After going through the licenses, I found 2 suspicions but not sure > > if > > > > they > > > > > are > > > > > valid or not. > > > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but > > not > > > > > included in > > > > > NOTICE-binary file (the one under the root directory) like other > > > modules. > > > > > 2. flink-runtime-web distributed some JavaScript dependencies > through > > > > source > > > > > codes, the licenses and NOTICE file were only updated inside the > > module > > > > of > > > > > flink-runtime-web, but not the NOTICE file and licenses directory > > which > > > > > under > > > > > the root directory. > > > > > > > > > > Another minor issue I just found is: > > > > > FLINK-13558 tries to include table examples to flink-dist, but I > > cannot > > > > > find it in > > > > > the binary distribution of RC2. > > > > > > > > > > Best, > > > > > Kurt > > > > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young > wrote: > > > > > > > > > >> Hi Gordon & Timo, > > > > >> > > > > >> Thanks for the feedback, and I agree with it. I will document this > > in > > > > the > > > > >> release notes. 
> > > > >> > > > > >> Best, > > > > >> Kurt > > > > >> > > > > >> > > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > > tzuli...@apache.org> > > > > >> wrote: > > > > >> > > > > >>> Hi Kurt, > > > > >>> > > > > >>> With the same argument as before, given that it is mentioned in > the > > > > >>> release > > > > >>> announcement that it is a preview feature, I would not block this > > > > release > > > > >>> because of it. > > > > >>> Nevertheless, it would be important to mention this explicitly in > > the > > > > >>> release notes [1]. > > > > >>> > > > > >>> Regards, > > > > >>> Gordon > > > > >>> > > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > > >>> > > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther < > twal...@apache.org> > > > > wrote: > > > > >>> > > > > Hi Kurt, > > > > > > > > I agree that this is a serious bug. However, I would not block > the > > > > release because of this. As you said, there is a workaround and > > the > > > > `execute()` works in the most common case of a single execution. > > We > > > > can > > > > fix this in a minor release shortly after. > > > > > > > > What do others think? > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > > > HI, > > > > > > > > > > We just find a serious bug around blink planner: > > > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > > > When user reused the table environment instance, and call > > `execute` > > > > method > > > > > multiple times for > > > > > different sql, the later call will trigger the earlier ones to > be > > > > > re-executed. > > > > > > > > > > It's a serious bug but seems we also have a work around, which
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Big +1 for this feature. This FLIP can help improves at least the following two scenarios: - Temporary data peak when using Heap StateBackend - Heap State Backend has better performance than RocksDBStateBackend, especially on SATA disk. there are some guys ever told me that they increased the parallelism of operators(and use HeapStateBackend) other than use RocksDBStateBackend to get better performance. But increase parallelism will have some other problems, after this FLIP, we can run Flink Job with the same parallelism as RocksDBStateBackend and get better performance also. Best, Congxian Yu Li 于2019年8月16日周五 上午12:14写道: > Thanks all for the reviews and comments! > > bq. From the implementation plan, it looks like this exists purely in a new > module and does not require any changes in other parts of Flink's code. Can > you confirm that? > Confirmed, thanks! > > Best Regards, > Yu > > > On Thu, 15 Aug 2019 at 18:04, Tzu-Li (Gordon) Tai > wrote: > > > +1 to start a VOTE for this FLIP. > > > > Given the properties of this new state backend and that it will exist as > a > > new module without touching the original heap backend, I don't see a harm > > in including this. > > Regarding design of the feature, I've already mentioned my comments in > the > > original discussion thread. > > > > Cheers, > > Gordon > > > > On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > > > > > Big +1 for this feature. > > > > > > Our customers including me, have ever met dilemma where we have to use > > > window to aggregate events in applications like real-time monitoring. > The > > > larger of timer and window state, the poor performance of RocksDB. > > However, > > > switching to use FsStateBackend would always make me feel fear about > the > > > OOM errors. > > > > > > Look forward for more powerful enrichment to state-backend, and help > > Flink > > > to achieve better performance together. > > > > > > Best > > > Yun Tang > > > > > > From: Stephan Ewen > > > Sent: Thursday, August 15, 2019 23:07 > > > To: dev > > > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > > > > > +1 for this feature. I think this will be appreciated by users, as a > way > > to > > > use the HeapStateBackend with a safety-net against OOM errors. > > > And having had major production exposure is great. > > > > > > From the implementation plan, it looks like this exists purely in a new > > > module and does not require any changes in other parts of Flink's code. > > Can > > > you confirm that? > > > > > > Other that that, I have no further questions and we could proceed to > vote > > > on this FLIP, from my side. > > > > > > Best, > > > Stephan > > > > > > > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > > > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > > > > > Thanks! > > > > > > > > Best Regards, > > > > Yu > > > > > > > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > > > > > Hi All, > > > > > > > > > > We ever held a discussion about this feature before [1] but now > > opening > > > > > another thread because after a second thought introducing a new > > backend > > > > > instead of modifying the existing heap backend is a better option > to > > > > > prevent causing any regression or surprise to existing > in-production > > > > usage. 
> > > > > And since introducing a new backend is relatively big change, we > > regard > > > > it > > > > > as a FLIP and need another discussion and voting process according > to > > > our > > > > > newly drafted bylaw [2]. > > > > > > > > > > Please allow me to quote the brief description from the old thread > > [1] > > > > for > > > > > the convenience of those who noticed this feature for the first > time: > > > > > > > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in > Flink, > > > > > since state lives as Java objects on the heap in > > HeapKeyedStateBackend > > > > and > > > > > the de/serialization only happens during state snapshot and > restore, > > it > > > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > > > memory.**However, > > > > > along with the advantage, HeapKeyedStateBackend also has its > > > > shortcomings, > > > > > and the most painful one is the difficulty to estimate the maximum > > heap > > > > > size (Xmx) to set, and we will suffer from GC impact once the heap > > > memory > > > > > is not enough to hold all state data. There’re several (inevitable) > > > > causes > > > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of > > the > > > > > serialized data size).* Data flood caused by burst traffic.* Data > > > > > accumulation caused by source malfunction.**To resolve this > problem, > > we
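The spill mechanism described in the quoted proposal can be pictured with a toy sketch. This is purely illustrative and not the FLIP's actual design or API: a map that moves its least recently used entries to a serialized store once heap usage crosses a threshold; snapshotting, TTL handling, and reloading spilled data are omitted.
{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of the spill idea only, not the FLIP's actual design: when heap
// usage crosses a threshold, move the least recently used ("coldest") entries to a
// serialized store and keep the hot entries on heap.
public class SpillingMap<K, V> {

    private final LinkedHashMap<K, V> onHeap = new LinkedHashMap<>(16, 0.75f, true); // access order = hotness
    private final Map<K, byte[]> spilled = new HashMap<>(); // stand-in for an on-disk store
    private final double spillThreshold;                    // e.g. 0.8 = spill at 80% heap usage

    public SpillingMap(double spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    public void put(K key, V value) {
        onHeap.put(key, value);
        maybeSpill();
    }

    private void maybeSpill() {
        Runtime rt = Runtime.getRuntime();
        double usedFraction = (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
        if (usedFraction < spillThreshold) {
            return;
        }
        // Spill the coldest half of the on-heap entries; a LinkedHashMap in access order
        // iterates least recently used entries first.
        int toSpill = onHeap.size() / 2;
        Iterator<Map.Entry<K, V>> coldest = onHeap.entrySet().iterator();
        for (int i = 0; i < toSpill && coldest.hasNext(); i++) {
            Map.Entry<K, V> entry = coldest.next();
            spilled.put(entry.getKey(), serialize(entry.getValue()));
            coldest.remove();
        }
    }

    private byte[] serialize(V value) {
        return String.valueOf(value).getBytes(); // placeholder for a real serializer
    }
}
{code}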
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
-1 for RC2. I found a bug https://issues.apache.org/jira/browse/FLINK-13741, and I think it's a blocker. The bug means currently if users call `tEnv.listUserDefinedFunctions()` in Table API or `show functions;` thru SQL would not be able to see Flink's built-in functions. I'm preparing a fix right now. Bowen On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai wrote: > Thanks for all the test efforts, verifications and votes so far. > > So far, things are looking good, but we still require one more PMC binding > vote for this RC to be the official release, so I would like to extend the > vote time for 1 more day, until *Aug. 16th 17:00 CET*. > > In the meantime, the release notes for 1.9.0 had only just been finalized > [1], and could use a few more eyes before closing the vote. > Any help with checking if anything else should be mentioned there regarding > breaking changes / known shortcomings would be appreciated. > > Cheers, > Gordon > > [1] https://github.com/apache/flink/pull/9438 > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > > > Great, then I have no other comments on legal check. > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > > wrote: > > > > > The licensing items aren't a problem; we don't care about Flink modules > > > in NOTICE files, and we don't have to update the source-release > > > licensing since we don't have a pre-built version of the WebUI in the > > > source. > > > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > > After going through the licenses, I found 2 suspicions but not sure > if > > > they > > > > are > > > > valid or not. > > > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but > not > > > > included in > > > > NOTICE-binary file (the one under the root directory) like other > > modules. > > > > 2. flink-runtime-web distributed some JavaScript dependencies through > > > source > > > > codes, the licenses and NOTICE file were only updated inside the > module > > > of > > > > flink-runtime-web, but not the NOTICE file and licenses directory > which > > > > under > > > > the root directory. > > > > > > > > Another minor issue I just found is: > > > > FLINK-13558 tries to include table examples to flink-dist, but I > cannot > > > > find it in > > > > the binary distribution of RC2. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > > > > > >> Hi Gordon & Timo, > > > >> > > > >> Thanks for the feedback, and I agree with it. I will document this > in > > > the > > > >> release notes. > > > >> > > > >> Best, > > > >> Kurt > > > >> > > > >> > > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > > tzuli...@apache.org> > > > >> wrote: > > > >> > > > >>> Hi Kurt, > > > >>> > > > >>> With the same argument as before, given that it is mentioned in the > > > >>> release > > > >>> announcement that it is a preview feature, I would not block this > > > release > > > >>> because of it. > > > >>> Nevertheless, it would be important to mention this explicitly in > the > > > >>> release notes [1]. > > > >>> > > > >>> Regards, > > > >>> Gordon > > > >>> > > > >>> [1] https://github.com/apache/flink/pull/9438 > > > >>> > > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > > > wrote: > > > >>> > > > Hi Kurt, > > > > > > I agree that this is a serious bug. However, I would not block the > > > release because of this. As you said, there is a workaround and > the > > > `execute()` works in the most common case of a single execution. 
> We > > > can > > > fix this in a minor release shortly after. > > > > > > What do others think? > > > > > > Regards, > > > Timo > > > > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > > HI, > > > > > > > > We just find a serious bug around blink planner: > > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > > When user reused the table environment instance, and call > `execute` > > > method > > > > multiple times for > > > > different sql, the later call will trigger the earlier ones to be > > > > re-executed. > > > > > > > > It's a serious bug but seems we also have a work around, which is > > > >>> never > > > > reuse the table environment > > > > object. I'm not sure if we should treat this one as blocker issue > > of > > > 1.9.0. > > > > What's your opinion? > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao > > wrote: > > > > > > > >> +1 (non-binding) > > > >> > > > >> Jepsen test suite passed 10 times consecutively > > > >> > > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > > > >>> aljos...@apache.org> > > > >> wrote: > > > >> > > > >>> +1 > > > >>> > > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives > >
[jira] [Created] (FLINK-13741) FunctionCatalog.getUserDefinedFunctions() does not return Flink built-in functions' names
Bowen Li created FLINK-13741: Summary: FunctionCatalog.getUserDefinedFunctions() does not return Flink built-in functions' names Key: FLINK-13741 URL: https://issues.apache.org/jira/browse/FLINK-13741 Project: Flink Issue Type: Bug Components: Table SQL / API Affects Versions: 1.9.0 Reporter: Bowen Li Assignee: Bowen Li Fix For: 1.9.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
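For illustration only (the names below are assumptions, not the actual FunctionCatalog API): the listing users expect from `show functions;` would merge registered UDF names with the built-in function names.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only, not the actual FunctionCatalog code: a listing that merges
// registered UDF names with built-in function names, which is what `show functions;`
// is expected to return.
public class FunctionListing {

    private final Set<String> registeredUdfs = new HashSet<>();

    public void registerUdf(String name) {
        registeredUdfs.add(name.toLowerCase());
    }

    /** Returns registered UDF names plus built-in names, de-duplicated and sorted. */
    public List<String> listFunctions(Set<String> builtInFunctionNames) {
        return Stream.concat(registeredUdfs.stream(), builtInFunctionNames.stream())
            .map(String::toLowerCase)
            .distinct()
            .sorted()
            .collect(Collectors.toList());
    }
}
{code}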
[jira] [Created] (FLINK-13740) TableAggregateITCase.testNonkeyedFlatAggregate failed on Travis
Till Rohrmann created FLINK-13740: - Summary: TableAggregateITCase.testNonkeyedFlatAggregate failed on Travis Key: FLINK-13740 URL: https://issues.apache.org/jira/browse/FLINK-13740 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 The {{TableAggregateITCase.testNonkeyedFlatAggregate}} failed on Travis with {code} org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.table.planner.runtime.stream.table.TableAggregateITCase.testNonkeyedFlatAggregate(TableAggregateITCase.scala:93) Caused by: java.lang.Exception: Artificial Failure {code} https://api.travis-ci.com/v3/job/225551182/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Thanks all for the reviews and comments! bq. From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Confirmed, thanks! Best Regards, Yu On Thu, 15 Aug 2019 at 18:04, Tzu-Li (Gordon) Tai wrote: > +1 to start a VOTE for this FLIP. > > Given the properties of this new state backend and that it will exist as a > new module without touching the original heap backend, I don't see a harm > in including this. > Regarding design of the feature, I've already mentioned my comments in the > original discussion thread. > > Cheers, > Gordon > > On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > > > Big +1 for this feature. > > > > Our customers including me, have ever met dilemma where we have to use > > window to aggregate events in applications like real-time monitoring. The > > larger of timer and window state, the poor performance of RocksDB. > However, > > switching to use FsStateBackend would always make me feel fear about the > > OOM errors. > > > > Look forward for more powerful enrichment to state-backend, and help > Flink > > to achieve better performance together. > > > > Best > > Yun Tang > > > > From: Stephan Ewen > > Sent: Thursday, August 15, 2019 23:07 > > To: dev > > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > > > +1 for this feature. I think this will be appreciated by users, as a way > to > > use the HeapStateBackend with a safety-net against OOM errors. > > And having had major production exposure is great. > > > > From the implementation plan, it looks like this exists purely in a new > > module and does not require any changes in other parts of Flink's code. > Can > > you confirm that? > > > > Other that that, I have no further questions and we could proceed to vote > > on this FLIP, from my side. > > > > Best, > > Stephan > > > > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > > > Thanks! > > > > > > Best Regards, > > > Yu > > > > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > > > Hi All, > > > > > > > > We ever held a discussion about this feature before [1] but now > opening > > > > another thread because after a second thought introducing a new > backend > > > > instead of modifying the existing heap backend is a better option to > > > > prevent causing any regression or surprise to existing in-production > > > usage. > > > > And since introducing a new backend is relatively big change, we > regard > > > it > > > > as a FLIP and need another discussion and voting process according to > > our > > > > newly drafted bylaw [2]. 
> > > > > > > > Please allow me to quote the brief description from the old thread > [1] > > > for > > > > the convenience of those who noticed this feature for the first time: > > > > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > > > since state lives as Java objects on the heap in > HeapKeyedStateBackend > > > and > > > > the de/serialization only happens during state snapshot and restore, > it > > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > > memory.**However, > > > > along with the advantage, HeapKeyedStateBackend also has its > > > shortcomings, > > > > and the most painful one is the difficulty to estimate the maximum > heap > > > > size (Xmx) to set, and we will suffer from GC impact once the heap > > memory > > > > is not enough to hold all state data. There’re several (inevitable) > > > causes > > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of > the > > > > serialized data size).* Data flood caused by burst traffic.* Data > > > > accumulation caused by source malfunction.**To resolve this problem, > we > > > > proposed a solution to support spilling state data to disk before > heap > > > > memory is exhausted. We will monitor the heap usage and choose the > > > coldest > > > > data to spill, and reload them when heap memory is regained after > data > > > > removing or TTL expiration, automatically. Furthermore, *to prevent > > > > causing unexpected regression to existing usage of > > HeapKeyedStateBackend, > > > > we plan to introduce a new SpillableHeapKeyedStateBackend and change > it > > > to > > > > default in future if proven to be stable. > > > > > > > > Please let us know your point of the feature and any comment is > > > > welcomed/appreciated. Thanks. > > > > > > > > [1] https://s.apache.org/pxeif > > > > [2] > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > > > > > Best Regards, > > > > Yu > > > > > > > > > >
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
+1 to start a VOTE for this FLIP. Given the properties of this new state backend and that it will exist as a new module without touching the original heap backend, I don't see a harm in including this. Regarding design of the feature, I've already mentioned my comments in the original discussion thread. Cheers, Gordon On Thu, Aug 15, 2019 at 5:53 PM Yun Tang wrote: > Big +1 for this feature. > > Our customers including me, have ever met dilemma where we have to use > window to aggregate events in applications like real-time monitoring. The > larger of timer and window state, the poor performance of RocksDB. However, > switching to use FsStateBackend would always make me feel fear about the > OOM errors. > > Look forward for more powerful enrichment to state-backend, and help Flink > to achieve better performance together. > > Best > Yun Tang > > From: Stephan Ewen > Sent: Thursday, August 15, 2019 23:07 > To: dev > Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend > > +1 for this feature. I think this will be appreciated by users, as a way to > use the HeapStateBackend with a safety-net against OOM errors. > And having had major production exposure is great. > > From the implementation plan, it looks like this exists purely in a new > module and does not require any changes in other parts of Flink's code. Can > you confirm that? > > Other that that, I have no further questions and we could proceed to vote > on this FLIP, from my side. > > Best, > Stephan > > > On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > > > Sorry for forgetting to give the link of the FLIP, here it is: > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > > > Thanks! > > > > Best Regards, > > Yu > > > > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > > > Hi All, > > > > > > We ever held a discussion about this feature before [1] but now opening > > > another thread because after a second thought introducing a new backend > > > instead of modifying the existing heap backend is a better option to > > > prevent causing any regression or surprise to existing in-production > > usage. > > > And since introducing a new backend is relatively big change, we regard > > it > > > as a FLIP and need another discussion and voting process according to > our > > > newly drafted bylaw [2]. > > > > > > Please allow me to quote the brief description from the old thread [1] > > for > > > the convenience of those who noticed this feature for the first time: > > > > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > > since state lives as Java objects on the heap in HeapKeyedStateBackend > > and > > > the de/serialization only happens during state snapshot and restore, it > > > outperforms RocksDBKeyeStateBackend when all data could reside in > > memory.**However, > > > along with the advantage, HeapKeyedStateBackend also has its > > shortcomings, > > > and the most painful one is the difficulty to estimate the maximum heap > > > size (Xmx) to set, and we will suffer from GC impact once the heap > memory > > > is not enough to hold all state data. 
There’re several (inevitable) > > causes > > > for such scenario, including (but not limited to):* > > > > > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > > serialized data size).* Data flood caused by burst traffic.* Data > > > accumulation caused by source malfunction.**To resolve this problem, we > > > proposed a solution to support spilling state data to disk before heap > > > memory is exhausted. We will monitor the heap usage and choose the > > coldest > > > data to spill, and reload them when heap memory is regained after data > > > removing or TTL expiration, automatically. Furthermore, *to prevent > > > causing unexpected regression to existing usage of > HeapKeyedStateBackend, > > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > > to > > > default in future if proven to be stable. > > > > > > Please let us know your point of the feature and any comment is > > > welcomed/appreciated. Thanks. > > > > > > [1] https://s.apache.org/pxeif > > > [2] > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > > > Best Regards, > > > Yu > > > > > >
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Thanks for all the test efforts, verifications and votes so far. So far, things are looking good, but we still require one more PMC binding vote for this RC to be the official release, so I would like to extend the vote time for 1 more day, until *Aug. 16th 17:00 CET*. In the meantime, the release notes for 1.9.0 had only just been finalized [1], and could use a few more eyes before closing the vote. Any help with checking if anything else should be mentioned there regarding breaking changes / known shortcomings would be appreciated. Cheers, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 3:58 PM Kurt Young wrote: > Great, then I have no other comments on legal check. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler > wrote: > > > The licensing items aren't a problem; we don't care about Flink modules > > in NOTICE files, and we don't have to update the source-release > > licensing since we don't have a pre-built version of the WebUI in the > > source. > > > > On 15/08/2019 15:22, Kurt Young wrote: > > > After going through the licenses, I found 2 suspicions but not sure if > > they > > > are > > > valid or not. > > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > > > included in > > > NOTICE-binary file (the one under the root directory) like other > modules. > > > 2. flink-runtime-web distributed some JavaScript dependencies through > > source > > > codes, the licenses and NOTICE file were only updated inside the module > > of > > > flink-runtime-web, but not the NOTICE file and licenses directory which > > > under > > > the root directory. > > > > > > Another minor issue I just found is: > > > FLINK-13558 tries to include table examples to flink-dist, but I cannot > > > find it in > > > the binary distribution of RC2. > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > > > >> Hi Gordon & Timo, > > >> > > >> Thanks for the feedback, and I agree with it. I will document this in > > the > > >> release notes. > > >> > > >> Best, > > >> Kurt > > >> > > >> > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > > tzuli...@apache.org> > > >> wrote: > > >> > > >>> Hi Kurt, > > >>> > > >>> With the same argument as before, given that it is mentioned in the > > >>> release > > >>> announcement that it is a preview feature, I would not block this > > release > > >>> because of it. > > >>> Nevertheless, it would be important to mention this explicitly in the > > >>> release notes [1]. > > >>> > > >>> Regards, > > >>> Gordon > > >>> > > >>> [1] https://github.com/apache/flink/pull/9438 > > >>> > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > > wrote: > > >>> > > Hi Kurt, > > > > I agree that this is a serious bug. However, I would not block the > > release because of this. As you said, there is a workaround and the > > `execute()` works in the most common case of a single execution. We > > can > > fix this in a minor release shortly after. > > > > What do others think? > > > > Regards, > > Timo > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > HI, > > > > > > We just find a serious bug around blink planner: > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > When user reused the table environment instance, and call `execute` > > method > > > multiple times for > > > different sql, the later call will trigger the earlier ones to be > > > re-executed. 
> > > > > > It's a serious bug but seems we also have a work around, which is > > >>> never > > > reuse the table environment > > > object. I'm not sure if we should treat this one as blocker issue > of > > 1.9.0. > > > What's your opinion? > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao > wrote: > > > > > >> +1 (non-binding) > > >> > > >> Jepsen test suite passed 10 times consecutively > > >> > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > > >>> aljos...@apache.org> > > >> wrote: > > >> > > >>> +1 > > >>> > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives > you > > >>> a > > >>> managed YARN and Google Cloud Storage (GCS)): > > >>> - tried both YARN session mode and YARN per-job mode, also > > using > > >>> bin/flink list/cancel/etc. against a YARN session cluster > > >>> - ran examples that write to GCS, both with the native Hadoop > > >> FileSystem > > >>> and a custom “plugin” FileSystem > > >>> - ran stateful streaming jobs that use GCS as a checkpoint > > >>> backend > > >>> - tried running SQL programs on YARN using the SQL Cli: this > > >>> worked > > for > > >>> YARN session mode but not for YARN per-job mode. Looking at the > > >>> code I > > >>> don’t thi
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
Big +1 for this feature. Our customers including me, have ever met dilemma where we have to use window to aggregate events in applications like real-time monitoring. The larger of timer and window state, the poor performance of RocksDB. However, switching to use FsStateBackend would always make me feel fear about the OOM errors. Look forward for more powerful enrichment to state-backend, and help Flink to achieve better performance together. Best Yun Tang From: Stephan Ewen Sent: Thursday, August 15, 2019 23:07 To: dev Subject: Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend +1 for this feature. I think this will be appreciated by users, as a way to use the HeapStateBackend with a safety-net against OOM errors. And having had major production exposure is great. >From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Other that that, I have no further questions and we could proceed to vote on this FLIP, from my side. Best, Stephan On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > Sorry for forgetting to give the link of the FLIP, here it is: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > Thanks! > > Best Regards, > Yu > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > Hi All, > > > > We ever held a discussion about this feature before [1] but now opening > > another thread because after a second thought introducing a new backend > > instead of modifying the existing heap backend is a better option to > > prevent causing any regression or surprise to existing in-production > usage. > > And since introducing a new backend is relatively big change, we regard > it > > as a FLIP and need another discussion and voting process according to our > > newly drafted bylaw [2]. > > > > Please allow me to quote the brief description from the old thread [1] > for > > the convenience of those who noticed this feature for the first time: > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > since state lives as Java objects on the heap in HeapKeyedStateBackend > and > > the de/serialization only happens during state snapshot and restore, it > > outperforms RocksDBKeyeStateBackend when all data could reside in > memory.**However, > > along with the advantage, HeapKeyedStateBackend also has its > shortcomings, > > and the most painful one is the difficulty to estimate the maximum heap > > size (Xmx) to set, and we will suffer from GC impact once the heap memory > > is not enough to hold all state data. There’re several (inevitable) > causes > > for such scenario, including (but not limited to):* > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > serialized data size).* Data flood caused by burst traffic.* Data > > accumulation caused by source malfunction.**To resolve this problem, we > > proposed a solution to support spilling state data to disk before heap > > memory is exhausted. We will monitor the heap usage and choose the > coldest > > data to spill, and reload them when heap memory is regained after data > > removing or TTL expiration, automatically. Furthermore, *to prevent > > causing unexpected regression to existing usage of HeapKeyedStateBackend, > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > to > > default in future if proven to be stable. 
> > > > Please let us know your point of the feature and any comment is > > welcomed/appreciated. Thanks. > > > > [1] https://s.apache.org/pxeif > > [2] > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > Best Regards, > > Yu > > >
Re: [DISCUSS] FLIP-48: Pluggable Intermediate Result Storage
Sorry for the late response. So many FLIPs these days. I am a bit unsure about the motivation here, and that this need to be a part of Flink. It sounds like this can be perfectly built around Flink as a minimal library on top of it, without any change in the core APIs or runtime. The proposal to handle "caching intermediate results" (to make them reusable across jobs in a session), and "writing them in different formats / indexing them" doesn't sound like it should be the same mechanism. - The caching part is a transparent low-level primitive. It avoid re-executing a part of the job graph, but otherwise is completely transparent to the consumer job. - Writing data out in a sink, compressing/indexing it and then reading it in another job is also a way of reusing a previous result, but on a completely different abstraction level. It is not the same intermediate result any more. When the consumer reads from it and applies predicate pushdown, etc. then the consumer job looks completely different from a job that consumed the original result. It hence needs to be solved on the API level via a sink and a source. I would suggest to keep these concepts separate: Caching (possibly automatically) for jobs in a session, and long term writing/sharing of data sets. Solving the "long term writing/sharing" in a library rather than in the runtime also has the advantage of not pushing yet more stuff into Flink's core, which I believe is also an important criterion. Best, Stephan On Thu, Jul 25, 2019 at 4:53 AM Xuannan Su wrote: > Hi folks, > > I would like to start the FLIP discussion thread about the pluggable > intermediate result storage. > > This is phase 2 of FLIP-36: Support Interactive Programming in Flink Skip > to end of metadata. While the FLIP-36 provides a default implementation of > the intermediate result storage using the shuffle service, we would like to > make the intermediate result storage pluggable so that the user can easily > swap the storage. > > We are looking forward to your thought! > > The FLIP link is the following: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-48%3A+Pluggable+Intermediate+Result+Storage > < > https://cwiki.apache.org/confluence/display/FLINK/FLIP-48:+Pluggable+Intermediate+Result+Storage > > > > Best, > Xuannan >
[jira] [Created] (FLINK-13739) BinaryRowTest.testWriteString() fails in some environments
Robert Metzger created FLINK-13739: -- Summary: BinaryRowTest.testWriteString() fails in some environments Key: FLINK-13739 URL: https://issues.apache.org/jira/browse/FLINK-13739 Project: Flink Issue Type: Task Components: Table SQL / Runtime Affects Versions: 1.9.0 Environment: Reporter: Robert Metzger
{code:java}
Test set: org.apache.flink.table.dataformat.BinaryRowTest
---
Tests run: 26, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.328 s <<< FAILURE! - in org.apache.flink.table.dataformat.BinaryRowTest
testWriteString(org.apache.flink.table.dataformat.BinaryRowTest) Time elapsed: 0.05 s <<< FAILURE!
org.junit.ComparisonFailure: expected:<[<95><95><95><95><95><88><91><98> <90><9A><84><89><88><8C>]> but was:<[?]>
    at org.apache.flink.table.dataformat.BinaryRowTest.testWriteString(BinaryRowTest.java:189)
{code}
This error happens on a Google Cloud n2-standard-16 (16 vCPUs, 64 GB memory) machine.
{code}
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.9 (stretch)
Release:        9.9
Codename:       stretch
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
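A plausible, but unconfirmed, mechanism for the `?` in the assertion above is platform default charset handling. The snippet below only demonstrates that general Java behavior and makes no claim about BinaryRow's actual code path; the sample string is hypothetical.
{code:java}
import java.nio.charset.StandardCharsets;

// Demonstrates a general Java pitfall that matches the symptom above (not a confirmed
// root cause): encoding a non-ASCII string with the platform default charset
// (e.g. US-ASCII on a minimal image) replaces unmappable characters with '?', while an
// explicit UTF-8 round trip preserves them. The sample string is hypothetical.
public class DefaultCharsetDemo {

    public static void main(String[] args) {
        String s = "啊啊啊啊";

        byte[] platformDefault = s.getBytes();             // depends on file.encoding / locale
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // always well-defined

        System.out.println(new String(platformDefault));               // may print "????" on ASCII locales
        System.out.println(new String(utf8, StandardCharsets.UTF_8));  // prints the original string
    }
}
{code}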
Re: [DISCUSS] FLIP-50: Spill-able Heap Keyed State Backend
+1 for this feature. I think this will be appreciated by users, as a way to use the HeapStateBackend with a safety-net against OOM errors. And having had major production exposure is great. >From the implementation plan, it looks like this exists purely in a new module and does not require any changes in other parts of Flink's code. Can you confirm that? Other that that, I have no further questions and we could proceed to vote on this FLIP, from my side. Best, Stephan On Tue, Aug 13, 2019 at 10:00 PM Yu Li wrote: > Sorry for forgetting to give the link of the FLIP, here it is: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-50%3A+Spill-able+Heap+Keyed+State+Backend > > Thanks! > > Best Regards, > Yu > > > On Tue, 13 Aug 2019 at 18:06, Yu Li wrote: > > > Hi All, > > > > We ever held a discussion about this feature before [1] but now opening > > another thread because after a second thought introducing a new backend > > instead of modifying the existing heap backend is a better option to > > prevent causing any regression or surprise to existing in-production > usage. > > And since introducing a new backend is relatively big change, we regard > it > > as a FLIP and need another discussion and voting process according to our > > newly drafted bylaw [2]. > > > > Please allow me to quote the brief description from the old thread [1] > for > > the convenience of those who noticed this feature for the first time: > > > > > > *HeapKeyedStateBackend is one of the two KeyedStateBackends in Flink, > > since state lives as Java objects on the heap in HeapKeyedStateBackend > and > > the de/serialization only happens during state snapshot and restore, it > > outperforms RocksDBKeyeStateBackend when all data could reside in > memory.**However, > > along with the advantage, HeapKeyedStateBackend also has its > shortcomings, > > and the most painful one is the difficulty to estimate the maximum heap > > size (Xmx) to set, and we will suffer from GC impact once the heap memory > > is not enough to hold all state data. There’re several (inevitable) > causes > > for such scenario, including (but not limited to):* > > > > > > > > ** Memory overhead of Java object representation (tens of times of the > > serialized data size).* Data flood caused by burst traffic.* Data > > accumulation caused by source malfunction.**To resolve this problem, we > > proposed a solution to support spilling state data to disk before heap > > memory is exhausted. We will monitor the heap usage and choose the > coldest > > data to spill, and reload them when heap memory is regained after data > > removing or TTL expiration, automatically. Furthermore, *to prevent > > causing unexpected regression to existing usage of HeapKeyedStateBackend, > > we plan to introduce a new SpillableHeapKeyedStateBackend and change it > to > > default in future if proven to be stable. > > > > Please let us know your point of the feature and any comment is > > welcomed/appreciated. Thanks. > > > > [1] https://s.apache.org/pxeif > > [2] > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026 > > > > Best Regards, > > Yu > > >
[jira] [Created] (FLINK-13738) NegativeArraySizeException in LongHybridHashTable
Robert Metzger created FLINK-13738: -- Summary: NegativeArraySizeException in LongHybridHashTable Key: FLINK-13738 URL: https://issues.apache.org/jira/browse/FLINK-13738 Project: Flink Issue Type: Task Components: Table SQL / Runtime Affects Versions: 1.9.0 Reporter: Robert Metzger Executing this (meaningless) query: {code:java} INSERT INTO sinkTable ( SELECT CONCAT( CAST( id AS VARCHAR), CAST( COUNT(*) AS VARCHAR)) as something, 'const' FROM CsvTable, table1 WHERE sometxt LIKE 'a%' AND id = key GROUP BY id ) {code} leads to the following exception: {code:java} Caused by: java.lang.NegativeArraySizeException at org.apache.flink.table.runtime.hashtable.LongHybridHashTable.tryDenseMode(LongHybridHashTable.java:216) at org.apache.flink.table.runtime.hashtable.LongHybridHashTable.endBuild(LongHybridHashTable.java:105) at LongHashJoinOperator$36.endInput1$(Unknown Source) at LongHashJoinOperator$36.endInput(Unknown Source) at org.apache.flink.streaming.runtime.tasks.OperatorChain.endInput(OperatorChain.java:256) at org.apache.flink.streaming.runtime.io.StreamTwoInputSelectableProcessor.checkFinished(StreamTwoInputSelectableProcessor.java:359) at org.apache.flink.streaming.runtime.io.StreamTwoInputSelectableProcessor.processInput(StreamTwoInputSelectableProcessor.java:193) at org.apache.flink.streaming.runtime.tasks.StreamTask.performDefaultAction(StreamTask.java:276) at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:298) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:403) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:687) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:517) at java.lang.Thread.run(Thread.java:748){code} This is the plan: {code:java} == Abstract Syntax Tree == LogicalSink(name=[sinkTable], fields=[f0, f1]) +- LogicalProject(something=[CONCAT(CAST($0):VARCHAR(2147483647) CHARACTER SET "UTF-16LE", CAST($1):VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL)], EXPR$1=[_UTF-16LE'const']) +- LogicalAggregate(group=[ {0} ], agg#0=[COUNT()]) +- LogicalProject(id=[$1]) +- LogicalFilter(condition=[AND(LIKE($0, _UTF-16LE'a%'), =($1, CAST($2):BIGINT))]) +- LogicalJoin(condition=[true], joinType=[inner]) :- LogicalTableScan(table=[[default_catalog, default_database, CsvTable, source: [CsvTableSource(read fields: sometxt, id)]]]) +- LogicalTableScan(table=[[default_catalog, default_database, table1, source: [GeneratorTableSource(key, rowtime, payload)]]]) == Optimized Logical Plan == Sink(name=[sinkTable], fields=[f0, f1]): rowcount = 1498810.6659336376, cumulative cost = {4.459964319978008E8 rows, 1.879799762133187E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- Calc(select=[CONCAT(CAST(id), CAST($f1)) AS something, _UTF-16LE'const' AS EXPR$1]): rowcount = 1498810.6659336376, cumulative cost = {4.444976213318672E8 rows, 1.8796498810665936E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- HashAggregate(isMerge=[false], groupBy=[id], select=[id, COUNT(*) AS $f1]): rowcount = 1498810.6659336376, cumulative cost = {4.429988106659336E8 rows, 1.8795E10 cpu, 4.8E9 io, 8.4E8 network, 1.799524266373455E8 memory} +- Calc(select=[id]): rowcount = 1.575E7, cumulative cost = {4.415E8 rows, 1.848E10 cpu, 4.8E9 io, 8.4E8 network, 1.2E8 memory} +- HashJoin(joinType=[InnerJoin], where=[=(id, key0)], select=[id, key0], build=[left]): rowcount = 1.575E7, cumulative cost = {4.2575E8 rows, 1.848E10 cpu, 4.8E9 io, 8.4E8 network, 1.2E8 memory} :- Exchange(distribution=[hash[id]]): rowcount = 500.0, 
cumulative cost = {1.1E8 rows, 8.4E8 cpu, 2.0E9 io, 4.0E7 network, 0.0 memory} : +- Calc(select=[id], where=[LIKE(sometxt, _UTF-16LE'a%')]): rowcount = 500.0, cumulative cost = {1.05E8 rows, 0.0 cpu, 2.0E9 io, 0.0 network, 0.0 memory} : +- TableSourceScan(table=[[default_catalog, default_database, CsvTable, source: [CsvTableSource(read fields: sometxt, id)]]], fields=[sometxt, id]): rowcount = 1.0E8, cumulative cost = {1.0E8 rows, 0.0 cpu, 2.0E9 io, 0.0 network, 0.0 memory} +- Exchange(distribution=[hash[key0]]): rowcount = 1.0E8, cumulative cost = {3.0E8 rows, 1.68E10 cpu, 2.8E9 io, 8.0E8 network, 0.0 memory} +- Calc(select=[CAST(key) AS key0]): rowcount = 1.0E8, cumulative cost = {2.0E8 rows, 0.0 cpu, 2.8E9 io, 0.0 network, 0.0 memory} +- TableSourceScan(table=[[default_catalog, default_database, table1, source: [GeneratorTableSource(key, rowtime, payload)]]], fields=[key, rowtime, payload]): rowcount = 1.0E8, cumulative cost = {1.0E8 rows, 0.0 cpu, 2.8E9 io, 0.0 network, 0.0 memory} == Physical Execution Plan == Stage 1 : Data Source content : collect elements with CollectionInputFormat Stage 2 : Operator content : CsvTableSource(read fields: sometxt, id) ship_strategy : REBALANCE Stage 3 : Operator content : SourceConversi
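The report above does not identify the root cause. One plausible failure mode, offered here only as a hypothesis and not taken from the issue, is that the dense-mode lookup array is sized directly from the build-side key range and that computation overflows int for large ranges. A minimal, self-contained sketch of that effect (not Flink code):
{code:java}
// Hypothetical illustration: sizing a dense lookup array directly from the
// build-side key range goes negative once the range no longer fits in an int.
public class DenseModeSizingSketch {

    // Suppose the build side produced long keys in [minKey, maxKey].
    static int denseArraySize(long minKey, long maxKey) {
        long range = maxKey - minKey + 1;
        // Casting a range larger than Integer.MAX_VALUE truncates and can yield
        // a negative value, which then fails with NegativeArraySizeException
        // as soon as it is used in an allocation such as `new long[size]`.
        return (int) range;
    }

    public static void main(String[] args) {
        int size = denseArraySize(0L, 3_000_000_000L);
        System.out.println(size); // negative

        // A guarded version would check the range as a long first and fall back
        // to the regular hash mode instead of dense mode.
        long range = 3_000_000_000L - 0L + 1;
        System.out.println("dense mode applicable: "
                + (range > 0 && range <= Integer.MAX_VALUE)); // false
    }
}
{code}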
[DISCUSS] FLIP-53: Fine Grained Resource Management
Hi everyone, We would like to start a discussion thread on "FLIP-53: Fine Grained Resource Management"[1], where we propose how to improve Flink resource management and scheduling. This FLIP mainly discusses the following issues. - How to support tasks with fine grained resource requirements. - How to unify resource management for jobs with / without fine grained resource requirements. - How to unify resource management for streaming / batch jobs. Key changes proposed in the FLIP are as follows. - Unify memory management for operators with / without fine grained resource requirements by applying a fraction based quota mechanism. - Unify resource scheduling for streaming and batch jobs by setting slot sharing groups for pipelined regions during the compilation stage. - Dynamically allocate slots from task executors' available resources. Please find more details in the FLIP wiki document [1]. Looking forward to your feedback. Thank you~ Xintong Song [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Resource+Management
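The fraction based quota mechanism mentioned above can be illustrated with a small sketch. The class, operator names and numbers below are assumptions made purely for illustration and are not the FLIP's actual API; the sketch only shows how relative operator weights could be turned into absolute managed-memory budgets within one slot.
{code:java}
// Illustrative sketch only (names and numbers are assumptions, not the FLIP's
// actual classes): turning relative operator weights into absolute
// managed-memory budgets within a single slot.
import java.util.LinkedHashMap;
import java.util.Map;

public class FractionQuotaSketch {

    // Each operator in a slot declares a relative weight; operators without a
    // fine grained requirement could share a default weight.
    static Map<String, Long> splitManagedMemory(long slotManagedBytes, Map<String, Double> weights) {
        double total = weights.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Long> budgets = new LinkedHashMap<>();
        weights.forEach((op, w) -> budgets.put(op, (long) (slotManagedBytes * (w / total))));
        return budgets;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("HashJoin", 2.0);
        weights.put("HashAggregate", 1.0);
        weights.put("Calc", 0.0); // stateless operator, no managed memory
        System.out.println(splitManagedMemory(512L << 20, weights));
        // e.g. {HashJoin=357913941, HashAggregate=178956970, Calc=0}
    }
}
{code}
With such a scheme, operators that declare no fine grained requirement could simply be assigned a default weight, which is one way to keep jobs with and without explicit resource requirements under the same mechanism.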
Re: [VOTE] FLIP-51: Rework of the Expression Design
+1 for this. Thanks, Timo On 15.08.19 at 15:57, JingsongLee wrote: Hi Flink devs, I would like to start the voting for FLIP-51 Rework of the Expression Design. FLIP wiki: https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html Google Doc: https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing Thanks, Best, Jingsong Lee
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Great, then I have no other comments on legal check. Best, Kurt On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler wrote: > The licensing items aren't a problem; we don't care about Flink modules > in NOTICE files, and we don't have to update the source-release > licensing since we don't have a pre-built version of the WebUI in the > source. > > On 15/08/2019 15:22, Kurt Young wrote: > > After going through the licenses, I found 2 suspicions but not sure if > they > > are > > valid or not. > > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > > included in > > NOTICE-binary file (the one under the root directory) like other modules. > > 2. flink-runtime-web distributed some JavaScript dependencies through > source > > codes, the licenses and NOTICE file were only updated inside the module > of > > flink-runtime-web, but not the NOTICE file and licenses directory which > > under > > the root directory. > > > > Another minor issue I just found is: > > FLINK-13558 tries to include table examples to flink-dist, but I cannot > > find it in > > the binary distribution of RC2. > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > > > >> Hi Gordon & Timo, > >> > >> Thanks for the feedback, and I agree with it. I will document this in > the > >> release notes. > >> > >> Best, > >> Kurt > >> > >> > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai < > tzuli...@apache.org> > >> wrote: > >> > >>> Hi Kurt, > >>> > >>> With the same argument as before, given that it is mentioned in the > >>> release > >>> announcement that it is a preview feature, I would not block this > release > >>> because of it. > >>> Nevertheless, it would be important to mention this explicitly in the > >>> release notes [1]. > >>> > >>> Regards, > >>> Gordon > >>> > >>> [1] https://github.com/apache/flink/pull/9438 > >>> > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther > wrote: > >>> > Hi Kurt, > > I agree that this is a serious bug. However, I would not block the > release because of this. As you said, there is a workaround and the > `execute()` works in the most common case of a single execution. We > can > fix this in a minor release shortly after. > > What do others think? > > Regards, > Timo > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > HI, > > > > We just find a serious bug around blink planner: > > https://issues.apache.org/jira/browse/FLINK-13708 > > When user reused the table environment instance, and call `execute` > method > > multiple times for > > different sql, the later call will trigger the earlier ones to be > > re-executed. > > > > It's a serious bug but seems we also have a work around, which is > >>> never > > reuse the table environment > > object. I'm not sure if we should treat this one as blocker issue of > 1.9.0. > > What's your opinion? > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > >> +1 (non-binding) > >> > >> Jepsen test suite passed 10 times consecutively > >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < > >>> aljos...@apache.org> > >> wrote: > >> > >>> +1 > >>> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you > >>> a > >>> managed YARN and Google Cloud Storage (GCS)): > >>> - tried both YARN session mode and YARN per-job mode, also > using > >>> bin/flink list/cancel/etc. 
against a YARN session cluster > >>> - ran examples that write to GCS, both with the native Hadoop > >> FileSystem > >>> and a custom “plugin” FileSystem > >>> - ran stateful streaming jobs that use GCS as a checkpoint > >>> backend > >>> - tried running SQL programs on YARN using the SQL Cli: this > >>> worked > for > >>> YARN session mode but not for YARN per-job mode. Looking at the > >>> code I > >>> don’t think per-job mode would work from seeing how it is > >>> implemented. > >> But > >>> I think it’s an OK restriction to have for now > >>> - in all the testing I had fine-grained recovery (region > >>> failover) > >>> enabled but I didn’t simulate any failures > >>> > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > Hi, > > Thanks for preparing this release candidate. I have verified the > >>> following: > - verified the checksums and GPG files match the corresponding > >>> release > >>> files > - verified that the source archives do not contains any binaries > - build the source release with Scala 2.11 successfully. > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > >>> [FLINK-13688], > but > both are not release blockers. Other than that, all tests are > >>> passed. > - ra
[VOTE] FLIP-51: Rework of the Expression Design
Hi Flink devs, I would like to start the voting for FLIP-51 Rework of the Expression Design. FLIP wiki: https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html Google Doc: https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing Thanks, Best, Jingsong Lee
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
The licensing items aren't a problem; we don't care about Flink modules in NOTICE files, and we don't have to update the source-release licensing since we don't have a pre-built version of the WebUI in the source. On 15/08/2019 15:22, Kurt Young wrote: After going through the licenses, I found 2 suspicions but not sure if they are valid or not. 1. flink-state-processing-api is packaged in to flink-dist jar, but not included in NOTICE-binary file (the one under the root directory) like other modules. 2. flink-runtime-web distributed some JavaScript dependencies through source codes, the licenses and NOTICE file were only updated inside the module of flink-runtime-web, but not the NOTICE file and licenses directory which under the root directory. Another minor issue I just found is: FLINK-13558 tries to include table examples to flink-dist, but I cannot find it in the binary distribution of RC2. Best, Kurt On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: Hi Gordon & Timo, Thanks for the feedback, and I agree with it. I will document this in the release notes. Best, Kurt On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai wrote: Hi Kurt, With the same argument as before, given that it is mentioned in the release announcement that it is a preview feature, I would not block this release because of it. Nevertheless, it would be important to mention this explicitly in the release notes [1]. Regards, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: +1 (non-binding) Jepsen test suite passed 10 times consecutively On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < aljos...@apache.org> wrote: +1 I did some testing on a Google Cloud Dataproc cluster (it gives you a managed YARN and Google Cloud Storage (GCS)): - tried both YARN session mode and YARN per-job mode, also using bin/flink list/cancel/etc. against a YARN session cluster - ran examples that write to GCS, both with the native Hadoop FileSystem and a custom “plugin” FileSystem - ran stateful streaming jobs that use GCS as a checkpoint backend - tried running SQL programs on YARN using the SQL Cli: this worked for YARN session mode but not for YARN per-job mode. Looking at the code I don’t think per-job mode would work from seeing how it is implemented. But I think it’s an OK restriction to have for now - in all the testing I had fine-grained recovery (region failover) enabled but I didn’t simulate any failures On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. 
I have verified the following: - verified the checksums and GPG files match the corresponding release files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and [FLINK-13688], but both are not release blockers. Other than that, all tests are passed. - ran all e2e tests which don't need download external packages (it's very unstable in China and almost impossible to download them), all passed. - started local cluster, ran some examples. Met a small website display issue [FLINK-13591], which is also not a release blocker. Although we have pushed some fixes around blink planner and hive integration after RC2, but consider these are both preview features, I'm lean to be ok to release without these fixes. +1 from my side. (binding) Best, Kurt On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: Hi Gordon, I have verified the following things: - build the source release with Scala 2.12 and Scala 2.11 successfully - checked/verified signatures and hashes - checked that all POM files point to the same version - ran some flink table related end-to-end tests locally and succeeded (except TPC-H e2e failed which is reported in FLINK-13704) - started cluster for both Scala 2.11 and 2.12, ran examples, verified web ui and log output, nothing unexpected - started cluster, ran a SQL query to temporal join with kafka so
Re: [DISCUSS] Reducing build times
Hi all! Thanks for starting this discussion. I'd like to also add my 2 cents: +1 for #2, differential build scripts. I've worked on this approach, and with it I think it's possible to reduce the total build time with relatively low effort, without enforcing any new build tool and with low maintenance cost. You can check a proposed change (for the old CI setup, when Flink PRs were running in the Apache common CI pool) here: https://github.com/apache/flink/pull/9065 In the proposed change, the dependency check is not heavily hardcoded and just uses maven's results for dependency graph analysis. > This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; > i.e. a change in flink-core _must_ result in testing all modules. Agreed, in Flink's case there are some core modules that would trigger a full test run with such an approach. For developers who modify such components, the build time would be the longest. But this approach should really help developers who touch more-or-less independent modules. Even for core modules, it's possible to create "abstraction" barriers by changing the dependency graph. For example, it could look like: flink-core-api <-- flink-core, flink-core-api <-- flink-connectors. In that case, only a change in flink-core-api would trigger a full test run. +1 for #3, separating PR CI runs into different stages. Imo, it may require more changes to the current CI setup compared to #2, and ideally it should not be simplistic. Best would be if it integrates with the Flink bot and triggers some follow-up build steps only when certain prerequisites are done. +1 for #4, to move some tests into cron runs. But imo this does not scale well; it applies only to a small subset of tests. +1 for #6, to use other CI service(s). More specifically, GitHub provides free build actions that can be used to offload some build steps/PR checks. This can help move some PR checks out of the main CI build (for example: documentation builds, license checks, code formatting checks). Regards, Aleksey On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann wrote: > Thanks for starting this discussion Chesnay. I think it has become obvious > to the Flink community that with the existing build setup we cannot really > deliver fast build times which are essential for fast iteration cycles and > high developer productivity. The reasons for this situation are manifold > but it is definitely affected by Flink's project growth, not always optimal > tests and the inflexibility that everything needs to be built. Hence, I > consider the reduction of build times crucial for the project's health and > future growth. > > Without necessarily voicing a strong preference for any of the presented > suggestions, I wanted to comment on each of them: > > 1. This sounds promising. Could the reason why we don't reuse JVMs date > back to the time when we still had a lot of static fields in Flink which > made it hard to reuse JVMs and the potentially mutated global state? > > 2. Building hand-crafted solutions around a build system in order to > compensate for its limitations which other build systems support out of the > box sounds like the not invented here syndrome to me. Reinventing the wheel > has historically proven to be usually not the best solution and it often > comes with a high maintenance price tag. Moreover, it would add just > another layer of complexity around our existing build system.
I think the > current state where we have the maven setup in pom files and for Travis > multiple bash scripts specializing the builds to make it fit the time limit > is already not very transparent/easy to understand. > > 3. I could see this work but it also requires a very good understanding of > Flink of every committer because the committer needs to know which tests > would be good to run additionally. > > 4. I would be against this option solely to decrease our build time. My > observation is that the community does not monitor the health of the cron > jobs well enough. In the past the cron jobs have been unstable for as long > as a complete release cycle. Moreover, I've seen that PRs were merged which > passed Travis but broke the cron jobs. Consequently, I fear that this > option would deteriorate Flink's stability. > > 5. I would rephrase this point into changing the build system. Gradle could > be one candidate but there are also other build systems out there like > Bazel. Changing the build system would indeed be a major endeavour but I > could see the long term benefits of such a change (similar to having a > consistent and enforced code style) in particular if the build system > supports the functionality which we would otherwise build & maintain on our > own. I think there would be ways to make the transition not as disruptive > as described. For example, one could keep the Maven build and the new build > side by side until one is confident enough that the new build produces the > same output as the Maven build. Maybe it would al
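To make the differential-build idea from the message above concrete, here is a minimal sketch. It is not the script from the linked PR and the module names are only examples; it shows how, given each module's direct dependencies, one can compute the transitive set of dependents of the changed modules, i.e. the set of modules that would need to be rebuilt and retested.
{code:java}
// Sketch of the differential-build idea (example module names, not the PR's
// actual script): given module -> direct dependencies, find every module that
// transitively depends on a changed module.
import java.util.*;

public class AffectedModulesSketch {

    static Set<String> affected(Map<String, Set<String>> deps, Set<String> changed) {
        // Invert the graph: module -> modules that directly depend on it.
        Map<String, Set<String>> dependents = new HashMap<>();
        deps.forEach((module, ds) -> ds.forEach(d ->
                dependents.computeIfAbsent(d, k -> new HashSet<>()).add(module)));

        // BFS from the changed modules over the inverted edges.
        Set<String> result = new HashSet<>(changed);
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            String m = queue.poll();
            for (String dependent : dependents.getOrDefault(m, Collections.emptySet())) {
                if (result.add(dependent)) {
                    queue.add(dependent);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> deps = new HashMap<>();
        deps.put("flink-core-api", Collections.emptySet());
        deps.put("flink-core", Collections.singleton("flink-core-api"));
        deps.put("flink-connectors", Collections.singleton("flink-core-api"));
        deps.put("flink-runtime", Collections.singleton("flink-core"));

        // A change in flink-connectors only retests itself ...
        System.out.println(affected(deps, Collections.singleton("flink-connectors")));
        // ... a change in flink-core does not touch the connectors ...
        System.out.println(affected(deps, Collections.singleton("flink-core")));
        // ... while a change in flink-core-api still fans out to everything.
        System.out.println(affected(deps, Collections.singleton("flink-core-api")));
    }
}
{code}
The second example also illustrates the "abstraction barrier" point: as long as flink-connectors only depends on a thin flink-core-api, a change in flink-core itself no longer forces the connectors to be retested.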
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Thanks Kurt for checking that. The mentioned problem with table-examples is that, when working on FLINK-13558, I forgot to add dependency on flink-examples-table to flink-dist. So this module is not built if only the flink-dist with its dependencies is built (this happens in the release scripts: -pl flink-dist -am) I created FLINK-13737 to fix that. As those are only examples I wouldn't block the release on them. We might need to change the fixVersion of the mentioned FLINK-13558 not to confuse users. The proper fix we could include in 1.9.1. WDYT? Best, Dawid [1] https://issues.apache.org/jira/browse/FLINK-13737 On 15/08/2019 15:22, Kurt Young wrote: > After going through the licenses, I found 2 suspicions but not sure if they > are > valid or not. > > 1. flink-state-processing-api is packaged in to flink-dist jar, but not > included in > NOTICE-binary file (the one under the root directory) like other modules. > 2. flink-runtime-web distributed some JavaScript dependencies through source > codes, the licenses and NOTICE file were only updated inside the module of > flink-runtime-web, but not the NOTICE file and licenses directory which > under > the root directory. > > Another minor issue I just found is: > FLINK-13558 tries to include table examples to flink-dist, but I cannot > find it in > the binary distribution of RC2. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > >> Hi Gordon & Timo, >> >> Thanks for the feedback, and I agree with it. I will document this in the >> release notes. >> >> Best, >> Kurt >> >> >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai >> wrote: >> >>> Hi Kurt, >>> >>> With the same argument as before, given that it is mentioned in the >>> release >>> announcement that it is a preview feature, I would not block this release >>> because of it. >>> Nevertheless, it would be important to mention this explicitly in the >>> release notes [1]. >>> >>> Regards, >>> Gordon >>> >>> [1] https://github.com/apache/flink/pull/9438 >>> >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: >>> Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: > HI, > > We just find a serious bug around blink planner: > https://issues.apache.org/jira/browse/FLINK-13708 > When user reused the table environment instance, and call `execute` method > multiple times for > different sql, the later call will trigger the earlier ones to be > re-executed. > > It's a serious bug but seems we also have a work around, which is >>> never > reuse the table environment > object. I'm not sure if we should treat this one as blocker issue of 1.9.0. > What's your opinion? > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > >> +1 (non-binding) >> >> Jepsen test suite passed 10 times consecutively >> >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < >>> aljos...@apache.org> >> wrote: >> >>> +1 >>> >>> I did some testing on a Google Cloud Dataproc cluster (it gives you >>> a >>> managed YARN and Google Cloud Storage (GCS)): >>>- tried both YARN session mode and YARN per-job mode, also using >>> bin/flink list/cancel/etc. 
against a YARN session cluster >>>- ran examples that write to GCS, both with the native Hadoop >> FileSystem >>> and a custom “plugin” FileSystem >>>- ran stateful streaming jobs that use GCS as a checkpoint >>> backend >>>- tried running SQL programs on YARN using the SQL Cli: this >>> worked for >>> YARN session mode but not for YARN per-job mode. Looking at the >>> code I >>> don’t think per-job mode would work from seeing how it is >>> implemented. >> But >>> I think it’s an OK restriction to have for now >>>- in all the testing I had fine-grained recovery (region >>> failover) >>> enabled but I didn’t simulate any failures >>> On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. I have verified the >>> following: - verified the checksums and GPG files match the corresponding >>> release >>> files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and >>> [FLINK-13688], but both are not release blockers. Other than that, all tests are >>> passed. - ran all e2e tests which don't need download exte
Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer
Congratulations Andrey! On Thu, Aug 15, 2019 at 3:30 PM Fabian Hueske wrote: > Congrats Andrey! > > Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao : > > > Congratulations Andrey, well deserved! > > > > Best, > > Gary > > > > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > > > > > Congratulations Andrey! > > > > > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong > wrote: > > > > > >> Congratulations Andrey! > > >> > > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok > wrote: > > >> > > >> > Congratulations Andrey! > > >> > At 2019-08-14 21:26:37, "Till Rohrmann" > wrote: > > >> > >Hi everyone, > > >> > > > > >> > >I'm very happy to announce that Andrey Zagrebin accepted the offer > of > > >> the > > >> > >Flink PMC to become a committer of the Flink project. > > >> > > > > >> > >Andrey has been an active community member for more than 15 months. > > He > > >> has > > >> > >helped shaping numerous features such as State TTL, FRocksDB > release, > > >> > >Shuffle service abstraction, FLIP-1, result partition management > and > > >> > >various fixes/improvements. He's also frequently helping out on the > > >> > >user@f.a.o mailing lists. > > >> > > > > >> > >Congratulations Andrey! > > >> > > > > >> > >Best, Till > > >> > >(on behalf of the Flink PMC) > > >> > > > >> > > > > > >
[jira] [Created] (FLINK-13737) flink-dist should add provided dependency on flink-examples-table
Dawid Wysakowicz created FLINK-13737: Summary: flink-dist should add provided dependency on flink-examples-table Key: FLINK-13737 URL: https://issues.apache.org/jira/browse/FLINK-13737 Project: Flink Issue Type: Improvement Components: Examples Affects Versions: 1.9.0 Reporter: Dawid Wysakowicz Assignee: Dawid Wysakowicz Fix For: 1.10.0, 1.9.1 In FLINK-13558 we changed the `flink-dist/bin.xml` to also include flink-examples-table in the binary distribution. The flink-dist module, though, does not depend on flink-examples-table. If only the flink-dist module is built with its dependencies (this happens in the release scripts), the table examples are not built and thus not included in the distribution -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
After going through the licenses, I found 2 suspicions but not sure if they are valid or not. 1. flink-state-processing-api is packaged in to flink-dist jar, but not included in NOTICE-binary file (the one under the root directory) like other modules. 2. flink-runtime-web distributed some JavaScript dependencies through source codes, the licenses and NOTICE file were only updated inside the module of flink-runtime-web, but not the NOTICE file and licenses directory which under the root directory. Another minor issue I just found is: FLINK-13558 tries to include table examples to flink-dist, but I cannot find it in the binary distribution of RC2. Best, Kurt On Thu, Aug 15, 2019 at 6:19 PM Kurt Young wrote: > Hi Gordon & Timo, > > Thanks for the feedback, and I agree with it. I will document this in the > release notes. > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai > wrote: > >> Hi Kurt, >> >> With the same argument as before, given that it is mentioned in the >> release >> announcement that it is a preview feature, I would not block this release >> because of it. >> Nevertheless, it would be important to mention this explicitly in the >> release notes [1]. >> >> Regards, >> Gordon >> >> [1] https://github.com/apache/flink/pull/9438 >> >> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: >> >> > Hi Kurt, >> > >> > I agree that this is a serious bug. However, I would not block the >> > release because of this. As you said, there is a workaround and the >> > `execute()` works in the most common case of a single execution. We can >> > fix this in a minor release shortly after. >> > >> > What do others think? >> > >> > Regards, >> > Timo >> > >> > >> > Am 15.08.19 um 11:23 schrieb Kurt Young: >> > > HI, >> > > >> > > We just find a serious bug around blink planner: >> > > https://issues.apache.org/jira/browse/FLINK-13708 >> > > When user reused the table environment instance, and call `execute` >> > method >> > > multiple times for >> > > different sql, the later call will trigger the earlier ones to be >> > > re-executed. >> > > >> > > It's a serious bug but seems we also have a work around, which is >> never >> > > reuse the table environment >> > > object. I'm not sure if we should treat this one as blocker issue of >> > 1.9.0. >> > > >> > > What's your opinion? >> > > >> > > Best, >> > > Kurt >> > > >> > > >> > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: >> > > >> > >> +1 (non-binding) >> > >> >> > >> Jepsen test suite passed 10 times consecutively >> > >> >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek < >> aljos...@apache.org> >> > >> wrote: >> > >> >> > >>> +1 >> > >>> >> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you >> a >> > >>> managed YARN and Google Cloud Storage (GCS)): >> > >>>- tried both YARN session mode and YARN per-job mode, also using >> > >>> bin/flink list/cancel/etc. against a YARN session cluster >> > >>>- ran examples that write to GCS, both with the native Hadoop >> > >> FileSystem >> > >>> and a custom “plugin” FileSystem >> > >>>- ran stateful streaming jobs that use GCS as a checkpoint >> backend >> > >>>- tried running SQL programs on YARN using the SQL Cli: this >> worked >> > for >> > >>> YARN session mode but not for YARN per-job mode. Looking at the >> code I >> > >>> don’t think per-job mode would work from seeing how it is >> implemented. 
>> > >> But >> > >>> I think it’s an OK restriction to have for now >> > >>>- in all the testing I had fine-grained recovery (region >> failover) >> > >>> enabled but I didn’t simulate any failures >> > >>> >> > On 14. Aug 2019, at 15:20, Kurt Young wrote: >> > >> > Hi, >> > >> > Thanks for preparing this release candidate. I have verified the >> > >>> following: >> > - verified the checksums and GPG files match the corresponding >> release >> > >>> files >> > - verified that the source archives do not contains any binaries >> > - build the source release with Scala 2.11 successfully. >> > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and >> > >>> [FLINK-13688], >> > but >> > both are not release blockers. Other than that, all tests are >> passed. >> > - ran all e2e tests which don't need download external packages >> (it's >> > >>> very >> > unstable >> > in China and almost impossible to download them), all passed. >> > - started local cluster, ran some examples. Met a small website >> > display >> > issue >> > [FLINK-13591], which is also not a release blocker. >> > >> > Although we have pushed some fixes around blink planner and hive >> > integration >> > after RC2, but consider these are both preview features, I'm lean >> to >> > be >> > >>> ok >> > to release >> > without these fixes. >> > >> > +1 from my side. (binding) >> > >> > Best, >> > Kurt >> > >> > >> > On Wed,
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi, regarding the LegacyTypeInformation, esp. for decimals: I don't have a clear answer yet, but I think it should not limit us. If possible, it should travel through the type inference and we only need some special cases at some locations, e.g. when computing the leastRestrictive. E.g. the logical type root is set correctly, which is required for family checking. I'm wondering when a decimal type with precision can occur. Usually, it should come from literals or a defined column. But I think it might also be ok that the flink-planner just receives a decimal with precision and treats it with Java semantics. Doing it on the planner side is the easiest option but also the one that could cause side-effects if the back-and-forth conversion of LegacyTypeConverter is not a 1:1 mapping anymore. I guess we will see the implications during implementation. All old Flink tests should still pass. Regards, Timo On 15.08.19 at 10:43, JingsongLee wrote: Hi @Timo Walther @Dawid Wysakowicz: Now, flink-planner has some legacy DataTypes, like: legacy decimal, legacy basic array type info... And if the new type inference infers a Decimal/VarChar with precision, the conversion will fail in TypeConversions. The better we do on DataType, the more problems we will have with the TypeInformation conversion, and the new TypeInference is largely precision related. What do you think? 1. Should TypeConversions support all data types and flink-planner support new types? 2. Or do a special conversion between flink-planner and type inference. (There are many problems with the conversion between TypeInformation and DataType, and I think we should solve them completely in 1.10.) Best, Jingsong Lee -- From: JingsongLee Send Time: Thursday, August 15, 2019 10:31 To: dev Subject: Re: [DISCUSS] FLIP-51: Rework of the Expression Design Hi Jark: I'll add a chapter to list the blink planner extended functions. Best, Jingsong Lee -- From: Jark Wu Send Time: Thursday, August 15, 2019 05:12 To: dev Subject: Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding the function behavior, shall we also include new functions from the blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc.)? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: Hi Jingsong, thanks for writing down this FLIP. Big +1 from my side to finally get rid of PlannerExpressions and have consistent and well-defined behavior for Table API and SQL updated to FLIP-37. We might need to discuss some of the behavior of particular functions but this should not affect the actual FLIP-51. Regards, Timo On 13.08.19 at 12:55, JingsongLee wrote: Hi everyone, We would like to start a discussion thread on "FLIP-51: Rework of the Expression Design" (Design doc: [1], FLIP: [2]), where we describe how to improve the new Java Expressions to work with type inference and convert expressions to the Calcite RexNode. This is a follow-up plan for FLIP-32 [3] and FLIP-37 [4]. This FLIP is mostly based on FLIP-37. This FLIP addresses several shortcomings of the current design: - New Expressions still use PlannerExpressions for type inference and conversion to RexNode. Flink-planner and blink-planner have a lot of repetitive code and logic. - Make the Table API and Calcite definitions consistent. - Reduce the complexity of function development. - More powerful functions for users. Key changes can be summarized as follows: - Improve the interface of FunctionDefinition.
- Introduce type inference for built-in functions. - Introduce ExpressionConverter to convert Expressions to Calcite RexNodes. - Remove repetitive code and logic in the planners. I also listed the type inference and behavior of all built-in functions [5], to verify that the interface is satisfied. After introducing type inference to the table-common module, the planners should have unified function behavior. This also gives the community the chance to quickly discuss the types and behavior of functions one last time before they are declared stable. Looking forward to your feedback. Thank you. [1] https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions [4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System [5] https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing Best, Jingsong Lee
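Since the discussion above touches on computing the leastRestrictive type for decimals with precision, here is a small worked sketch of the common SQL-style rule for combining two DECIMAL types. It is an illustration under assumed rules, not necessarily what the FLIP's type inference will implement, and the precision cap is an assumption.
{code:java}
// Illustration only (standard SQL-style rule, not necessarily the FLIP's
// actual inference): the least restrictive of two DECIMAL types keeps enough
// integer digits of both inputs and the larger of the two scales.
public class DecimalLeastRestrictiveSketch {

    static final int MAX_PRECISION = 38; // assumed upper bound

    static int[] leastRestrictive(int p1, int s1, int p2, int s2) {
        int scale = Math.max(s1, s2);
        int integerDigits = Math.max(p1 - s1, p2 - s2);
        int precision = Math.min(integerDigits + scale, MAX_PRECISION);
        return new int[] {precision, scale};
    }

    public static void main(String[] args) {
        // DECIMAL(10, 2) combined with DECIMAL(5, 4) -> DECIMAL(12, 4)
        int[] r = leastRestrictive(10, 2, 5, 4);
        System.out.println("DECIMAL(" + r[0] + ", " + r[1] + ")");
    }
}
{code}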
Re: Watermarks not propagated to WebUI?
I remember an issue regarding the watermark fetch request from the WebUI exceeding some HTTP size limit, since it tries to fetch all watermarks at once, and the format of this request isn't exactly efficient. Querying metrics for individual operators still works since the request is small enough. Not sure whether we ever fixed that. On 15/08/2019 12:01, Jan Lukavský wrote: Hi, Thomas, thanks for confirming this. I have noticed, that in 1.9 the WebUI has been reworked a lot, does anyone know if this is still an issue? I currently cannot easily try 1.9, so I cannot confirm or disprove that. Jan On 8/14/19 6:25 PM, Thomas Weise wrote: I have also noticed this issue (Flink 1.5, Flink 1.8), and it appears with higher parallelism. This can be confusing to the user when watermarks actually work and can be observed using the metrics. On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský wrote: Hi, is it possible, that watermarks are sometimes not propagated to WebUI, although they are internally moving as normal? I see in WebUI every operator showing "No Watermark", but outputs seem to be propagated to sink (and there are watermark sensitive operations involved - e.g. reductions on fixed windows without early emitting). More strangely, this happens when I increase parallelism above some threshold. If I use parallelism of N, watermarks are shown, when I increase it above some number (seems not to be exactly deterministic), watermarks seems to disappear. I'm using Flink 1.8.1. Did anyone experience something like this before? Jan
Re: flink 1.9 DDL nested json derived
Hi Shengnan, Yes, Flink 1.9 supports deriving the schema of nested JSON. You should declare the ROW type with the nested schema explicitly. I tested a similar DDL against 1.9.0 RC2 and it worked well. CREATE TABLE kafka_json_source ( rowtime VARCHAR, user_name VARCHAR, event ROW<message_type VARCHAR, message VARCHAR> ) WITH ( 'connector.type' = 'kafka', 'connector.version' = 'universal', 'connector.topic' = 'test-json', 'connector.startup-mode' = 'earliest-offset', 'connector.properties.0.key' = 'zookeeper.connect', 'connector.properties.0.value' = 'localhost:2181', 'connector.properties.1.key' = 'bootstrap.servers', 'connector.properties.1.value' = 'localhost:9092', 'update-mode' = 'append', 'format.type' = 'json', 'format.derive-schema' = 'true' ); The kafka message is {"rowtime": "2018-03-12T08:00:00Z", "user_name": "Alice", "event": { "message_type": "WARNING", "message": "This is a warning."}} Thanks, Jark On Thu, 15 Aug 2019 at 14:12, Shengnan YU wrote: > > Hi guys > I am trying the DDL feature in branch 1.9-release. I am stuck on > creating a table from kafka with nested json format. Is it possible to > specify a "Row" type of columns to derive the nested json schema? > > String sql = "create table kafka_stream(\n" + > " a varchar, \n" + > " b varchar,\n" + > " c int,\n" + > " inner_json row\n" + > ") with (\n" + > " 'connector.type' ='kafka',\n" + > " 'connector.version' = '0.11',\n" + > " 'update-mode' = 'append', \n" + > " 'connector.topic' = 'test',\n" + > " 'connector.properties.0.key' = 'bootstrap.servers',\n" + > " 'connector.properties.0.value' = 'localhost:9092',\n" + > " 'format.type' = 'json', \n" + > " 'format.derive-schema' = 'true'\n" + > ")\n"; > > Thank you very much! >
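As a follow-up illustration, here is a sketch of how the table declared by the DDL above could be consumed from the Java Table API in 1.9, with the nested fields of the ROW type addressed via dot syntax. The DDL string is elided (it is the CREATE TABLE statement shown above), and running it of course requires a Kafka broker with the referenced topic.
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class NestedJsonDdlExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

        // The CREATE TABLE kafka_json_source statement shown above.
        String ddl = "...";
        tEnv.sqlUpdate(ddl);

        // Nested fields of the derived ROW type are addressed with dot syntax.
        Table result = tEnv.sqlQuery(
                "SELECT user_name, event.message_type, event.message FROM kafka_json_source");
        tEnv.toAppendStream(result, Row.class).print();

        env.execute("nested json example");
    }
}
{code}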
[jira] [Created] (FLINK-13736) Support count window with blink planner in batch mode
Kurt Young created FLINK-13736: -- Summary: Support count window with blink planner in batch mode Key: FLINK-13736 URL: https://issues.apache.org/jira/browse/FLINK-13736 Project: Flink Issue Type: New Feature Components: Table SQL / Planner, Table SQL / Runtime Reporter: Kurt Young -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13735) Support session window with blink planner in batch mode
Kurt Young created FLINK-13735: -- Summary: Support session window with blink planner in batch mode Key: FLINK-13735 URL: https://issues.apache.org/jira/browse/FLINK-13735 Project: Flink Issue Type: New Feature Components: Table SQL / Planner Reporter: Kurt Young -- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Gordon & Timo, Thanks for the feedback, and I agree with it. I will document this in the release notes. Best, Kurt On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai wrote: > Hi Kurt, > > With the same argument as before, given that it is mentioned in the release > announcement that it is a preview feature, I would not block this release > because of it. > Nevertheless, it would be important to mention this explicitly in the > release notes [1]. > > Regards, > Gordon > > [1] https://github.com/apache/flink/pull/9438 > > On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: > > > Hi Kurt, > > > > I agree that this is a serious bug. However, I would not block the > > release because of this. As you said, there is a workaround and the > > `execute()` works in the most common case of a single execution. We can > > fix this in a minor release shortly after. > > > > What do others think? > > > > Regards, > > Timo > > > > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > > HI, > > > > > > We just find a serious bug around blink planner: > > > https://issues.apache.org/jira/browse/FLINK-13708 > > > When user reused the table environment instance, and call `execute` > > method > > > multiple times for > > > different sql, the later call will trigger the earlier ones to be > > > re-executed. > > > > > > It's a serious bug but seems we also have a work around, which is never > > > reuse the table environment > > > object. I'm not sure if we should treat this one as blocker issue of > > 1.9.0. > > > > > > What's your opinion? > > > > > > Best, > > > Kurt > > > > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > > > >> +1 (non-binding) > > >> > > >> Jepsen test suite passed 10 times consecutively > > >> > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > > > >> wrote: > > >> > > >>> +1 > > >>> > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you a > > >>> managed YARN and Google Cloud Storage (GCS)): > > >>>- tried both YARN session mode and YARN per-job mode, also using > > >>> bin/flink list/cancel/etc. against a YARN session cluster > > >>>- ran examples that write to GCS, both with the native Hadoop > > >> FileSystem > > >>> and a custom “plugin” FileSystem > > >>>- ran stateful streaming jobs that use GCS as a checkpoint backend > > >>>- tried running SQL programs on YARN using the SQL Cli: this > worked > > for > > >>> YARN session mode but not for YARN per-job mode. Looking at the code > I > > >>> don’t think per-job mode would work from seeing how it is > implemented. > > >> But > > >>> I think it’s an OK restriction to have for now > > >>>- in all the testing I had fine-grained recovery (region failover) > > >>> enabled but I didn’t simulate any failures > > >>> > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > Hi, > > > > Thanks for preparing this release candidate. I have verified the > > >>> following: > > - verified the checksums and GPG files match the corresponding > release > > >>> files > > - verified that the source archives do not contains any binaries > > - build the source release with Scala 2.11 successfully. > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > >>> [FLINK-13688], > > but > > both are not release blockers. Other than that, all tests are > passed. > > - ran all e2e tests which don't need download external packages > (it's > > >>> very > > unstable > > in China and almost impossible to download them), all passed. > > - started local cluster, ran some examples. 
Met a small website > > display > > issue > > [FLINK-13591], which is also not a release blocker. > > > > Although we have pushed some fixes around blink planner and hive > > integration > > after RC2, but consider these are both preview features, I'm lean to > > be > > >>> ok > > to release > > without these fixes. > > > > +1 from my side. (binding) > > > > Best, > > Kurt > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > Hi Gordon, > > > > > > I have verified the following things: > > > > > > - build the source release with Scala 2.12 and Scala 2.11 > > successfully > > > - checked/verified signatures and hashes > > > - checked that all POM files point to the same version > > > - ran some flink table related end-to-end tests locally and > succeeded > > > (except TPC-H e2e failed which is reported in FLINK-13704) > > > - started cluster for both Scala 2.11 and 2.12, ran examples, > > verified > > >>> web > > > ui and log output, nothing unexpected > > > - started cluster, ran a SQL query to temporal join with kafka > source > > >>> and > > > mysql jdbc table, and write results to kafka again. Using DDL to > > >> create > > >>> the > > > source and sinks. looks good. > > > - reviewed the release PR
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Kurt, With the same argument as before, given that it is mentioned in the release announcement that it is a preview feature, I would not block this release because of it. Nevertheless, it would be important to mention this explicitly in the release notes [1]. Regards, Gordon [1] https://github.com/apache/flink/pull/9438 On Thu, Aug 15, 2019 at 11:29 AM Timo Walther wrote: > Hi Kurt, > > I agree that this is a serious bug. However, I would not block the > release because of this. As you said, there is a workaround and the > `execute()` works in the most common case of a single execution. We can > fix this in a minor release shortly after. > > What do others think? > > Regards, > Timo > > > Am 15.08.19 um 11:23 schrieb Kurt Young: > > HI, > > > > We just find a serious bug around blink planner: > > https://issues.apache.org/jira/browse/FLINK-13708 > > When user reused the table environment instance, and call `execute` > method > > multiple times for > > different sql, the later call will trigger the earlier ones to be > > re-executed. > > > > It's a serious bug but seems we also have a work around, which is never > > reuse the table environment > > object. I'm not sure if we should treat this one as blocker issue of > 1.9.0. > > > > What's your opinion? > > > > Best, > > Kurt > > > > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > >> +1 (non-binding) > >> > >> Jepsen test suite passed 10 times consecutively > >> > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > >> wrote: > >> > >>> +1 > >>> > >>> I did some testing on a Google Cloud Dataproc cluster (it gives you a > >>> managed YARN and Google Cloud Storage (GCS)): > >>>- tried both YARN session mode and YARN per-job mode, also using > >>> bin/flink list/cancel/etc. against a YARN session cluster > >>>- ran examples that write to GCS, both with the native Hadoop > >> FileSystem > >>> and a custom “plugin” FileSystem > >>>- ran stateful streaming jobs that use GCS as a checkpoint backend > >>>- tried running SQL programs on YARN using the SQL Cli: this worked > for > >>> YARN session mode but not for YARN per-job mode. Looking at the code I > >>> don’t think per-job mode would work from seeing how it is implemented. > >> But > >>> I think it’s an OK restriction to have for now > >>>- in all the testing I had fine-grained recovery (region failover) > >>> enabled but I didn’t simulate any failures > >>> > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > Hi, > > Thanks for preparing this release candidate. I have verified the > >>> following: > - verified the checksums and GPG files match the corresponding release > >>> files > - verified that the source archives do not contains any binaries > - build the source release with Scala 2.11 successfully. > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > >>> [FLINK-13688], > but > both are not release blockers. Other than that, all tests are passed. > - ran all e2e tests which don't need download external packages (it's > >>> very > unstable > in China and almost impossible to download them), all passed. > - started local cluster, ran some examples. Met a small website > display > issue > [FLINK-13591], which is also not a release blocker. > > Although we have pushed some fixes around blink planner and hive > integration > after RC2, but consider these are both preview features, I'm lean to > be > >>> ok > to release > without these fixes. > > +1 from my side. 
(binding) > > Best, > Kurt > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > Hi Gordon, > > > > I have verified the following things: > > > > - build the source release with Scala 2.12 and Scala 2.11 > successfully > > - checked/verified signatures and hashes > > - checked that all POM files point to the same version > > - ran some flink table related end-to-end tests locally and succeeded > > (except TPC-H e2e failed which is reported in FLINK-13704) > > - started cluster for both Scala 2.11 and 2.12, ran examples, > verified > >>> web > > ui and log output, nothing unexpected > > - started cluster, ran a SQL query to temporal join with kafka source > >>> and > > mysql jdbc table, and write results to kafka again. Using DDL to > >> create > >>> the > > source and sinks. looks good. > > - reviewed the release PR > > > > As FLINK-13704 is not recognized as blocker issue, so +1 from my side > > (non-binding). > > > > On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > >>> wrote: > >> Hi Richard, > >> > >> although I can see that it would be handy for users who have PubSub > >> set > > up, > >> I would rather not include examples which require an external > >>> dependency > >> into the Flink distribution. I think examples should be > >> se
Re: Watermarks not propagated to WebUI?
Hi, Thomas, thanks for confirming this. I have noticed, that in 1.9 the WebUI has been reworked a lot, does anyone know if this is still an issue? I currently cannot easily try 1.9, so I cannot confirm or disprove that. Jan On 8/14/19 6:25 PM, Thomas Weise wrote: I have also noticed this issue (Flink 1.5, Flink 1.8), and it appears with higher parallelism. This can be confusing to the user when watermarks actually work and can be observed using the metrics. On Wed, Aug 14, 2019 at 7:36 AM Jan Lukavský wrote: Hi, is it possible, that watermarks are sometimes not propagated to WebUI, although they are internally moving as normal? I see in WebUI every operator showing "No Watermark", but outputs seem to be propagated to sink (and there are watermark sensitive operations involved - e.g. reductions on fixed windows without early emitting). More strangely, this happens when I increase parallelism above some threshold. If I use parallelism of N, watermarks are shown, when I increase it above some number (seems not to be exactly deterministic), watermarks seems to disappear. I'm using Flink 1.8.1. Did anyone experience something like this before? Jan
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
+1 (non-binding) Tested in AWS EMR Yarn: 1 master and 4 worker nodes (m5.xlarge: 4 vCore, 16 GiB). EMR runs only on Java 8. Fine-grained recovery is enabled by default. Modified E2E test scripts can be found here (asserting output): https://github.com/azagrebin/flink/commits/FLINK-13597 Batch SQL: - S3(a) filesystem over HADOOP works out-of-the-box (already on AWS class path) and also if put in plugins Streaming SQL: - Hadoop output (s3 does not support recoverable writers) On Thu, Aug 15, 2019 at 11:24 AM Kurt Young wrote: > HI, > > We just find a serious bug around blink planner: > https://issues.apache.org/jira/browse/FLINK-13708 > When user reused the table environment instance, and call `execute` method > multiple times for > different sql, the later call will trigger the earlier ones to be > re-executed. > > It's a serious bug but seems we also have a work around, which is never > reuse the table environment > object. I'm not sure if we should treat this one as blocker issue of 1.9.0. > > What's your opinion? > > Best, > Kurt > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > > > +1 (non-binding) > > > > Jepsen test suite passed 10 times consecutively > > > > On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > > wrote: > > > > > +1 > > > > > > I did some testing on a Google Cloud Dataproc cluster (it gives you a > > > managed YARN and Google Cloud Storage (GCS)): > > > - tried both YARN session mode and YARN per-job mode, also using > > > bin/flink list/cancel/etc. against a YARN session cluster > > > - ran examples that write to GCS, both with the native Hadoop > > FileSystem > > > and a custom “plugin” FileSystem > > > - ran stateful streaming jobs that use GCS as a checkpoint backend > > > - tried running SQL programs on YARN using the SQL Cli: this worked > for > > > YARN session mode but not for YARN per-job mode. Looking at the code I > > > don’t think per-job mode would work from seeing how it is implemented. > > But > > > I think it’s an OK restriction to have for now > > > - in all the testing I had fine-grained recovery (region failover) > > > enabled but I didn’t simulate any failures > > > > > > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > > > > > Hi, > > > > > > > > Thanks for preparing this release candidate. I have verified the > > > following: > > > > > > > > - verified the checksums and GPG files match the corresponding > release > > > files > > > > - verified that the source archives do not contains any binaries > > > > - build the source release with Scala 2.11 successfully. > > > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > > [FLINK-13688], > > > > but > > > > both are not release blockers. Other than that, all tests are passed. > > > > - ran all e2e tests which don't need download external packages (it's > > > very > > > > unstable > > > > in China and almost impossible to download them), all passed. > > > > - started local cluster, ran some examples. Met a small website > display > > > > issue > > > > [FLINK-13591], which is also not a release blocker. > > > > > > > > Although we have pushed some fixes around blink planner and hive > > > > integration > > > > after RC2, but consider these are both preview features, I'm lean to > be > > > ok > > > > to release > > > > without these fixes. > > > > > > > > +1 from my side. 
(binding) > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > > > >> Hi Gordon, > > > >> > > > >> I have verified the following things: > > > >> > > > >> - build the source release with Scala 2.12 and Scala 2.11 > successfully > > > >> - checked/verified signatures and hashes > > > >> - checked that all POM files point to the same version > > > >> - ran some flink table related end-to-end tests locally and > succeeded > > > >> (except TPC-H e2e failed which is reported in FLINK-13704) > > > >> - started cluster for both Scala 2.11 and 2.12, ran examples, > verified > > > web > > > >> ui and log output, nothing unexpected > > > >> - started cluster, ran a SQL query to temporal join with kafka > source > > > and > > > >> mysql jdbc table, and write results to kafka again. Using DDL to > > create > > > the > > > >> source and sinks. looks good. > > > >> - reviewed the release PR > > > >> > > > >> As FLINK-13704 is not recognized as blocker issue, so +1 from my > side > > > >> (non-binding). > > > >> > > > >> On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > > > wrote: > > > >> > > > >>> Hi Richard, > > > >>> > > > >>> although I can see that it would be handy for users who have PubSub > > set > > > >> up, > > > >>> I would rather not include examples which require an external > > > dependency > > > >>> into the Flink distribution. I think examples should be > > self-contained. > > > >> My > > > >>> concern is that we would bloat the distribution for many users at > the > > > >>> benefit of a few. Instead, I think it would be better to mak
Re: How to load udf jars in flink program
Hi Jiangang, Does "flink run -j jarpath ..." work for you? If that jar id deployed to the same path on each worker machine, you can try "flink run -C classpath ..." as well. Thanks, Zhu Zhu 刘建刚 于2019年8月15日周四 下午5:31写道: > We are using per-job to load udf jar when start job. Our jar file is > in another path but not flink's lib path. In the main function, we use > classLoader to load the jar file by the jar path. But it reports the > following error when job starts running. > If the jar file is in lib, everything is ok. But our udf jar file is > managed in a special path. How can I load udf jars in flink program with > only giving the jar path? > > org.apache.flink.api.common.InvalidProgramException: Table program cannot be > compiled. This is a bug. Please file an issue. > at > org.apache.flink.table.codegen.Compiler$class.compile(Compiler.scala:36) > at > org.apache.flink.table.runtime.CRowProcessRunner.compile(CRowProcessRunner.scala:35) > at > org.apache.flink.table.runtime.CRowProcessRunner.open(CRowProcessRunner.scala:49) > at > org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) > at > org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102) > at > org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:723) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.codehaus.commons.compiler.CompileException: Line 5, Column 1: > Cannot determine simple type name "com" > at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11877) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6758) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6519) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at > org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6498) > at org.codehaus.janino.UnitCompiler.access$14000(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6405) > at > org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6400) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3983) > at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6400) > at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6393) > at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3982) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) > at org.codehaus.janino.UnitCompiler.access$1300(UnitCompiler.java:218) > at org.codehaus.janino.UnitCompiler$25.getType(UnitCompiler.java:8206) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6798) > at org.codehaus.janino.UnitCompiler.access$14500(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6423) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6418) > at org.codehaus.janino.Java$FieldAccess.accept(Java.java:4365) > at > 
org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) > at > org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) > at > org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) > at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) > at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) > at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6780) > at org.codehaus.janino.UnitCompiler.access$14300(UnitCompiler.java:218) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6421) > at > org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6418) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4279) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) > at > org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) >
How to load udf jars in flink program
We are using per-job to load udf jar when start job. Our jar file is in another path but not flink's lib path. In the main function, we use classLoader to load the jar file by the jar path. But it reports the following error when job starts running. If the jar file is in lib, everything is ok. But our udf jar file is managed in a special path. How can I load udf jars in flink program with only giving the jar path? org.apache.flink.api.common.InvalidProgramException: Table program cannot be compiled. This is a bug. Please file an issue. at org.apache.flink.table.codegen.Compiler$class.compile(Compiler.scala:36) at org.apache.flink.table.runtime.CRowProcessRunner.compile(CRowProcessRunner.scala:35) at org.apache.flink.table.runtime.CRowProcessRunner.open(CRowProcessRunner.scala:49) at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36) at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102) at org.apache.flink.streaming.api.operators.ProcessOperator.open(ProcessOperator.java:56) at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:723) at java.lang.Thread.run(Thread.java:745) Caused by: org.codehaus.commons.compiler.CompileException: Line 5, Column 1: Cannot determine simple type name "com" at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11877) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6758) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6519) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getReferenceType(UnitCompiler.java:6532) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6498) at org.codehaus.janino.UnitCompiler.access$14000(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6405) at org.codehaus.janino.UnitCompiler$22$1.visitReferenceType(UnitCompiler.java:6400) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3983) at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6400) at org.codehaus.janino.UnitCompiler$22.visitType(UnitCompiler.java:6393) at org.codehaus.janino.Java$ReferenceType.accept(Java.java:3982) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) at org.codehaus.janino.UnitCompiler.access$1300(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$25.getType(UnitCompiler.java:8206) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6798) at org.codehaus.janino.UnitCompiler.access$14500(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6423) at org.codehaus.janino.UnitCompiler$22$2$1.visitFieldAccess(UnitCompiler.java:6418) at org.codehaus.janino.Java$FieldAccess.accept(Java.java:4365) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) at 
org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393) at org.codehaus.janino.UnitCompiler.getType2(UnitCompiler.java:6780) at org.codehaus.janino.UnitCompiler.access$14300(UnitCompiler.java:218) at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6421) at org.codehaus.janino.UnitCompiler$22$2$1.visitAmbiguousName(UnitCompiler.java:6418) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4279) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6418) at org.codehaus.janino.UnitCompiler$22$2.visitLvalue(UnitCompiler.java:6414) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4203) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6414) at org.codehaus.janino.UnitCompiler$22.visitRvalue(UnitCompiler.java:6393) at org.codehaus.janino.Java$Rvalue.accept(Java.java:4171) at org.codehaus.janino.UnitCompiler.getType(UnitCompiler.java:6393
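Below is a minimal, hypothetical sketch of the pattern described in this question (loading the UDF jar from a custom path inside main()) together with the classpath fix suggested in Zhu Zhu's reply; the jar path, UDF class name, and registered function name are made up for illustration, and the Table API calls follow the 1.9 interfaces.

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class UdfJarLoadingSketch {

    public static void main(String[] args) throws Exception {
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Load the UDF jar from a custom path in main(), as described in the question above.
        URL udfJarUrl = new File("/data/udfs/my-udfs.jar").toURI().toURL();
        URLClassLoader udfClassLoader = new URLClassLoader(
                new URL[] { udfJarUrl }, Thread.currentThread().getContextClassLoader());
        Class<?> udfClass = Class.forName("com.example.MyUpperUdf", true, udfClassLoader);
        ScalarFunction udf = (ScalarFunction) udfClass.getDeclaredConstructor().newInstance();

        // Registration succeeds on the client, but the generated code is compiled on the
        // TaskManagers against *their* user-code class loader. If the jar only exists on the
        // client, Janino cannot resolve "com.example..." there, which is the
        // "Cannot determine simple type name" error in the stack trace above. Shipping the jar
        // to the user classpath, e.g. "flink run -C file:///same/path/on/every/node/my-udfs.jar ...",
        // or dropping it into lib/, makes it visible to that class loader.
        tableEnv.registerFunction("my_upper", udf);
    }
}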
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
Hi Kurt, I agree that this is a serious bug. However, I would not block the release because of this. As you said, there is a workaround and the `execute()` works in the most common case of a single execution. We can fix this in a minor release shortly after. What do others think? Regards, Timo Am 15.08.19 um 11:23 schrieb Kurt Young: HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: +1 (non-binding) Jepsen test suite passed 10 times consecutively On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek wrote: +1 I did some testing on a Google Cloud Dataproc cluster (it gives you a managed YARN and Google Cloud Storage (GCS)): - tried both YARN session mode and YARN per-job mode, also using bin/flink list/cancel/etc. against a YARN session cluster - ran examples that write to GCS, both with the native Hadoop FileSystem and a custom “plugin” FileSystem - ran stateful streaming jobs that use GCS as a checkpoint backend - tried running SQL programs on YARN using the SQL Cli: this worked for YARN session mode but not for YARN per-job mode. Looking at the code I don’t think per-job mode would work from seeing how it is implemented. But I think it’s an OK restriction to have for now - in all the testing I had fine-grained recovery (region failover) enabled but I didn’t simulate any failures On 14. Aug 2019, at 15:20, Kurt Young wrote: Hi, Thanks for preparing this release candidate. I have verified the following: - verified the checksums and GPG files match the corresponding release files - verified that the source archives do not contains any binaries - build the source release with Scala 2.11 successfully. - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and [FLINK-13688], but both are not release blockers. Other than that, all tests are passed. - ran all e2e tests which don't need download external packages (it's very unstable in China and almost impossible to download them), all passed. - started local cluster, ran some examples. Met a small website display issue [FLINK-13591], which is also not a release blocker. Although we have pushed some fixes around blink planner and hive integration after RC2, but consider these are both preview features, I'm lean to be ok to release without these fixes. +1 from my side. (binding) Best, Kurt On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: Hi Gordon, I have verified the following things: - build the source release with Scala 2.12 and Scala 2.11 successfully - checked/verified signatures and hashes - checked that all POM files point to the same version - ran some flink table related end-to-end tests locally and succeeded (except TPC-H e2e failed which is reported in FLINK-13704) - started cluster for both Scala 2.11 and 2.12, ran examples, verified web ui and log output, nothing unexpected - started cluster, ran a SQL query to temporal join with kafka source and mysql jdbc table, and write results to kafka again. Using DDL to create the source and sinks. looks good. - reviewed the release PR As FLINK-13704 is not recognized as blocker issue, so +1 from my side (non-binding). 
On Tue, 13 Aug 2019 at 17:07, Till Rohrmann wrote: Hi Richard, although I can see that it would be handy for users who have PubSub set up, I would rather not include examples which require an external dependency into the Flink distribution. I think examples should be self-contained. My concern is that we would bloat the distribution for many users at the benefit of a few. Instead, I think it would be better to make these examples available differently, maybe through Flink's ecosystem website or maybe a new examples section in Flink's documentation. Cheers, Till On Tue, Aug 13, 2019 at 9:43 AM Jark Wu wrote: Hi Till, After thinking about we can use VARCHAR as an alternative of timestamp/time/date. I'm fine with not recognize it as a blocker issue. We can fix it into 1.9.1. Thanks, Jark On Tue, 13 Aug 2019 at 15:10, Richard Deurwaarder wrote: Hello all, I noticed the PubSub example jar is not included in the examples/ dir of flink-dist. I've created https://issues.apache.org/jira/browse/FLINK-13700 + https://github.com/apache/flink/pull/9424/files to fix this. I will leave it up to you to decide if we want to add this to 1.9.0. Regards, Richard On Tue, Aug 13, 2019 at 9:04 AM Till Rohrmann < trohrm...@apache.org> wrote: Hi Jark, thanks for reporting this issue. Could this be a documented limitation of Blink'
Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2
HI, We just find a serious bug around blink planner: https://issues.apache.org/jira/browse/FLINK-13708 When user reused the table environment instance, and call `execute` method multiple times for different sql, the later call will trigger the earlier ones to be re-executed. It's a serious bug but seems we also have a work around, which is never reuse the table environment object. I'm not sure if we should treat this one as blocker issue of 1.9.0. What's your opinion? Best, Kurt On Thu, Aug 15, 2019 at 2:01 PM Gary Yao wrote: > +1 (non-binding) > > Jepsen test suite passed 10 times consecutively > > On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek > wrote: > > > +1 > > > > I did some testing on a Google Cloud Dataproc cluster (it gives you a > > managed YARN and Google Cloud Storage (GCS)): > > - tried both YARN session mode and YARN per-job mode, also using > > bin/flink list/cancel/etc. against a YARN session cluster > > - ran examples that write to GCS, both with the native Hadoop > FileSystem > > and a custom “plugin” FileSystem > > - ran stateful streaming jobs that use GCS as a checkpoint backend > > - tried running SQL programs on YARN using the SQL Cli: this worked for > > YARN session mode but not for YARN per-job mode. Looking at the code I > > don’t think per-job mode would work from seeing how it is implemented. > But > > I think it’s an OK restriction to have for now > > - in all the testing I had fine-grained recovery (region failover) > > enabled but I didn’t simulate any failures > > > > > On 14. Aug 2019, at 15:20, Kurt Young wrote: > > > > > > Hi, > > > > > > Thanks for preparing this release candidate. I have verified the > > following: > > > > > > - verified the checksums and GPG files match the corresponding release > > files > > > - verified that the source archives do not contains any binaries > > > - build the source release with Scala 2.11 successfully. > > > - ran `mvn verify` locally, met 2 issuses [FLINK-13687] and > > [FLINK-13688], > > > but > > > both are not release blockers. Other than that, all tests are passed. > > > - ran all e2e tests which don't need download external packages (it's > > very > > > unstable > > > in China and almost impossible to download them), all passed. > > > - started local cluster, ran some examples. Met a small website display > > > issue > > > [FLINK-13591], which is also not a release blocker. > > > > > > Although we have pushed some fixes around blink planner and hive > > > integration > > > after RC2, but consider these are both preview features, I'm lean to be > > ok > > > to release > > > without these fixes. > > > > > > +1 from my side. (binding) > > > > > > Best, > > > Kurt > > > > > > > > > On Wed, Aug 14, 2019 at 5:13 PM Jark Wu wrote: > > > > > >> Hi Gordon, > > >> > > >> I have verified the following things: > > >> > > >> - build the source release with Scala 2.12 and Scala 2.11 successfully > > >> - checked/verified signatures and hashes > > >> - checked that all POM files point to the same version > > >> - ran some flink table related end-to-end tests locally and succeeded > > >> (except TPC-H e2e failed which is reported in FLINK-13704) > > >> - started cluster for both Scala 2.11 and 2.12, ran examples, verified > > web > > >> ui and log output, nothing unexpected > > >> - started cluster, ran a SQL query to temporal join with kafka source > > and > > >> mysql jdbc table, and write results to kafka again. Using DDL to > create > > the > > >> source and sinks. looks good. 
> > >> - reviewed the release PR > > >> > > >> As FLINK-13704 is not recognized as blocker issue, so +1 from my side > > >> (non-binding). > > >> > > >> On Tue, 13 Aug 2019 at 17:07, Till Rohrmann > > wrote: > > >> > > >>> Hi Richard, > > >>> > > >>> although I can see that it would be handy for users who have PubSub > set > > >> up, > > >>> I would rather not include examples which require an external > > dependency > > >>> into the Flink distribution. I think examples should be > self-contained. > > >> My > > >>> concern is that we would bloat the distribution for many users at the > > >>> benefit of a few. Instead, I think it would be better to make these > > >>> examples available differently, maybe through Flink's ecosystem > website > > >> or > > >>> maybe a new examples section in Flink's documentation. > > >>> > > >>> Cheers, > > >>> Till > > >>> > > >>> On Tue, Aug 13, 2019 at 9:43 AM Jark Wu wrote: > > >>> > > Hi Till, > > > > After thinking about we can use VARCHAR as an alternative of > > timestamp/time/date. > > I'm fine with not recognize it as a blocker issue. > > We can fix it into 1.9.1. > > > > > > Thanks, > > Jark > > > > > > On Tue, 13 Aug 2019 at 15:10, Richard Deurwaarder > > >>> wrote: > > > > > Hello all, > > > > > > I noticed the PubSub example jar is not included in the examples/ > dir > > >>> of > > > flink-dist. I've created > >
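For readers hitting FLINK-13708, a hedged sketch of the workaround described above (never reuse the TableEnvironment instance); the source/sink tables, SQL statements, and job names are invented for illustration and would need to be registered first.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class Flink13708WorkaroundSketch {

    public static void main(String[] args) throws Exception {
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();

        // Affected pattern: reusing one TableEnvironment and calling execute() several times
        // causes the later call to re-trigger the earlier INSERTs as well.
        // Workaround: create a fresh TableEnvironment for each execution.
        TableEnvironment tEnvA = TableEnvironment.create(settings);
        tEnvA.sqlUpdate("INSERT INTO sink_a SELECT * FROM source_a");
        tEnvA.execute("job-a");

        TableEnvironment tEnvB = TableEnvironment.create(settings); // new instance, not reused
        tEnvB.sqlUpdate("INSERT INTO sink_b SELECT * FROM source_b");
        tEnvB.execute("job-b"); // runs only the second INSERT
    }
}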
Re: [DISCUSS] Reducing build times
Thanks for starting this discussion Chesnay. I think it has become obvious to the Flink community that with the existing build setup we cannot really deliver fast build times which are essential for fast iteration cycles and high developer productivity. The reasons for this situation are manifold but it is definitely affected by Flink's project growth, not always optimal tests and the inflexibility that everything needs to be built. Hence, I consider the reduction of build times crucial for the project's health and future growth. Without necessarily voicing a strong preference for any of the presented suggestions, I wanted to comment on each of them: 1. This sounds promising. Could the reason why we don't reuse JVMs date back to the time when we still had a lot of static fields in Flink which made it hard to reuse JVMs and the potentially mutated global state? 2. Building hand-crafted solutions around a build system in order to compensate for its limitations which other build systems support out of the box sounds like the not invented here syndrome to me. Reinventing the wheel has historically proven to be usually not the best solution and it often comes with a high maintenance price tag. Moreover, it would add just another layer of complexity around our existing build system. I think the current state where we have the maven setup in pom files and for Travis multiple bash scripts specializing the builds to make it fit the time limit is already not very transparent/easy to understand. 3. I could see this work but it also requires a very good understanding of Flink of every committer because the committer needs to know which tests would be good to run additionally. 4. I would be against this option solely to decrease our build time. My observation is that the community does not monitor the health of the cron jobs well enough. In the past the cron jobs have been unstable for as long as a complete release cycle. Moreover, I've seen that PRs were merged which passed Travis but broke the cron jobs. Consequently, I fear that this option would deteriorate Flink's stability. 5. I would rephrase this point into changing the build system. Gradle could be one candidate but there are also other build systems out there like Bazel. Changing the build system would indeed be a major endeavour but I could see the long term benefits of such a change (similar to having a consistent and enforced code style) in particular if the build system supports the functionality which we would otherwise build & maintain on our own. I think there would be ways to make the transition not as disruptive as described. For example, one could keep the Maven build and the new build side by side until one is confident enough that the new build produces the same output as the Maven build. Maybe it would also be possible to migrate individual modules starting from the leaves. However, I admit that changing the build system will affect every Flink developer because she needs to learn & understand it. 6. I would like to learn about other people's experience with different CI systems. Travis worked okish for Flink so far but we see sometimes problems with its caching mechanism as Chesnay stated. I think that this topic is actually orthogonal to the other suggestions. My gut feeling is that not a single suggestion will be our solution but a combination of them. Cheers, Till On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu wrote: > Thanks Chesnay for bringing up this discussion and sharing those thoughts > to speed up the building process. 
> > I'd +1 for option 2 and 3. > > We can benefits a lot from Option 2. Developing table, connectors, > libraries, docs modules would result in much fewer tests(1/3 to 1/tens) to > run. > PRs for those modules take up more than half of all the PRs in my > observation. > > Option 3 can be a supplementary to option 2 that if the PR is modifying > fundamental modules like flink-core or flink-runtime. > It can even be a switch of the tests scope(basic/full) of a PR, so that > committers do not need to trigger it multiple times. > With it we can postpone the testing of IT cases or connectors before the PR > reaches a stable state. > > Thanks, > Zhu Zhu > > Chesnay Schepler 于2019年8月15日周四 下午3:38写道: > > > Hello everyone, > > > > improving our build times is a hot topic at the moment so let's discuss > > the different ways how they could be reduced. > > > > > > Current state: > > > > First up, let's look at some numbers: > > > > 1 full build currently consumes 5h of build time total ("total time"), > > and in the ideal case takes about 1h20m ("run time") to complete from > > start to finish. The run time may fluctuate of course depending on the > > current Travis load. This applies both to builds on the Apache and > > flink-ci Travis. > > > > At the time of writing, the current queue time for PR jobs (reminder: > > running on flink-ci) is about 30 minutes (which basically means that we > > are processing builds at
Re: [DISCUSS] Reducing build times
Thanks Chesnay for bringing up this discussion and sharing those thoughts to speed up the building process. I'd +1 for option 2 and 3. We can benefits a lot from Option 2. Developing table, connectors, libraries, docs modules would result in much fewer tests(1/3 to 1/tens) to run. PRs for those modules take up more than half of all the PRs in my observation. Option 3 can be a supplementary to option 2 that if the PR is modifying fundamental modules like flink-core or flink-runtime. It can even be a switch of the tests scope(basic/full) of a PR, so that committers do not need to trigger it multiple times. With it we can postpone the testing of IT cases or connectors before the PR reaches a stable state. Thanks, Zhu Zhu Chesnay Schepler 于2019年8月15日周四 下午3:38写道: > Hello everyone, > > improving our build times is a hot topic at the moment so let's discuss > the different ways how they could be reduced. > > > Current state: > > First up, let's look at some numbers: > > 1 full build currently consumes 5h of build time total ("total time"), > and in the ideal case takes about 1h20m ("run time") to complete from > start to finish. The run time may fluctuate of course depending on the > current Travis load. This applies both to builds on the Apache and > flink-ci Travis. > > At the time of writing, the current queue time for PR jobs (reminder: > running on flink-ci) is about 30 minutes (which basically means that we > are processing builds at the rate that they come in), however we are in > an admittedly quiet period right now. > 2 weeks ago the queue times on flink-ci peaked at around 5-6h as > everyone was scrambling to get their changes merged in time for the > feature freeze. > > (Note: Recently optimizations where added to ci-bot where pending builds > are canceled if a new commit was pushed to the PR or the PR was closed, > which should prove especially useful during the rush hours we see before > feature-freezes.) > > > Past approaches > > Over the years we have done rather few things to improve this situation > (hence our current predicament). > > Beyond the sporadic speedup of some tests, the only notable reduction in > total build times was the introduction of cron jobs, which consolidated > the per-commit matrix from 4 configurations (different scala/hadoop > versions) to 1. > > The separation into multiple build profiles was only a work-around for > the 50m limit on Travis. Running tests in parallel has the obvious > potential of reducing run time, but we're currently hitting a hard limit > since a few modules (flink-tests, flink-runtime, > flink-table-planner-blink) are so loaded with tests that they nearly > consume an entire profile by themselves (and thus no further splitting > is possible). > > The rework that introduced stages, at the time of introduction, did also > not provide a speed up, although this changed slightly once more > profiles were added and some optimizations to the caching have been made. > > Very recently we modified the surefire-plugin configuration for > flink-table-planner-blink to reuse JVM forks for IT cases, providing a > significant speedup (18 minutes!). So far we have not seen any negative > consequences. > > > Suggestions > > This is a list of /all /suggestions for reducing run/total times that I > have seen recently (in other words, they aren't necessarily mine nor may > I agree with all of them). > > 1. Enable JVM reuse for IT cases in more modules. 
> * We've seen significant speedups in the blink planner, and this > should be applicable for all modules. However, I presume there's > a reason why we disabled JVM reuse (information on this would be > appreciated) > 2. Custom differential build scripts > * Setup custom scripts for determining which modules might be > affected by change, and manipulate the splits accordingly. This > approach is conceptually quite straight-forward, but has limits > since it has to be pessimistic; i.e. a change in flink-core > _must_ result in testing all modules. > 3. Only run smoke tests when PR is opened, run heavy tests on demand. > * With the introduction of the ci-bot we now have significantly > more options on how to handle PR builds. One option could be to > only run basic tests when the PR is created (which may be only > modified modules, or all unit tests, or another low-cost > scheme), and then have a committer trigger other builds (full > test run, e2e tests, etc...) on demand. > 4. Move more tests into cron builds > * The budget version of 3); move certain tests that are either > expensive (like some runtime tests that take minutes) or in > rarely modified modules (like gelly) into cron jobs. > 5. Gradle > * Gradle was brought up a few times for it's built-in support for > differential builds; basically providing 2) without the overhead >
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi @Timo Walther @Dawid Wysakowicz: Now, flink-planner have some legacy DataTypes: like: legacy decimal, legacy basic array type info... And If the new type inference infer a Decimal/VarChar with precision, there should will fail in TypeConversions. The better we do on DataType, the more problems we will have with TypeInformation conversion, and the new TypeInference is a lot of precision related. What do you think? 1. Should TypeConversions support all data types and flink-planner support new types?2. Or do a special conversion between flink-planner and type inference. (There are many problems with the conversion between TypeInformation and DataType, and I think we should solve them completely in 1.10.) Best, Jingsong Lee -- From:JingsongLee Send Time:2019年8月15日(星期四) 10:31 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Hi jark: I'll add a chapter to list blink planner extended functions. Best, Jingsong Lee -- From:Jark Wu Send Time:2019年8月15日(星期四) 05:12 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding to the function behavior, shall we also include new functions from blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc..) ? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: > Hi Jingsong, > > thanks for writing down this FLIP. Big +1 from my side to finally get > rid of PlannerExpressions and have consistent and well-defined behavior > for Table API and SQL updated to FLIP-37. > > We might need to discuss some of the behavior of particular functions > but this should not affect the actual FLIP-51. > > Regards, > Timo > > > Am 13.08.19 um 12:55 schrieb JingsongLee: > > Hi everyone, > > > > We would like to start a discussion thread on "FLIP-51: Rework of the > > Expression Design"(Design doc: [1], FLIP: [2]), where we describe how > > to improve the new java Expressions to work with type inference and > > convert expression to the calcite RexNode. This is a follow-up plan > > for FLIP-32[3] and FLIP-37[4]. This FLIP is mostly based on FLIP-37. > > > > This FLIP addresses several shortcomings of current: > > - New Expressions still use PlannerExpressions to type inference and > > to RexNode. Flnk-planner and blink-planner have a lot of repetitive > code > > and logic. > > - Let TableApi and Cacite definitions consistent. > > - Reduce the complexity of Function development. > > - Powerful Function for user. > > > > Key changes can be summarized as follows: > > - Improve the interface of FunctionDefinition. > > - Introduce type inference for built-in functions. > > - Introduce ExpressionConverter to convert Expression to calcite > > RexNode. > > - Remove repetitive code and logic in planners. > > > > I also listed type inference and behavior of all built-in functions [5], > to > > verify that the interface is satisfied. After introduce type inference to > > table-common module, planners should have a unified function behavior. > > And this gives the community also the chance to quickly discuss types > > and behavior of functions a last time before they are declared stable. > > > > Looking forward to your feedbacks. Thank you. 
> > > > [1] > https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing > > [2] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design > > [3] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions > > [4] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System > > [5] > https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing > > > > Best, > > Jingsong Lee > > >
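To make the TypeConversions problem above concrete, here is a small, hedged illustration based on the 1.9 type utilities (not code from this thread); the exact failure mode may differ per type, but the point is that a DataType carrying precision has no faithful legacy TypeInformation counterpart.

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;
import org.apache.flink.table.types.utils.TypeConversions;

public class PrecisionConversionSketch {

    public static void main(String[] args) {
        // A DECIMAL with an explicit precision/scale, as the new type inference would produce it.
        DataType preciseDecimal = DataTypes.DECIMAL(10, 2);

        // Converting back to a legacy TypeInformation for the old flink-planner is where the
        // trouble starts: the legacy type system has nowhere to keep (10, 2), so this call is
        // expected to fail (or to drop the precision, depending on the type).
        TypeInformation<?> legacy = TypeConversions.fromDataTypeToLegacyInfo(preciseDecimal);
        System.out.println(legacy);
    }
}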
Re: [DISCUSS] FLIP-51: Rework of the Expression Design
Hi jark: I'll add a chapter to list blink planner extended functions. Best, Jingsong Lee -- From:Jark Wu Send Time:2019年8月15日(星期四) 05:12 To:dev Subject:Re: [DISCUSS] FLIP-51: Rework of the Expression Design Thanks Jingsong for starting the discussion. The general design of the FLIP looks good to me. +1 for the FLIP. It's time to get rid of the old Expression! Regarding to the function behavior, shall we also include new functions from blink planner (e.g. LISTAGG, REGEXP, TO_DATE, etc..) ? Best, Jark On Wed, 14 Aug 2019 at 23:34, Timo Walther wrote: > Hi Jingsong, > > thanks for writing down this FLIP. Big +1 from my side to finally get > rid of PlannerExpressions and have consistent and well-defined behavior > for Table API and SQL updated to FLIP-37. > > We might need to discuss some of the behavior of particular functions > but this should not affect the actual FLIP-51. > > Regards, > Timo > > > Am 13.08.19 um 12:55 schrieb JingsongLee: > > Hi everyone, > > > > We would like to start a discussion thread on "FLIP-51: Rework of the > > Expression Design"(Design doc: [1], FLIP: [2]), where we describe how > > to improve the new java Expressions to work with type inference and > > convert expression to the calcite RexNode. This is a follow-up plan > > for FLIP-32[3] and FLIP-37[4]. This FLIP is mostly based on FLIP-37. > > > > This FLIP addresses several shortcomings of current: > > - New Expressions still use PlannerExpressions to type inference and > > to RexNode. Flnk-planner and blink-planner have a lot of repetitive > code > > and logic. > > - Let TableApi and Cacite definitions consistent. > > - Reduce the complexity of Function development. > > - Powerful Function for user. > > > > Key changes can be summarized as follows: > > - Improve the interface of FunctionDefinition. > > - Introduce type inference for built-in functions. > > - Introduce ExpressionConverter to convert Expression to calcite > > RexNode. > > - Remove repetitive code and logic in planners. > > > > I also listed type inference and behavior of all built-in functions [5], > to > > verify that the interface is satisfied. After introduce type inference to > > table-common module, planners should have a unified function behavior. > > And this gives the community also the chance to quickly discuss types > > and behavior of functions a last time before they are declared stable. > > > > Looking forward to your feedbacks. Thank you. > > > > [1] > https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing > > [2] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design > > [3] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-32%3A+Restructure+flink-table+for+future+contributions > > [4] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System > > [5] > https://docs.google.com/document/d/1fyVmdGgbO1XmIyQ1BaoG_h5BcNcF3q9UJ1Bj1euO240/edit?usp=sharing > > > > Best, > > Jingsong Lee > > >
[jira] [Created] (FLINK-13734) Support DDL in SQL CLI
Jark Wu created FLINK-13734: --- Summary: Support DDL in SQL CLI Key: FLINK-13734 URL: https://issues.apache.org/jira/browse/FLINK-13734 Project: Flink Issue Type: New Feature Components: Table SQL / Client Reporter: Jark Wu We already support DDL in TableEnvironment. We should also support executing DDL in the SQL Client so that the feature is easier to use. However, this may require changes to the current architecture of the SQL Client. A more detailed design should be attached and discussed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
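For context, a hedged sketch of the DDL support that already exists in TableEnvironment (via sqlUpdate) and that this ticket proposes to expose in the SQL Client as well; the table definition and connector properties below are illustrative only.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DdlInTableEnvironmentSketch {

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Today a DDL statement like this is executed through sqlUpdate();
        // FLINK-13734 is about being able to type the same statement into the SQL CLI.
        tEnv.sqlUpdate(
                "CREATE TABLE orders (" +
                "  order_id BIGINT," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector.type' = 'filesystem'," +
                "  'connector.path' = 'file:///tmp/orders.csv'," +
                "  'format.type' = 'csv'" +
                ")");
    }
}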
Re: [DISCUSS] FLIP-52: Remove legacy Program interface.
+1 Thanks Kostas for pushing this. Thanks, Biao /'bɪ.aʊ/ On Thu, 15 Aug 2019 at 16:03, Kostas Kloudas wrote: > Thanks a lot for the quick response! > I will consider the Flink Accepted and will start working on it. > > Cheers, > Kostas > > On Thu, Aug 15, 2019 at 5:29 AM SHI Xiaogang > wrote: > > > > +1 > > > > Glad that programming with flink becomes simpler and easier. > > > > Regards, > > Xiaogang > > > > Aljoscha Krettek 于2019年8月14日周三 下午11:31写道: > > > > > +1 (for the same reasons I posted on the other thread) > > > > > > > On 14. Aug 2019, at 15:03, Zili Chen wrote: > > > > > > > > +1 > > > > > > > > It could be regarded as part of Flink client api refactor. > > > > Removal of stale code paths helps reason refactor. > > > > > > > > There is one thing worth attention that in this thread[1] Thomas > > > > suggests an interface with a method return JobGraph based on the > > > > fact that REST API and in per job mode actually extracts the JobGraph > > > > from user program and submit it instead of running user program and > > > > submission happens inside the program in session scenario. > > > > > > > > Such an interface would be like > > > > > > > > interface Program { > > > > JobGraph getJobGraph(args); > > > > } > > > > > > > > Anyway, the discussion above could be continued in that thread. > > > > Current Program is a legacy class that quite less useful than it > should > > > be. > > > > > > > > Best, > > > > tison. > > > > > > > > [1] > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/REST-API-JarRunHandler-More-flexibility-for-launching-jobs-td31026.html#a31168 > > > > > > > > > > > > Stephan Ewen 于2019年8月14日周三 下午7:50写道: > > > > > > > >> +1 > > > >> > > > >> the "main" method is the overwhelming default. getting rid of "two > ways > > > to > > > >> do things" is a good idea. > > > >> > > > >> On Wed, Aug 14, 2019 at 1:42 PM Kostas Kloudas > > > wrote: > > > >> > > > >>> Hi all, > > > >>> > > > >>> As discussed in [1] , the Program interface seems to be outdated > and > > > >>> there seems to be > > > >>> no objection to remove it. > > > >>> > > > >>> Given that this interface is PublicEvolving, its removal should > pass > > > >>> through a FLIP and > > > >>> this discussion and the associated FLIP are exactly for that > purpose. > > > >>> > > > >>> Please let me know what you think and if it is ok to proceed with > its > > > >>> removal. > > > >>> > > > >>> Cheers, > > > >>> Kostas > > > >>> > > > >>> link to FLIP-52 : > > > >>> > > > >> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125308637 > > > >>> > > > >>> [1] > > > >>> > > > >> > > > > https://lists.apache.org/x/thread.html/7ffc9936a384b891dbcf0a481d26c6d13b2125607c200577780d1e18@%3Cdev.flink.apache.org%3E > > > >>> > > > >> > > > > > > >
Re: [DISCUSS] FLIP-52: Remove legacy Program interface.
Thanks a lot for the quick response! I will consider the FLIP accepted and will start working on it. Cheers, Kostas On Thu, Aug 15, 2019 at 5:29 AM SHI Xiaogang wrote: > > +1 > > Glad that programming with flink becomes simpler and easier. > > Regards, > Xiaogang > > Aljoscha Krettek wrote on Wed, Aug 14, 2019 at 11:31 PM: > > +1 (for the same reasons I posted on the other thread) > > > > > On 14. Aug 2019, at 15:03, Zili Chen wrote: > > > > > > +1 > > > > > > It could be regarded as part of Flink client api refactor. > > > Removal of stale code paths helps reason refactor. > > > > > > There is one thing worth attention that in this thread[1] Thomas > > > suggests an interface with a method return JobGraph based on the > > > fact that REST API and in per job mode actually extracts the JobGraph > > > from user program and submit it instead of running user program and > > > submission happens inside the program in session scenario. > > > > > > Such an interface would be like > > > > > > interface Program { > > > JobGraph getJobGraph(args); > > > } > > > > > > Anyway, the discussion above could be continued in that thread. > > > Current Program is a legacy class that quite less useful than it should > > be. > > > > > > Best, > > > tison. > > > > > > [1] > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/REST-API-JarRunHandler-More-flexibility-for-launching-jobs-td31026.html#a31168 > > > > > > > > > Stephan Ewen wrote on Wed, Aug 14, 2019 at 7:50 PM: > > > > > >> +1 > > >> > > >> the "main" method is the overwhelming default. getting rid of "two ways > > to > > >> do things" is a good idea. > > >> > > >> On Wed, Aug 14, 2019 at 1:42 PM Kostas Kloudas > > wrote: > > >> > > >>> Hi all, > > >>> > > >>> As discussed in [1] , the Program interface seems to be outdated and > > >>> there seems to be > > >>> no objection to remove it. > > >>> > > >>> Given that this interface is PublicEvolving, its removal should pass > > >>> through a FLIP and > > >>> this discussion and the associated FLIP are exactly for that purpose. > > >>> > > >>> Please let me know what you think and if it is ok to proceed with its > > >>> removal. > > >>> > > >>> Cheers, > > >>> Kostas > > >>> > > >>> link to FLIP-52 : > > >>> > > >> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125308637 > > >>> > > >>> [1] > > >>> > > >> > > https://lists.apache.org/x/thread.html/7ffc9936a384b891dbcf0a481d26c6d13b2125607c200577780d1e18@%3Cdev.flink.apache.org%3E > > >>> > > >> > > > >
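As a concrete (made-up) illustration of the two entry points discussed in this thread: the legacy org.apache.flink.api.common.Program interface that FLIP-52 removes, versus the plain main() method that remains the default way to define a job.

import org.apache.flink.api.java.ExecutionEnvironment;

// The conventional entry point that stays after FLIP-52: the client simply invokes main(),
// the program builds its plan there and calls execute()/print(); no Program interface needed.
public class MainMethodJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("to", "be", "or", "not", "to", "be").print();
    }
}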
Re: Checkpointing under backpressure
Hi, Very thanks for the great points! For the prioritizing inputs, from another point of view, I think it might not cause other bad effects, since we do not need to totally block the channels that have seen barriers after the operator has taking snapshot. After the snapshotting, if the channels that has not seen barriers have buffers, we could first logging and processing these buffers and if they do not have buffers, we can still processing the buffers from the channels that has seen barriers. Therefore, It seems prioritizing inputs should be able to accelerate the checkpoint without other bad effects. and @zhijiangFor making the unaligned checkpoint the only mechanism for all cases, I still think we should allow a configurable timeout after receiving the first barrier so that the channels may get "drained" during the timeout, as pointed out by Stephan. With such a timeout, we are very likely not need to snapshot the input buffers, which would be very similar to the current aligned checkpoint mechanism. Best, Yun -- From:zhijiang Send Time:2019 Aug. 15 (Thu.) 02:22 To:dev Subject:Re: Checkpointing under backpressure > For the checkpoint to complete, any buffer that > arrived prior to the barrier would be to be part of the checkpointed state. Yes, I agree. > So wouldn't it be important to finish persisting these buffers as fast as > possible by prioritizing respective inputs? The task won't be able to > process records from the inputs that have seen the barrier fast when it is > already backpressured (or causing the backpressure). My previous understanding of prioritizing inputs is from task processing aspect after snapshot state. If from the persisting buffers aspect, I think it might be up to how we implement it. If we only tag/reference which buffers in inputs be the part of state, and make the real persisting work is done in async way. That means the already tagged buffers could be processed by task w/o priority. And only after all the persisting work done, the task would report to coordinator of finished checkpoint on its side. The key point is how we implement to make task could continue processing buffers as soon as possible. Thanks for the further explannation of requirements for speeding up checkpoints in backpressure scenario. To make the savepoint finish quickly and then tune the setting to avoid backpressure is really a pratical case. I think this solution could cover this concern. Best, Zhijiang -- From:Thomas Weise Send Time:2019年8月14日(星期三) 19:48 To:dev ; zhijiang Subject:Re: Checkpointing under backpressure --> On Wed, Aug 14, 2019 at 10:23 AM zhijiang wrote: > Thanks for these great points and disccusions! > > 1. Considering the way of triggering checkpoint RPC calls to all the tasks > from Chandy Lamport, it combines two different mechanisms together to make > sure that the trigger could be fast in different scenarios. > But in flink world it might be not very worth trying that way, just as > Stephan's analysis for it. Another concern is that it might bring more > heavy loads for JobMaster broadcasting this checkpoint RPC to all the tasks > in large scale job, especially for the very short checkpoint interval. > Furthermore it would also cause other important RPC to be executed delay to > bring potentail timeout risks. > > 2. I agree with the idea of drawing on the way "take state snapshot on > first barrier" from Chandy Lamport instead of barrier alignment combining > with unaligned checkpoints in flink. 
> > > The benefit would be less latency increase in the channels which > already have received barriers. > > However, as mentioned before, not prioritizing the inputs from > which barriers are still missing can also have an adverse effect. > > I think we will not have an adverse effect if not prioritizing the inputs > w/o barriers in this case. After sync snapshot, the task could actually > process any input channels. For the input channel receiving the first > barrier, we already have the obvious boundary for persisting buffers. For > other channels w/o barriers we could persist the following buffers for > these channels until barrier arrives in network. Because based on the > credit based flow control, the barrier does not need credit to transport, > then as long as the sender overtakes the barrier accross the output queue, > the network stack would transport this barrier immediately no matter with > the inputs condition on receiver side. So there is no requirements to > consume accumulated buffers in these channels for higher priority. If so it > seems that we will not waste any CPU cycles as Piotr concerns before. > I'm not sure I follow this. For the checkpoint to complete, any buffer that arrived prior to the barrier would be to be part of the checkpointed state. So wouldn't it be important to
[DISCUSS] Reducing build times
Hello everyone, improving our build times is a hot topic at the moment so let's discuss the different ways how they could be reduced. Current state: First up, let's look at some numbers: 1 full build currently consumes 5h of build time total ("total time"), and in the ideal case takes about 1h20m ("run time") to complete from start to finish. The run time may fluctuate of course depending on the current Travis load. This applies both to builds on the Apache and flink-ci Travis. At the time of writing, the current queue time for PR jobs (reminder: running on flink-ci) is about 30 minutes (which basically means that we are processing builds at the rate that they come in), however we are in an admittedly quiet period right now. 2 weeks ago the queue times on flink-ci peaked at around 5-6h as everyone was scrambling to get their changes merged in time for the feature freeze. (Note: Recently optimizations where added to ci-bot where pending builds are canceled if a new commit was pushed to the PR or the PR was closed, which should prove especially useful during the rush hours we see before feature-freezes.) Past approaches Over the years we have done rather few things to improve this situation (hence our current predicament). Beyond the sporadic speedup of some tests, the only notable reduction in total build times was the introduction of cron jobs, which consolidated the per-commit matrix from 4 configurations (different scala/hadoop versions) to 1. The separation into multiple build profiles was only a work-around for the 50m limit on Travis. Running tests in parallel has the obvious potential of reducing run time, but we're currently hitting a hard limit since a few modules (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded with tests that they nearly consume an entire profile by themselves (and thus no further splitting is possible). The rework that introduced stages, at the time of introduction, did also not provide a speed up, although this changed slightly once more profiles were added and some optimizations to the caching have been made. Very recently we modified the surefire-plugin configuration for flink-table-planner-blink to reuse JVM forks for IT cases, providing a significant speedup (18 minutes!). So far we have not seen any negative consequences. Suggestions This is a list of /all /suggestions for reducing run/total times that I have seen recently (in other words, they aren't necessarily mine nor may I agree with all of them). 1. Enable JVM reuse for IT cases in more modules. * We've seen significant speedups in the blink planner, and this should be applicable for all modules. However, I presume there's a reason why we disabled JVM reuse (information on this would be appreciated) 2. Custom differential build scripts * Setup custom scripts for determining which modules might be affected by change, and manipulate the splits accordingly. This approach is conceptually quite straight-forward, but has limits since it has to be pessimistic; i.e. a change in flink-core _must_ result in testing all modules. 3. Only run smoke tests when PR is opened, run heavy tests on demand. * With the introduction of the ci-bot we now have significantly more options on how to handle PR builds. One option could be to only run basic tests when the PR is created (which may be only modified modules, or all unit tests, or another low-cost scheme), and then have a committer trigger other builds (full test run, e2e tests, etc...) on demand. 4. 
Move more tests into cron builds * The budget version of 3); move certain tests that are either expensive (like some runtime tests that take minutes) or in rarely modified modules (like gelly) into cron jobs. 5. Gradle * Gradle was brought up a few times for it's built-in support for differential builds; basically providing 2) without the overhead of maintaining additional scripts. * To date no PoC was provided that shows it working in our CI environment (i.e., handling splits & caching etc). * This is the most disruptive change by a fair margin, as it would affect the entire project, developers and potentially users (f they build from source). 6. CI service * Our current artifact caching setup on Travis is basically a hack; we're basically abusing the Travis cache, which is meant for long-term caching, to ship build artifacts across jobs. It's brittle at times due to timing/visibility issues and on branches the cleanup processes can interfere with running builds. It is also not as effective as it could be. * There are CI services that provide build artifact caching out of the box, which could be useful for us. * To date, no PoC for using another CI service has been pr
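For reference, a hedged sketch of what the fork-reuse change from suggestion 1 looks like in a module's pom.xml; the values actually used for flink-table-planner-blink may differ.

<!-- Illustrative only: let surefire reuse one forked JVM for a module's (IT) tests
     instead of paying the JVM start-up cost for every test class. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <forkCount>1</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>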
Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer
Congrats Andrey! Am Do., 15. Aug. 2019 um 07:58 Uhr schrieb Gary Yao : > Congratulations Andrey, well deserved! > > Best, > Gary > > On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > > > Congratulations Andrey! > > > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong wrote: > > > >> Congratulations Andrey! > >> > >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok wrote: > >> > >> > Congratulations Andrey! > >> > At 2019-08-14 21:26:37, "Till Rohrmann" wrote: > >> > >Hi everyone, > >> > > > >> > >I'm very happy to announce that Andrey Zagrebin accepted the offer of > >> the > >> > >Flink PMC to become a committer of the Flink project. > >> > > > >> > >Andrey has been an active community member for more than 15 months. > He > >> has > >> > >helped shaping numerous features such as State TTL, FRocksDB release, > >> > >Shuffle service abstraction, FLIP-1, result partition management and > >> > >various fixes/improvements. He's also frequently helping out on the > >> > >user@f.a.o mailing lists. > >> > > > >> > >Congratulations Andrey! > >> > > > >> > >Best, Till > >> > >(on behalf of the Flink PMC) > >> > > >> > > >
Re: Checkpointing under backpressure
@Thomas just to double check: - parallelism and configuration changes should be well possible on unaligned checkpoints - changes in state types and JobGraph structure would be tricky, and changing the on-the-wire types would not be possible. On Wed, Aug 14, 2019 at 7:48 PM Thomas Weise wrote: > --> > > On Wed, Aug 14, 2019 at 10:23 AM zhijiang > wrote: > > > Thanks for these great points and disccusions! > > > > 1. Considering the way of triggering checkpoint RPC calls to all the > tasks > > from Chandy Lamport, it combines two different mechanisms together to > make > > sure that the trigger could be fast in different scenarios. > > But in flink world it might be not very worth trying that way, just as > > Stephan's analysis for it. Another concern is that it might bring more > > heavy loads for JobMaster broadcasting this checkpoint RPC to all the > tasks > > in large scale job, especially for the very short checkpoint interval. > > Furthermore it would also cause other important RPC to be executed delay > to > > bring potentail timeout risks. > > > > 2. I agree with the idea of drawing on the way "take state snapshot on > > first barrier" from Chandy Lamport instead of barrier alignment combining > > with unaligned checkpoints in flink. > > > > > The benefit would be less latency increase in the channels which > > already have received barriers. > > > However, as mentioned before, not prioritizing the inputs from > > which barriers are still missing can also have an adverse effect. > > > > I think we will not have an adverse effect if not prioritizing the inputs > > w/o barriers in this case. After sync snapshot, the task could actually > > process any input channels. For the input channel receiving the first > > barrier, we already have the obvious boundary for persisting buffers. For > > other channels w/o barriers we could persist the following buffers for > > these channels until barrier arrives in network. Because based on the > > credit based flow control, the barrier does not need credit to transport, > > then as long as the sender overtakes the barrier accross the output > queue, > > the network stack would transport this barrier immediately no matter with > > the inputs condition on receiver side. So there is no requirements to > > consume accumulated buffers in these channels for higher priority. If so > it > > seems that we will not waste any CPU cycles as Piotr concerns before. > > > > I'm not sure I follow this. For the checkpoint to complete, any buffer that > arrived prior to the barrier would be to be part of the checkpointed state. > So wouldn't it be important to finish persisting these buffers as fast as > possible by prioritizing respective inputs? The task won't be able to > process records from the inputs that have seen the barrier fast when it is > already backpressured (or causing the backpressure). > > > > > > 3. Suppose the unaligned checkpoints performing well in practice, is it > > possible to make it as the only mechanism for handling all the cases? I > > mean for the non-backpressure scenario, there are less buffers even empty > > in input/output queue, then the "overtaking barrier--> trigger snapshot > on > > first barrier--> persist buffers" might still work well. So we do not > need > > to maintain two suits of mechanisms finally. > > > > 4. The initial motivation of this dicussion is for checkpoint timeout in > > backpressure scenario. 
If we adjust the default timeout to a very big > > value, that means the checkpoint would never timeout and we only need to > > wait it finish. Then are there still any other problems/concerns if > > checkpoint takes long time to finish? Althougn we already knew some > issues > > before, it is better to gather more user feedbacks to confirm which > aspects > > could be solved in this feature design. E.g. the sink commit delay might > > not be coverd by unaligned solution. > > > > Checkpoints taking too long is the concern that sparks this discussion > (timeout is just a symptom). The slowness issue also applies to the > savepoint use case. We would need to be able to take a savepoint fast in > order to roll forward a fix that can alleviate the backpressure (like > changing parallelism or making a different configuration change). > > > > > > Best, > > Zhijiang > > -- > > From:Stephan Ewen > > Send Time:2019年8月14日(星期三) 17:43 > > To:dev > > Subject:Re: Checkpointing under backpressure > > > > Quick note: The current implementation is > > > > Align -> Forward -> Sync Snapshot Part (-> Async Snapshot Part) > > > > On Wed, Aug 14, 2019 at 5:21 PM Piotr Nowojski > > wrote: > > > > > > Thanks for the great ideas so far. > > > > > > +1 > > > > > > Regarding other things raised, I mostly agree with Stephan. > > > > > > I like the idea of simultaneously starting the checkpoint everywhere > via > > > RPC call (especially in cases where Tasks are busy doing some > s
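One recurring point above is raising the checkpoint timeout instead of letting checkpoints expire under backpressure; a hedged sketch of where that knob lives in the DataStream API follows (the interval, timeout, and job are illustrative only).

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every minute...
        env.enableCheckpointing(60_000L);
        // ...and give it much longer than the 10-minute default to complete, so that a
        // backpressured job does not keep failing checkpoints with "expired before completing".
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60_000L);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-timeout-sketch");
    }
}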