[jira] [Created] (FLINK-17794) Tear down installed software in reverse order in Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-17794:


 Summary: Tear down installed software in reverse order in Jepsen 
Tests
 Key: FLINK-17794
 URL: https://issues.apache.org/jira/browse/FLINK-17794
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.1, 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Tear down installed software in reverse order in Jepsen Tests. This mitigates 
the issue that sometimes Hadoop's NodeManager directories cannot be removed 
with {{rm -rf}}: Flink processes can keep running and generate files after 
the YARN NodeManager is shut down. {{rm -r}} removes files recursively, but if 
files are created concurrently in the background, the command can still fail 
with a non-zero exit code.

{noformat}
sh -c \"cd /; rm -rf /opt/hadoop\"", :exit 1, :out "", :err "rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-3587fdbb-15be-4482-94f2-338bfe6b1acc/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_5271c210329e73bd743f3227edfb3b71__27_30__uuid_02dbbf1e-d2d5-43e8-ab34-040345f96476/db': Directory not empty
rm: cannot remove '/opt/hadoop/tmp/nm-local-dir/usercache/root/appcache/application_1587567275082_0001/flink-io-d14f2078-74ee-4b8b-aafe-4299577f214f/job_77be6dd9f1b2aa218348e8b8a2512660_op_StreamMap_7d23c6ceabda05a587f0217e44f21301__17_30__uuid_2de2b67d-0767-4e32-99f0-ddd291460947/db': Directory not empty"
{noformat}
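The underlying failure mode can be demonstrated in isolation: deleting a directory fails whenever another process has written a file into it first, which is exactly what makes `rm -rf` return a non-zero exit code here. A minimal Java sketch (the file names are illustrative, not the actual NodeManager layout):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeleteRaceDemo {
    /** Attempts to delete a directory; returns false when it is (still)
     *  non-empty -- the same condition that makes `rm -rf` exit non-zero. */
    static boolean tryDeleteDir(Path dir) {
        try {
            Files.delete(dir);
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sets up the failure mode: a file appears in the directory (as if a
     *  leftover Flink process wrote a spill file) before the delete runs. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("nm-local-dir");
            Files.createFile(dir.resolve("flink-io-spill"));
            return tryDeleteDir(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // false: the directory gained a file before the delete
    }
}
```

Tearing down Flink before the NodeManager (reverse install order) removes the background writer before the directory tree is deleted.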



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17792) Failing to invoke jstack on TM processes should not fail Jepsen Tests

2020-05-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-17792:


 Summary: Failing to invoke jstack on TM processes should not 
fail Jepsen Tests
 Key: FLINK-17792
 URL: https://issues.apache.org/jira/browse/FLINK-17792
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.1, 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


{{jstack}} can fail if the JVM process exits prematurely before or while we 
invoke {{jstack}}. If {{jstack}} fails, the exception propagates and 
terminates the Jepsen tests prematurely.
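The actual fix lives in flink-jepsen (Clojure); as a language-neutral sketch, the "best effort" pattern looks like this in Java — run {{jstack}}, report its exit code, and swallow failures instead of letting them propagate (helper names are hypothetical):

```java
import java.io.IOException;
import java.util.Optional;

public class SafeJstack {
    /** Runs `jstack <pid>`, returning the exit code, or empty when the
     *  command could not be run at all. Nothing here throws: taking a
     *  thread dump is best-effort diagnostics and must not abort the run. */
    static Optional<Integer> tryJstack(long pid) {
        try {
            Process p = new ProcessBuilder("jstack", Long.toString(pid))
                    .redirectErrorStream(true)
                    .start();
            p.getInputStream().readAllBytes();  // drain combined output
            return Optional.of(p.waitFor());    // non-zero exit is reported, not thrown
        } catch (IOException e) {
            return Optional.empty();            // e.g. jstack binary not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A PID that does not exist: jstack exits non-zero, the caller continues.
        System.out.println(tryJstack(999_999_999L));
    }
}
```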






[jira] [Created] (FLINK-17777) Make Mesos Jepsen Tests pass with Hadoop-free Flink

2020-05-17 Thread Gary Yao (Jira)
Gary Yao created FLINK-17777:


 Summary: Make Mesos Jepsen Tests pass with Hadoop-free Flink 
 Key: FLINK-17777
 URL: https://issues.apache.org/jira/browse/FLINK-17777
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao








[jira] [Created] (FLINK-17687) Collect TaskManager logs in Mesos Jepsen Tests

2020-05-14 Thread Gary Yao (Jira)
Gary Yao created FLINK-17687:


 Summary: Collect TaskManager logs in Mesos Jepsen Tests
 Key: FLINK-17687
 URL: https://issues.apache.org/jira/browse/FLINK-17687
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0








[jira] [Created] (FLINK-17621) Use default akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)
Gary Yao created FLINK-17621:


 Summary: Use default akka.ask.timeout in TPC-DS e2e test
 Key: FLINK-17621
 URL: https://issues.apache.org/jira/browse/FLINK-17621
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.11.0
Reporter: Gary Yao


Revert the changes in FLINK-17616





[jira] [Created] (FLINK-17616) Temporarily increase akka.ask.timeout in TPC-DS e2e test

2020-05-11 Thread Gary Yao (Jira)
Gary Yao created FLINK-17616:


 Summary: Temporarily increase akka.ask.timeout in TPC-DS e2e test
 Key: FLINK-17616
 URL: https://issues.apache.org/jira/browse/FLINK-17616
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0








[jira] [Created] (FLINK-17558) Partitions are released in TaskExecutor Main Thread

2020-05-07 Thread Gary Yao (Jira)
Gary Yao created FLINK-17558:


 Summary: Partitions are released in TaskExecutor Main Thread
 Key: FLINK-17558
 URL: https://issues.apache.org/jira/browse/FLINK-17558
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.11.0
Reporter: Gary Yao
 Fix For: 1.11.0


Partitions are released in the main thread of the TaskExecutor (see the 
stack trace below). Because deleting files is blocking I/O, this can lead to 
missed heartbeats, RPC timeouts, etc. The partitions should instead be 
released in a dedicated I/O thread pool ({{TaskExecutor#ioExecutor}} is a 
candidate).
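A minimal sketch of the proposed direction: hand the blocking deletion to a dedicated executor so the RPC main thread returns immediately. Class and method names here are illustrative stand-ins, not Flink's actual API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class PartitionReleaseOffload {
    private final ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
    final AtomicBoolean released = new AtomicBoolean();

    /** Called from the RPC main thread: hand the blocking file deletion to
     *  the I/O pool instead of doing it inline, so heartbeats keep flowing. */
    void releasePartitions() {
        ioExecutor.execute(this::deletePartitionFilesBlocking);
    }

    private void deletePartitionFilesBlocking() {
        // stand-in for Files.delete(...) on the partition's spill files
        released.set(true);
    }

    void shutdown() {
        ioExecutor.shutdown();
        try {
            ioExecutor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        PartitionReleaseOffload te = new PartitionReleaseOffload();
        te.releasePartitions();   // returns immediately; main thread stays responsive
        te.shutdown();
        System.out.println(te.released.get());
    }
}
```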

{noformat}
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 prio=5 os_prio=0 tid=0x7f7fcc071000 nid=0x1f3f9 runnable [0x7f7fd302c000]
2020-05-06T19:13:12.4383983Z    java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z    at sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z    at sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z    at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z    at org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z    at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z    at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z    - locked <0xff836d78> (a java.lang.Object)
2020-05-06T19:13:12.4389905Z    at org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z    at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z    - locked <0x9d452b90> (a java.util.HashMap)
2020-05-06T19:13:12.4391597Z    at org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z    at org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z    at org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-05-06T19:13:12.4394257Z    at java.lang.reflect.Method.invoke(Method.java:498)
2020-05-06T19:13:12.4394693Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
2020-05-06T19:13:12.4395202Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
2020-05-06T19:13:12.4395686Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-06T19:13:12.4396165Z    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown Source)
2020-05-06T19:13:12.4396606Z    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2020-05-06T19:13:12.4397015Z    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2020-05-06T19:13:12.4397447Z    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2020-05-06T19:13:12.4397874Z    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2020-05-06T19:13:12.4398414Z    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2020-05-06T19:13:12.4398879Z    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399321Z    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399737Z    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2020-05-06T19:13:12.4400138Z    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2020-05-06T19:13:12.4400552Z    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2020-05-06T19:13:12.4400930Z    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
2020-05-06T19:13:12.4401390Z    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2020-05-06T19:13:12.4401763Z    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2020-0
{noformat}

[jira] [Created] (FLINK-17522) Document flink-jepsen Command Line Options

2020-05-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-17522:


 Summary: Document flink-jepsen Command Line Options
 Key: FLINK-17522
 URL: https://issues.apache.org/jira/browse/FLINK-17522
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Reporter: Gary Yao
Assignee: Gary Yao








[jira] [Created] (FLINK-17501) Improve logging in AbstractServerHandler#channelRead(ChannelHandlerContext, Object)

2020-05-04 Thread Gary Yao (Jira)
Gary Yao created FLINK-17501:


 Summary: Improve logging in 
AbstractServerHandler#channelRead(ChannelHandlerContext, Object)
 Key: FLINK-17501
 URL: https://issues.apache.org/jira/browse/FLINK-17501
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Queryable State
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Improve logging in {{AbstractServerHandler#channelRead(ChannelHandlerContext, 
Object)}}. If an Error is thrown, it should be logged as early as possible. 
Currently we try to serialize and send an error response to the client before 
logging the error; this can fail and mask the original exception.
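The proposed ordering can be sketched as follows. This is a simplified stand-in for the Netty handler, not the real {{AbstractServerHandler}} API; names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class LogFirstHandler {
    static final List<String> LOG = new ArrayList<>();

    /** Log the original error before attempting the (fallible) error
     *  response, so a failure on the response path cannot mask the cause. */
    static void handleError(Throwable cause, Runnable sendErrorResponse) {
        LOG.add("error: " + cause.getMessage());   // log as early as possible
        try {
            sendErrorResponse.run();               // serializing the response may fail too
        } catch (RuntimeException responseFailure) {
            LOG.add("error response failed: " + responseFailure.getMessage());
        }
    }

    public static void main(String[] args) {
        handleError(new IllegalStateException("original failure"),
                () -> { throw new RuntimeException("serializer broke"); });
        System.out.println(LOG);   // the original failure is recorded first
    }
}
```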





[jira] [Created] (FLINK-17473) Remove unused classes ArchivedExecutionVertexBuilder and ArchivedExecutionJobVertexBuilder

2020-04-30 Thread Gary Yao (Jira)
Gary Yao created FLINK-17473:


 Summary: Remove unused classes ArchivedExecutionVertexBuilder and 
ArchivedExecutionJobVertexBuilder
 Key: FLINK-17473
 URL: https://issues.apache.org/jira/browse/FLINK-17473
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Remove unused classes {{ArchivedExecutionVertexBuilder}} and 
{{ArchivedExecutionJobVertexBuilder}}





[jira] [Created] (FLINK-17266) WorkerResourceSpec is not serializable

2020-04-20 Thread Gary Yao (Jira)
Gary Yao created FLINK-17266:


 Summary: WorkerResourceSpec is not serializable
 Key: FLINK-17266
 URL: https://issues.apache.org/jira/browse/FLINK-17266
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Mesos
Affects Versions: 1.11.0
Reporter: Gary Yao
 Fix For: 1.11.0


{{MesosResourceManager}} cannot acquire new resources due to 
{{WorkerResourceSpec}} not being serializable.

{code}
Caused by: java.lang.Exception: Could not open output stream for state backend
    at org.apache.flink.runtime.zookeeper.filesystem.FileSystemStateStorageHelper.store(FileSystemStateStorageHelper.java:70) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:136) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore.putWorker(ZooKeeperMesosWorkerStore.java:216) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.mesos.runtime.clusterframework.MesosResourceManager.startNewWorker(MesosResourceManager.java:441) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    ... 34 more
Caused by: java.io.NotSerializableException: org.apache.flink.runtime.resourcemanager.WorkerResourceSpec
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) ~[?:1.8.0_242]
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) ~[?:1.8.0_242]
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) ~[?:1.8.0_242]
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) ~[?:1.8.0_242]
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) ~[?:1.8.0_242]
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) ~[?:1.8.0_242]
    at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:594) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.runtime.zookeeper.filesystem.FileSystemStateStorageHelper.store(FileSystemStateStorageHelper.java:62) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:136) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.mesos.runtime.clusterframework.store.ZooKeeperMesosWorkerStore.putWorker(ZooKeeperMesosWorkerStore.java:216) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    at org.apache.flink.mesos.runtime.clusterframework.MesosResourceManager.startNewWorker(MesosResourceManager.java:441) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
    ... 34 more
{code}
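Java serialization requires the written type to implement {{java.io.Serializable}}; a self-contained demonstration of the failure mode using a hypothetical stand-in class (whether implementing Serializable is the fix actually chosen for FLINK-17266 is not stated here):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializabilityCheck {
    // Mirrors the bug: a spec object without Serializable cannot be written
    // by ObjectOutputStream (hypothetical stand-in for WorkerResourceSpec).
    static class NonSerializableSpec { int cpuCores = 1; }

    static class SerializableSpec implements Serializable {
        private static final long serialVersionUID = 1L;
        int cpuCores = 1;
    }

    /** Returns true iff the object survives Java serialization. */
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new NonSerializableSpec())); // false
        System.out.println(isSerializable(new SerializableSpec()));    // true
    }
}
```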





Re: [DISCUSS] Switch to Azure Pipelines as the primary CI tool / switch off Travis

2020-04-17 Thread Gary Yao
I am in favour of decommissioning Travis.

Moreover, I wanted to use this thread to raise another issue with Travis
that I have discovered recently; many of the builds running in my private
Travis account are timing out in the compilation stage (i.e., compilation
takes more than 50 minutes). This means that I am not able to reliably run
a full build on a CI server without creating a pull request. If other
developers also experience this issue, it would speak for putting more
effort into making Azure Pipelines the project-wide default.

Best,
Gary

On Thu, Mar 26, 2020 at 12:26 PM Yu Li  wrote:

> Thanks for the clarification Robert.
>
> Since the first step plan is to replace the travis PR runs, I checked all
> PR builds from 2020-01-01 (PR#10735-11526) [1], and below is the result:
>
> * Travis FAILURE: 298
> * Travis SUCCESS: 649 (68.5%)
> * Azure FAILURE: 420
> * Azure SUCCESS: 571 (57.6%)
>
> Since the patch for each run is equivalent for Travis and Azure, there
> seems to be slightly higher failure rate (~10%) when running in Azure.
>
> However, with the just-merged fix for uploading logs (FLINK-16480), I
> believe the success rate of Azure could compete with Travis now (uploading
> files contribute to 20% of the failures according to the report [2]).
>
> So I'm +1 to disable travis runs according to the numbers.
>
> Best Regards,
> Yu
>
> [1]
> https://github.com/apache/flink/pulls?q=is%3Apr+created%3A%3E%3D2020-01-01
> [2]
>
> https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=4
>
> On Thu, 26 Mar 2020 at 03:28, Robert Metzger  wrote:
>
> > Thank you for your responses.
> >
> > @Yu Li: In the current master, the log upload always fails, if the e2e
> job
> > failed. I just merged a PR that fixes this issue [1]. The problem was not
> > really the network stability, rather a problem with the interaction of
> the
> > jobs in the pipeline (the e2e job did not set the right variables for the
> > log upload)
> > Secondly, you are looking at the report of the "flink-ci.flink" pipeline,
> > where pull requests are built. Naturally, pull request builds fail all the
> > time, because the PRs are not yet perfect.
> >
> > "flink-ci.flink-master" is the right pipeline to look at:
> >
> >
> https://dev.azure.com/rmetzger/Flink/_pipeline/analytics/stageawareoutcome?definitionId=8=build
> > We have a fairly high number of failures there, because we currently have
> > some issues downloading the maven artifacts [2]. I'm working already with
> > Chesnay on merging a fix for that.
> >
> >
> > [1]
> >
> >
> https://github.com/apache/flink/commit/1c86b8b9dd05615a3b2600984db738a9bf388259
> > [2]https://issues.apache.org/jira/browse/FLINK-16720
> >
> >
> >
> > On Wed, Mar 25, 2020 at 4:48 PM Chesnay Schepler 
> > wrote:
> >
> > > The easiest way to disable travis for pushes is to remove all builds
> > > from the .travis.yml with a push/pr condition.
> > >
> > > On 25/03/2020 15:03, Robert Metzger wrote:
> > > > Thank you for the feedback so far.
> > > >
> > > > Responses to the items Chesnay raised:
> > > >
> > > > - by virtue of maintaining the past 2 releases we will have to
> maintain
> > > any
> > > >> Travis infrastructure as long as 1.10 is supported, i.e., until 1.12
> > > >>
> > > > Okay. I wasn't sure about the exact policy there.
> > > >
> > > >
> > > >> - the azure setup doesn't appear to be equivalent yet since the java
> > e2e
> > > >> profile isn't setting the hadoop switch (-Pe2e-hadoop), as a result
> of
> > > >> which SQLClientKafkaITCase isn't run
> > > >>
> > > > I filed a ticket to address this:
> > > > https://issues.apache.org/jira/browse/FLINK-16778
> > > >
> > > >
> > > >> - the nightly scripts still seems to be using a maven version other
> > than
> > > >> 3.2.5; from today on master:
> > > >> 2020-03-25T05:31:52.7412964Z [INFO] <
> > > >> org.apache.flink:flink-end-to-end-tests-common-kafka >
> > > >> 2020-03-25T05:31:52.7413854Z [INFO] Building
> > > >> flink-end-to-end-tests-common-kafka 1.11-SNAPSHOT [39/46]
> > > >> 2020-03-25T05:31:52.7414689Z [INFO]
> [
> > > jar
> > > >> ]-
> > > >> 2020-03-25T05:31:52.7518360Z [INFO]
> > > >> 2020-03-25T05:31:52.7519770Z [INFO] ---
> > > maven-checkstyle-plugin:2.17:check
> > > >> (validate) @ flink-end-to-end-tests-common-kafka ---
> > > >>
> > > > I'm planning to address this as part of
> > > > https://issues.apache.org/jira/browse/FLINK-16411, where I work on
> > > > centralizing all mvn invocations.
> > > >
> > > >
> > > >> - there is no real benefit in retiring the travis support in CiBot;
> > the
> > > >> important part is whether Travis is run or not for pull requests.
> > > >>  From what I can tell though azure seems to be working fine for pull
> > > >> requests, so +1 to at least disable the travis PR runs.
> > > >
> > > > So we disable Travis for https://github.com/flink-ci/flink ? I will
> do
> > > it
> > > > once there are no 

[jira] [Created] (FLINK-17181) Simplify generic Types in Topology Interface

2020-04-16 Thread Gary Yao (Jira)
Gary Yao created FLINK-17181:


 Summary: Simplify generic Types in Topology Interface
 Key: FLINK-17181
 URL: https://issues.apache.org/jira/browse/FLINK-17181
 Project: Flink
  Issue Type: Sub-task
Reporter: Gary Yao
Assignee: Gary Yao








[jira] [Created] (FLINK-17180) Implement PipelinedRegion interface for SchedulingTopology

2020-04-16 Thread Gary Yao (Jira)
Gary Yao created FLINK-17180:


 Summary: Implement PipelinedRegion interface for SchedulingTopology
 Key: FLINK-17180
 URL: https://issues.apache.org/jira/browse/FLINK-17180
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Gary Yao








[jira] [Created] (FLINK-17172) Re-enable debug level logging in Jepsen Tests

2020-04-15 Thread Gary Yao (Jira)
Gary Yao created FLINK-17172:


 Summary: Re-enable debug level logging in Jepsen Tests
 Key: FLINK-17172
 URL: https://issues.apache.org/jira/browse/FLINK-17172
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Since log4j2 was enabled, logs in Jepsen tests are at INFO level. We should 
re-enable debug level logging in the Jepsen tests. 
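A minimal log4j2 properties fragment that would restore DEBUG logging; the appender name, file location, and layout here are assumptions for illustration, not the actual flink-jepsen configuration:

```properties
rootLogger.level = DEBUG
rootLogger.appenderRef.main.ref = MainAppender

appender.main.name = MainAppender
appender.main.type = File
appender.main.fileName = ${sys:log.file}
appender.main.layout.type = PatternLayout
appender.main.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```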





[jira] [Created] (FLINK-17050) Remove methods getVertex() and getResultPartition() from SchedulingTopology

2020-04-08 Thread Gary Yao (Jira)
Gary Yao created FLINK-17050:


 Summary: Remove methods getVertex() and getResultPartition() from 
SchedulingTopology
 Key: FLINK-17050
 URL: https://issues.apache.org/jira/browse/FLINK-17050
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Remove the following methods (they are only called from tests):
# {{Optional getVertex(ExecutionVertexID)}}
# {{Optional getResultPartition(IntermediateResultPartitionID)}}

and replace them with the following new methods:
# {{V getVertex(ExecutionVertexID)}}
# {{V getResultPartition(IntermediateResultPartitionID)}}

The new methods should throw {{IllegalArgumentException}} if no matching 
vertex or result partition can be found.
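A sketch of the pattern — fail fast on an unknown ID instead of returning an {{Optional}} that every call site must unwrap. String keys stand in for {{ExecutionVertexID}} to keep the example self-contained:

```java
import java.util.HashMap;
import java.util.Map;

public class ThrowingGetterTopology<V> {
    private final Map<String, V> vertices = new HashMap<>();

    void addVertex(String id, V vertex) {
        vertices.put(id, vertex);
    }

    /** Replaces an Optional-returning lookup: an absent ID is a caller
     *  bug, so throw IllegalArgumentException instead of Optional.empty(). */
    V getVertex(String id) {
        V vertex = vertices.get(id);
        if (vertex == null) {
            throw new IllegalArgumentException("Unknown vertex: " + id);
        }
        return vertex;
    }

    public static void main(String[] args) {
        ThrowingGetterTopology<String> topo = new ThrowingGetterTopology<>();
        topo.addVertex("v1", "map-operator");
        System.out.println(topo.getVertex("v1"));
        try {
            topo.getVertex("v2");
        } catch (IllegalArgumentException expected) {
            System.out.println("caught: " + expected.getMessage());
        }
    }
}
```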





[jira] [Created] (FLINK-17035) Replace FailoverTopology with SchedulingTopology

2020-04-07 Thread Gary Yao (Jira)
Gary Yao created FLINK-17035:


 Summary: Replace FailoverTopology with SchedulingTopology
 Key: FLINK-17035
 URL: https://issues.apache.org/jira/browse/FLINK-17035
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


Replace usages of {{FailoverTopology}} with {{SchedulingTopology}}. Due to 
their similarities it does not make sense to maintain multiple interfaces.





Re: [VOTE] FLIP-119: Pipelined Region Scheduling

2020-04-06 Thread Gary Yao
Hi all,

The voting time for FLIP-119 has passed. I am closing the vote now.

There were 6 +1 votes, 4 of which are binding:
- Till (binding)
- Zhijiang (binding)
- Xintong (non-binding)
- Yangze (non-binding)
- Kurt (binding)
- Zhu Zhu (binding)

There were no disapproving votes.

Thus, FLIP-119 has been accepted.

Best,
Gary

On Fri, Apr 3, 2020 at 12:02 PM Zhu Zhu  wrote:

> +1 (binding)
>
> Thanks,
> Zhu Zhu
>
> Kurt Young  于2020年4月3日周五 下午5:25写道:
>
> > +1 (binding)
> >
> > Best,
> > Kurt
> >
> >
> > On Fri, Apr 3, 2020 at 4:02 PM Yangze Guo  wrote:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Fri, Apr 3, 2020 at 2:11 PM Xintong Song 
> > wrote:
> > > >
> > > > +1 (non-binding)
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Apr 3, 2020 at 1:18 PM Zhijiang  > > .invalid>
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Best,
> > > > > Zhijiang
> > > > >
> > > > >
> > > > > --
> > > > > From:Till Rohrmann 
> > > > > Send Time:2020 Apr. 2 (Thu.) 23:09
> > > > > To:dev 
> > > > > Cc:zhuzh 
> > > > > Subject:Re: [VOTE] FLIP-119: Pipelined Region Scheduling
> > > > >
> > > > > +1
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Mar 31, 2020 at 5:52 PM Gary Yao  wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to start the vote for FLIP-119 [1], which is
> discussed
> > > and
> > > > > > reached a consensus in the discussion thread [2].
> > > > > >
> > > > > > The vote will be open until April 3 (72h) unless there is an
> > > objection
> > > > > > or not enough votes.
> > > > > >
> > > > > > Best,
> > > > > > Gary
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
> > > > > > [2]
> > > > > >
> > > > > >
> > > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-119-Pipelined-Region-Scheduling-tp39350.html
> > > > > >
> > > > >
> > > > >
> > >
> >
>


[jira] [Created] (FLINK-16960) Add PipelinedRegion Interface to Topology

2020-04-03 Thread Gary Yao (Jira)
Gary Yao created FLINK-16960:


 Summary: Add PipelinedRegion Interface to Topology
 Key: FLINK-16960
 URL: https://issues.apache.org/jira/browse/FLINK-16960
 Project: Flink
  Issue Type: Sub-task
Reporter: Gary Yao
Assignee: Gary Yao


{code}
interface Topology {
...
  Iterable getAllPipelinedRegions();

  PipelinedRegion getPipelinedRegionOfVertex(VID vertexId);
...
}

interface PipelinedRegion {
  Iterable getVertices();

  Iterable getVertex(VID vertexId);

  Iterable getConsumedResults();
}

{code}

 





[jira] [Created] (FLINK-16940) Avoid creating currentRegion HashSet with manually set initialCapacity

2020-04-02 Thread Gary Yao (Jira)
Gary Yao created FLINK-16940:


 Summary: Avoid creating currentRegion HashSet with manually set 
initialCapacity
 Key: FLINK-16940
 URL: https://issues.apache.org/jira/browse/FLINK-16940
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


The {{currentRegion}} HashSet in {{PipelinedRegionComputeUtil}} is created with 
an initialCapacity of 1. This is wrong because the set's capacity will already 
be increased when we add the first element. From the style guidelines:
{quote}
Set the initial capacity for a collection only if there is a good proven reason 
for that, otherwise do not clutter the code. In case of Maps it can be even 
deluding because the Map’s load factor effectively reduces the capacity.
{quote}
https://flink.apache.org/contributing/code-style-and-quality-java.html
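A small illustration of why an initialCapacity of 1 buys nothing. The numbers in the comment describe OpenJDK's HashMap behavior with the default 0.75 load factor; the method name is illustrative:

```java
import java.util.HashSet;
import java.util.Set;

public class InitialCapacityDemo {
    /** With the default load factor of 0.75, a HashSet sized for 1 element
     *  has a resize threshold of floor(1 * 0.75) = 0 in OpenJDK, so it must
     *  already grow when the first element is inserted. The explicit
     *  initialCapacity only clutters the code. */
    static Set<String> currentRegionWithCapacityOne() {
        return new HashSet<>(1);   // behaves no better than new HashSet<>()
    }

    public static void main(String[] args) {
        Set<String> region = currentRegionWithCapacityOne();
        region.add("vertex-0");            // triggers the resize immediately
        System.out.println(region.size());
    }
}
```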





[jira] [Created] (FLINK-16939) TaskManagerMessageParameters#taskManagerIdParameter is not declared final

2020-04-02 Thread Gary Yao (Jira)
Gary Yao created FLINK-16939:


 Summary: TaskManagerMessageParameters#taskManagerIdParameter is 
not declared final
 Key: FLINK-16939
 URL: https://issues.apache.org/jira/browse/FLINK-16939
 Project: Flink
  Issue Type: Bug
  Components: Runtime / REST
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


The field {{TaskManagerMessageParameters#taskManagerIdParameter}} is not 
declared final but it should be.





[VOTE] FLIP-119: Pipelined Region Scheduling

2020-03-31 Thread Gary Yao
Hi all,

I would like to start the vote for FLIP-119 [1], which is discussed and
reached a consensus in the discussion thread [2].

The vote will be open until April 3 (72h) unless there is an objection
or not enough votes.

Best,
Gary

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-119-Pipelined-Region-Scheduling-tp39350.html


Re: [DISCUSS] FLIP-119: Pipelined Region Scheduling

2020-03-30 Thread Gary Yao
l this request.
> > >
> > > A small comment concerning "Resource deadlocks when slot allocation
> > > competition happens between multiple jobs in a session cluster":
> Another
> > > idea to solve this situation would be to give the ResourceManager the
> > right
> > > to revoke slot assignments in order to change the mapping between
> > requests
> > > and available slots.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Mar 27, 2020 at 12:44 PM Xintong Song 
> > > wrote:
> > >
> > > > Gary & Zhu Zhu,
> > > >
> > > > Thanks for preparing this FLIP, and a BIG +1 from my side. The
> > trade-off
> > > > between resource utilization and potential deadlock problems has
> always
> > > > been a pain. Despite not solving all the deadlock cases, this FLIP is
> > > > definitely a big improvement. IIUC, it has already covered all the
> > > existing
> > > > single job cases, and all the mentioned non-covered cases are either
> in
> > > > multi-job session clusters or with diverse slot resources in future.
> > > >
> > > > I've read through the FLIP, and it looks really good to me. Good job!
> > All
> > > > the concerns and limitations that I can think of have already been
> > > clearly
> > > > stated, with reasonable potential future solutions. From the
> > perspective
> > > of
> > > > fine-grained resource management, I do not see any
> serious/irresolvable
> > > > conflict at this time.
> > > >
> > > > nit: The in-page links are not working. I guess those are copied from
> > > > google docs directly?
> > > >
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Fri, Mar 27, 2020 at 6:26 PM Zhu Zhu  wrote:
> > > >
> > > > > To Yangze,
> > > > >
> > > > > >> the blocking edge will not be consumable before the upstream is
> > > > > finished.
> > > > > Yes. This is how we define a BLOCKING result partition, "Blocking
> > > > > partitions represent blocking data exchanges, where the data stream
> > is
> > > > > first fully produced and then consumed".
> > > > >
> > > > > >> I'm also wondering could we execute the upstream and downstream
> > > > regions
> > > > > at the same time if we have enough resources
> > > > > It may lead to resource waste since the tasks in downstream regions
> > > > cannot
> > > > > read any data before the upstream region finishes. It saves a bit
> > time
> > > on
> > > > > schedule, but usually it does not make much difference for large
> > jobs,
> > > > > since data processing takes much more time. For small jobs, one can
> > > make
> > > > > all edges PIPELINED so that all the tasks can be scheduled at the
> > same
> > > > > time.
> > > > >
> > > > > >> is it possible to change the data exchange mode of two regions
> > > > > dynamically?
> > > > > This is not in the scope of the FLIP. But we are moving forward to
> a
> > > more
> > > > > extensible scheduler (FLINK-10429) and resource aware scheduling
> > > > > (FLINK-10407).
> > > > > So I think it's possible we can have a scheduler in the future
> which
> > > > > dynamically changes the shuffle type wisely regarding available
> > > > resources.
> > > > >
> > > > > Thanks,
> > > > > Zhu Zhu
> > > > >
> > > > > Yangze Guo  于2020年3月27日周五 下午4:49写道:
> > > > >
> > > > > > Thanks for updating!
> > > > > >
> > > > > > +1 for supporting the pipelined region scheduling. Although we
> > could
> > > > > > not prevent resource deadlock in all scenarios, it is really a
> big
> > > > > > step.
> > > > > >
> > > > > > The design generally LGTM.
> > > > > >
> > > > > > One minor thing I want to make sure. If I understand correctly,
> the
> > > > > > blocking edge will not be consumable before the upstream is
> > finished.
> > > > > >

Re: [DISCUSS] Creating a new repo to host Stateful Functions Dockerfiles

2020-03-27 Thread Gary Yao
+1 for a separate repository.

Best,
Gary

On Fri, Mar 27, 2020 at 2:46 AM Hequn Cheng  wrote:

> +1 for a separate repository.
> The dedicated `flink-docker` repo works fine now. We can do it similarly.
>
> Best,
> Hequn
>
> On Fri, Mar 27, 2020 at 1:16 AM Till Rohrmann 
> wrote:
>
> > +1 for a separate repository.
> >
> > Cheers,
> > Till
> >
> > On Thu, Mar 26, 2020 at 5:13 PM Ufuk Celebi  wrote:
> >
> > > +1.
> > >
> > > The repo creation process is a light-weight, automated process on the
> ASF
> > > side. When Patrick Lucas contributed docker-flink back to the Flink
> > > community (as flink-docker), there was virtually no overhead in
> creating
> > > the repository. Reusing build scripts should still be possible at the
> > cost
> > > of some duplication which is fine imo.
> > >
> > > – Ufuk
> > >
> > > On Thu, Mar 26, 2020 at 4:18 PM Stephan Ewen  wrote:
> > > >
> > > > +1 to a separate repository.
> > > >
> > > > It seems to be best practice in the docker community.
> > > > And since it does not add overhead, why not go with the best
> practice?
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > >
> > > > On Thu, Mar 26, 2020 at 4:15 PM Tzu-Li (Gordon) Tai <
> > > > tzuli...@apache.org> wrote:
> > > >>
> > > >> Hi Flink devs,
> > > >>
> > > >> As part of a Stateful Functions release, we would like to publish
> > > Stateful
> > > >> Functions Docker images to Dockerhub as an official image.
> > > >>
> > > >> Some background context on Stateful Function images, for those who
> > > >> are not familiar with the project yet:
> > > >>
> > > >>- Stateful Function images are built on top of the Flink official
> > > >>images, with additional StateFun dependencies being added.
> > > >>You can take a look at the scripts we currently use to build the
> > > >>images locally for development purposes [1].
> > > >>- They are quite important for user experience, since building a
> > > >>Docker image is the recommended go-to deployment mode for StateFun
> > > >>user applications [2].
> > > >>
> > > >>
> > > >> A prerequisite for all of this is to first decide where we host the
> > > >> Stateful Functions Dockerfiles, before we can proceed with the
> > > >> process of requesting a new official image repository at Dockerhub.
> > > >>
> > > >> We’re proposing to create a new dedicated repo for this purpose,
> > > >> with the name `apache/flink-statefun-docker`.
> > > >>
> > > >> While we did initially consider integrating the StateFun Dockerfiles
> > > >> to be hosted together with the Flink ones in the existing
> > > >> `apache/flink-docker` repo, we had the following concerns:
> > > >>
> > > >>- In general, it is a convention that each official Dockerhub image
> > > >>is backed by a dedicated source repo hosting the Dockerfiles.
> > > >>- The `apache/flink-docker` repo already has quite a bit of dedicated
> > > >>tooling and CI smoke tests specific to the Flink images.
> > > >>- Flink and StateFun have separate versioning schemes and independent
> > > >>release cycles. A new Flink release does not necessarily require a
> > > >>“lock-step” release of new StateFun images as well.
> > > >>- Considering all of the above, and the fact that creating a new
> > > >>repo is rather low-effort, having a separate repo would probably
> > > >>make more sense here.
> > > >>
> > > >>
> > > >> What do you think?
> > > >>
> > > >> Cheers,
> > > >> Gordon
> > > >>
> > > >> [1]
> > > >>
> > >
> > >
> >
> https://github.com/apache/flink-statefun/blob/master/tools/docker/build-stateful-functions.sh
> > > >> [2]
> > > >>
> > >
> > >
> >
> https://ci.apache.org/projects/flink/flink-statefun-docs-master/deployment-and-operations/packaging.html
> > >
> >
>


[DISCUSS] FLIP-119: Pipelined Region Scheduling

2020-03-26 Thread Gary Yao
Hi community,

In the past releases, we have been working on refactoring Flink's scheduler
with the goal of making the scheduler extensible [1]. We have rolled out
most of the intended refactoring in Flink 1.10, and we think it is now time
to leverage our newly introduced abstractions to implement a new resource
optimized scheduling strategy: Pipelined Region Scheduling.

This scheduling strategy aims at:

* avoiding resource deadlocks when running batch jobs

* being tunable with respect to resource consumption and throughput

More details can be found in the Wiki [2]. We are looking forward to your
feedback.

Best,

Zhu Zhu & Gary

[1] https://issues.apache.org/jira/browse/FLINK-10429

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
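As a rough illustration of the core idea behind pipelined region scheduling (a toy sketch under assumed semantics, not the FLIP's actual algorithm or Flink code): vertices connected by pipelined edges form one region, blocking edges separate regions, and each region can then be scheduled as a unit.

```java
import java.util.*;

// Illustrative-only sketch: vertices connected by PIPELINED edges are
// grouped into one region via union-find; BLOCKING edges separate regions.
public class PipelinedRegionSketch {
    private final int[] parent;

    PipelinedRegionSketch(int numVertices) {
        parent = new int[numVertices];
        for (int i = 0; i < numVertices; i++) parent[i] = i;
    }

    // Find the region representative of a vertex (with path halving).
    int find(int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    // Call for every PIPELINED edge; BLOCKING edges are simply skipped.
    void unionPipelined(int a, int b) {
        parent[find(a)] = find(b);
    }

    public static void main(String[] args) {
        // Toy job graph: 0 -(pipelined)-> 1 -(blocking)-> 2 -(pipelined)-> 3
        PipelinedRegionSketch regions = new PipelinedRegionSketch(4);
        regions.unionPipelined(0, 1);
        regions.unionPipelined(2, 3);
        // Vertices 0,1 form one region; 2,3 form another.
        if (regions.find(0) != regions.find(1)) throw new AssertionError();
        if (regions.find(2) != regions.find(3)) throw new AssertionError();
        if (regions.find(1) == regions.find(2)) throw new AssertionError();
        System.out.println("2 pipelined regions");
    }
}
```

Scheduling region by region (downstream regions only after their blocking inputs finish) is what avoids the batch-job resource deadlocks mentioned above.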


Re: [VOTE] FLIP 116: Unified Memory Configuration for Job Managers

2020-03-25 Thread Gary Yao
+1 (binding)

Best,
Gary

On Wed, Mar 18, 2020 at 3:16 PM Andrey Zagrebin 
wrote:

> Hi All,
>
> The discussion for FLIP-116 looks to be resolved [1].
> Therefore, I start the vote for it.
> The vote will end at 6pm CET on Monday, 23 March.
>
> Best,
> Andrey
>
> [1]
>
> http://mail-archives.apache.org/mod_mbox/flink-dev/202003.mbox/%3CCAJNyZN7AJAU_RUVhnWa7r%2B%2BtXpmUqWFH%2BG0hfoLVBzgRMmAO2w%40mail.gmail.com%3E
>


[jira] [Created] (FLINK-16718) KvStateServerHandlerTest leaks Netty ByteBufs

2020-03-23 Thread Gary Yao (Jira)
Gary Yao created FLINK-16718:


 Summary: KvStateServerHandlerTest leaks Netty ByteBufs
 Key: FLINK-16718
 URL: https://issues.apache.org/jira/browse/FLINK-16718
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Queryable State, Tests
Affects Versions: 1.10.0, 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


The {{KvStateServerHandlerTest}} leaks Netty {{ByteBuf}}s.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16710) Log Upload blocks Main Thread in TaskExecutor

2020-03-22 Thread Gary Yao (Jira)
Gary Yao created FLINK-16710:


 Summary: Log Upload blocks Main Thread in TaskExecutor
 Key: FLINK-16710
 URL: https://issues.apache.org/jira/browse/FLINK-16710
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.11.0


Uploading logs to the BlobServer blocks the TaskExecutor's main thread. We 
should introduce an IO thread pool that carries out file system accesses.
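A minimal sketch of the proposed fix, with hypothetical names (this is not Flink's actual TaskExecutor or BlobServer API): the blocking file read runs on a dedicated IO executor and the caller receives a CompletableFuture, so the main thread is never blocked on file system access.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.*;

public class IoOffloadSketch {
    // Dedicated pool for blocking file system work (hypothetical sizing).
    private static final ExecutorService IO_EXECUTOR =
            Executors.newFixedThreadPool(2, r -> {
                Thread t = new Thread(r, "io-worker");
                t.setDaemon(true);
                return t;
            });

    static CompletableFuture<byte[]> readFileAsync(Path path) {
        // supplyAsync runs the blocking read on the IO pool, not the caller.
        return CompletableFuture.supplyAsync(() -> {
            try {
                return Files.readAllBytes(path);
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        }, IO_EXECUTOR);
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("log", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        byte[] content = readFileAsync(tmp).get(10, TimeUnit.SECONDS);
        if (!new String(content, StandardCharsets.UTF_8).equals("hello")) {
            throw new AssertionError("unexpected file content");
        }
        System.out.println("read file off the main thread");
        Files.deleteIfExists(tmp);
        IO_EXECUTOR.shutdown();
    }
}
```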



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16590) flink-oss-fs-hadoop: Not all dependencies in NOTICE file are bundled

2020-03-13 Thread Gary Yao (Jira)
Gary Yao created FLINK-16590:


 Summary: flink-oss-fs-hadoop: Not all dependencies in NOTICE file 
are bundled
 Key: FLINK-16590
 URL: https://issues.apache.org/jira/browse/FLINK-16590
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem, Release System
Affects Versions: 1.11.0
Reporter: Gary Yao
 Fix For: 1.11.0


The NOTICE file in flink-oss-fs-hadoop lists
{code}
org.apache.commons:commons-compress
{code}

as a bundled dependency, which is not correct. There are likely other
dependencies that are wrongly listed in the NOTICE file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16445) Raise japicmp.referenceVersion to 1.10.0

2020-03-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-16445:


 Summary: Raise japicmp.referenceVersion to 1.10.0
 Key: FLINK-16445
 URL: https://issues.apache.org/jira/browse/FLINK-16445
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.11.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.11.0


In {{pom.xml}}, change property {{japicmp.referenceVersion}} to {{1.10.0}}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] FLIP-100[NEW]: Add Attempt Information

2020-03-04 Thread Gary Yao
+1 (binding)

Best,
Gary

On Wed, Mar 4, 2020 at 1:18 PM Yadong Xie  wrote:

> Hi all
>
> I want to start the vote for FLIP-100, which proposes to add attempt
> information inside subtask and timeline in web UI.
>
> To help everyone better understand the proposal, we spent some efforts on
> making an online POC
>
> Timeline Attempt (click the vertex timeline to see the differences):
> previous web:
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline
> POC web:
>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline
>
> Subtask Attempt (click the vertex and switch to subtask tab to see the
> differences):
> previous web:
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview
> POC web:
>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview
>
>
> The vote will last for at least 72 hours, following the consensus voting
> process.
>
> FLIP wiki:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information
>
> Discussion thread:
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
>
> Thanks,
>
> Yadong
>


Re: [VOTE] FLIP-100: Add Attempt Information

2020-03-03 Thread Gary Yao
Hi Yadong,

Thank you for updating the wiki page.

Only one minor suggestion – I would change:

> If show-history is true return the information of attempt.

to

> If show-history is true, information for all attempts including
previous ones will be returned

That being said, FLIP-100 looks good to me. From my side there is nothing
else to discuss.

@Kurt and @Jark: Can you look into the improvements that have been made
since
the last time you looked at the PoC? If you are happy, we can restart the
voting.

Best,
Gary

On Tue, Mar 3, 2020 at 2:34 PM Yadong Xie  wrote:

> Hi all
>
> The rest API part has been updated with Gary and Till's suggestions
> here is the link:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information
>
> Yadong Xie wrote on Tuesday, March 3, 2020 at 9:14 PM:
>
> > Hi Chesnay
> >
> > Most discussions in this vote are about additional feature/demo requests
> > for the POC, or about the response format; the main proposal, the web UI
> > part, has not changed.
> >
> > The discussion about the response is converging; the response format
> > discussion could happen either here or at the code review stage, as it
> > would be a minor change from my point of view.
> >
> > Chesnay Schepler wrote on Tuesday, March 3, 2020 at 8:20 PM:
> >
> >> I suggest to cancel this vote.
> >> Several discussion items have been brought up during the vote, some of
> >> which are still unresolved, others which resulted in changes to the
> >> proposal.
> >>
> >> My conclusion is that this proposal needs more discussions.
> >>
> >>
> >> On 20/02/2020 10:46, Yadong Xie wrote:
> >> > Hi all
> >> >
> >> > I want to start the vote for FLIP-100, which proposes to add attempt
> >> > information inside subtask and timeline in web UI.
> >> >
> >> > To help everyone better understand the proposal, we spent some efforts
> >> on
> >> > making an online POC
> >> >
> >> > Timeline Attempt (click the vertex timeline to see the differences):
> >> > previous web:
> >> >
> >>
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline
> >> > POC web:
> >> >
> >>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline
> >> >
> >> > Subtask Attempt (click the vertex and switch to subtask tab to see the
> >> > differences):
> >> > previous web:
> >> >
> >>
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview
> >> > POC web:
> >> >
> >>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview
> >> >
> >> >
> >> > The vote will last for at least 72 hours, following the consensus
> voting
> >> > process.
> >> >
> >> > FLIP wiki:
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information
> >> >
> >> > Discussion thread:
> >> >
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
> >> >
> >> > Thanks,
> >> >
> >> > Yadong
> >> >
> >>
> >>
>


Re: [VOTE] FLIP-100: Add Attempt Information

2020-03-02 Thread Gary Yao
Hi Yadong,

Thanks for driving this FLIP. I have a few questions/remarks:

* Why are we duplicating the subtask index in the objects that are
stored in the attempts-time-info array? I thought that all objects in the
same array share the same subtask index.
* Are we confident that the attempts-time-info array does not grow too
large during the lifetime of a job? Should the size of the array be limited?
* Have we considered placing the historic attempts in the same array as
the current attempts, i.e., flattening the arrays? One could toggle the
historic attempts on and off with a query parameter.
* I think 'attempt-history' would be a better name instead of
'attempts-time-info'.
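The flattening idea in the third bullet could look like the sketch below; this is illustrative only, and the Attempt class, its fields, and the show-history flag are hypothetical stand-ins, not Flink's actual REST model.

```java
import java.util.*;
import java.util.stream.Collectors;

public class AttemptListSketch {
    static final class Attempt {
        final int subtaskIndex;
        final int attemptNumber;
        final boolean current;
        Attempt(int subtaskIndex, int attemptNumber, boolean current) {
            this.subtaskIndex = subtaskIndex;
            this.attemptNumber = attemptNumber;
            this.current = current;
        }
    }

    // One flat list of attempts; a show-history query parameter decides
    // whether previous attempts stay in the response.
    static List<Attempt> attempts(List<Attempt> all, boolean showHistory) {
        if (showHistory) {
            return all;
        }
        return all.stream().filter(a -> a.current).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Attempt> all = Arrays.asList(
                new Attempt(0, 0, false), // previous attempt of subtask 0
                new Attempt(0, 1, true),  // current attempt of subtask 0
                new Attempt(1, 0, true)); // current attempt of subtask 1
        if (attempts(all, false).size() != 2) throw new AssertionError();
        if (attempts(all, true).size() != 3) throw new AssertionError();
        System.out.println("ok");
    }
}
```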

Let me know what you think.

Best,
Gary

On Thu, Feb 20, 2020 at 10:46 AM Yadong Xie  wrote:

> Hi all
>
> I want to start the vote for FLIP-100, which proposes to add attempt
> information inside subtask and timeline in web UI.
>
> To help everyone better understand the proposal, we spent some efforts on
> making an online POC
>
> Timeline Attempt (click the vertex timeline to see the differences):
> previous web:
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/timeline
> POC web:
>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/timeline
>
> Subtask Attempt (click the vertex and switch to subtask tab to see the
> differences):
> previous web:
> http://101.132.122.69:8081/#/job/9d651769488466d33e7a607e85203543/overview
> POC web:
>
> http://101.132.122.69:8081/web/#/job/9d651769488466d33e7a607e85203543/overview
>
>
> The vote will last for at least 72 hours, following the consensus voting
> process.
>
> FLIP wiki:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-100%3A+Add+Attempt+Information
>
> Discussion thread:
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
>
> Thanks,
>
> Yadong
>


Re: [VOTE] FLIP-99: Make Max Exception Configurable

2020-02-27 Thread Gary Yao
+1 (binding)

Best,
Gary

On Thu, Feb 20, 2020 at 10:39 AM Yadong Xie  wrote:

> Hi all
>
> I want to start the vote for FLIP-99, which proposes to make the max
> exception configurable in web UI.
>
> To help everyone better understand the proposal, we spent some efforts on
> making an online POC
>
> previous web:
>
> http://101.132.122.69:8081/#/job/543e9dc0cb2cca4433116007f0931d1a/exceptions
> POC web:
>
> http://101.132.122.69:8081/web/#/job/543e9dc0cb2cca4433116007f0931d1a/exceptions
>
> The vote will last for at least 72 hours, following the consensus voting
> process.
>
> FLIP wiki:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-99%3A+Make+Max+Exception+Configurable
>
> Discussion thread:
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-75-Flink-Web-UI-Improvement-Proposal-td33540.html
>
> Thanks,
>
> Yadong
>


[RESULT] [VOTE] Release 1.10.0, release candidate #3

2020-02-10 Thread Gary Yao
I'm happy to announce that we have unanimously approved this release.

There are 16 approving votes, 5 of which are binding:
* Kurt Young (binding)
* Jark Wu (binding)
* Kostas Kloudas (binding)
* Thomas Weise (binding)
* Jincheng Sun (binding)
* Aihua Li (non-binding)
* Zhu Zhu (non-binding)
* Congxian Qiu (non-binding)
* Rui Li (non-binding)
* Jingsong Li (non-binding)
* Yang Wang (non-binding)
* Piotr Nowojski (non-binding)
* Xintong Song (non-binding)
* Benchao Li (non-binding)
* Zili Chen (non-binding)
* Yangze Guo (non-binding)

There are no disapproving votes.

Thanks everyone!


[VOTE] Release 1.10.0, release candidate #3

2020-02-07 Thread Gary Yao
Hi everyone,
Please review and vote on the release candidate #3 for the version 1.10.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.10.0-rc3" [5],
* website pull request listing the new release and adding announcement blog
post [6][7].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Yu & Gary

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12345845
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc3/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1333
[5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc3
[6] https://github.com/apache/flink-web/pull/302
[7] https://github.com/apache/flink-web/pull/301


Re: [VOTE] Release 1.10.0, release candidate #2

2020-02-07 Thread Gary Yao
r-2.7.5-7.0.jar:/Users/jzhang/github/flink/build-target/lib/log4j-1.2.17.jar:/Users/jzhang/github/flink/build-target/lib/flink-table-blink_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-table_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-dist_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-python_2.11-1.10-SNAPSHOT.jar:/Users/jzhang/github/flink/build-target/lib/flink-hadoop-compatibility_2.11-1.9.1.jar*:/Users/jzhang/github/zeppelin/interpreter/flink/*:/Users/jzhang/github/zeppelin/zeppelin-interpreter-api/target/*:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes/*::/Users/jzhang/github/zeppelin/interpreter/zeppelin-interpreter-api-0.9.0-SNAPSHOT.jar:/Users/jzhang/github/zeppelin/zeppelin-interpreter/target/classes:/Users/jzhang/github/zeppelin/zeppelin-zengine/target/test-classes:/Users/jzhang/github/flink/build-target/opt/flink-python_2.11-1.10-SNAPSHOT.jar
> > > > >
> > > > >
> > > >
> > >
> >
> /Users/jzhang/github/flink/build-target/opt/tmp/flink-python_2.11-1.10-SNAPSHOT.jar
> > > > > org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer
> > 0.0.0.0
> > > > > 52603 flink-shared_process :
> > > > >
> > > > >
> > > > > Jingsong Li wrote on Thursday, February 6, 2020 at 4:10 PM:
> > > > >
> > > > > > Hi Jeff,
> > > > > >
> > > > > >
> > > > > > For FLINK-15935 [1],
> > > > > >
> > > > > > I try to treat it as a non-blocker, but it is really an
> > > > > > important issue.
> > > > > >
> > > > > >
> > > > > > The problem is the class loading order. We want to load the class
> > > > > > in the blink-planner.jar, but actually load the class in the
> > > > > > flink-planner.jar.
> > > > > >
> > > > > >
> > > > > > First of all, the order of class loading is based on the order of
> > > > > > classpath.
> > > > > >
> > > > > >
> > > > > > I just tried it; the classpath order for a folder depends on the
> > > > > > order of file names.
> > > > > >
> > > > > > - That is to say, our order is OK now, because
> > > > > > flink-table-blink_2.11-1.11-SNAPSHOT.jar sorts before
> > > > > > flink-table_2.11-1.11-SNAPSHOT.jar in name order.
> > > > > >
> > > > > > - But once I rename flink-table_2.11-1.11-SNAPSHOT.jar to
> > > > > > aflink-table_2.11-1.11-SNAPSHOT.jar, an error is reported.
> > > > > >
> > > > > >
> > > > > > The order of classpath entries should be influenced by the ls of
> > > > > > Linux. By default, the ls command lists files in alphabetical
> > > > > > order. [2]
> > > > > >
> > > > > >
> > > > > > The IDE seems to follow another rule, because it loads classes
> > > > > > directly, not jars.
> > > > > >
> > > > > >
> > > > > > So the possible impact of this problem is:
> > > > > >
> > > > > > - In the IDE, using watermarks with both planners will report an
> > > > > > error.
> > > > > >
> > > > > > - A real production environment under Linux will not report
> > > > > > errors, for the reasons above.
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/FLINK-15935
> > > > > >
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://linuxize.com/post/how-to-list-files-in-linux-using-the-ls-command/
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Jingsong Lee
> > > > > >
> > > > > > On Thu, Feb 6, 2020 at 3:09 PM Jeff Zhang 
> > wrote:
> > > > > >
> > > > > > > -1, I just found one critical issue
> > > > > > > https://issues.apache.org/jira/browse/FLINK-15935
> > > > > > > This ticket means user unable to use watermark in sql if he
> > 
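The classpath-ordering effect Jingsong describes can be reproduced with a plain alphabetical sort, standing in for the default `ls` ordering; the jar names are taken from the discussion above, and this is illustrative only.

```java
import java.util.*;

// Illustrative sketch: with default `ls` (alphabetical) ordering, the
// hyphen in "flink-table-blink..." sorts before the underscore in
// "flink-table_...", so the blink planner jar comes first on the classpath.
public class ClasspathOrderSketch {
    public static void main(String[] args) {
        List<String> jars = new ArrayList<>(Arrays.asList(
                "flink-table_2.11-1.11-SNAPSHOT.jar",
                "flink-table-blink_2.11-1.11-SNAPSHOT.jar"));
        Collections.sort(jars); // alphabetical, like a default `ls`
        if (!jars.get(0).startsWith("flink-table-blink")) {
            throw new AssertionError("expected blink jar first");
        }
        System.out.println(jars.get(0));
    }
}
```

Renaming the flink-table jar so that it sorts first (as in the aflink-table example above) flips this order, which matches the error Jingsong observed.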

[jira] [Created] (FLINK-15953) Job Status is hard to read for some Statuses

2020-02-07 Thread Gary Yao (Jira)
Gary Yao created FLINK-15953:


 Summary: Job Status is hard to read for some Statuses
 Key: FLINK-15953
 URL: https://issues.apache.org/jira/browse/FLINK-15953
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.9.2, 1.10.0
Reporter: Gary Yao
Assignee: Yadong Xie
 Fix For: 1.10.1, 1.11.0
 Attachments: 769B08ED-D644-4DEB-BA4C-14B18E562A52.png

The job status {{RESTARTING}} is rendered in a white font on a white
background, which makes it hard to read.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15936) TaskExecutorTest#testSlotAcceptance deadlocks

2020-02-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-15936:


 Summary: TaskExecutorTest#testSlotAcceptance deadlocks
 Key: FLINK-15936
 URL: https://issues.apache.org/jira/browse/FLINK-15936
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.1


https://api.travis-ci.org/v3/job/646510877/log.txt

{noformat}
"main" #1 prio=5 os_prio=0 tid=0x7f2f5800b800 nid=0x499 waiting on 
condition [0x7f2f61733000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x8669b3a8> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutorTest.testSlotAcceptance(TaskExecutorTest.java:875)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{noformat}
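The thread dump shows the test parked forever in an unbounded CompletableFuture.get(). A small, self-contained illustration (not the actual test code) of how a bounded get turns such a hang into a diagnosable failure:

```java
import java.util.concurrent.*;

// A future that is never completed hangs an unbounded get() forever, as in
// the stack trace above; get(timeout) surfaces a TimeoutException instead.
public class BoundedGetSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        try {
            neverCompleted.get(100, TimeUnit.MILLISECONDS);
            throw new AssertionError("expected a timeout");
        } catch (TimeoutException expected) {
            System.out.println("timed out as expected");
        }
    }
}
```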



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.10.0, release candidate #2

2020-02-05 Thread Gary Yao
Note that there is currently an ongoing discussion about whether FLINK-15917
and FLINK-15918 should be fixed in 1.10.0 [1].

[1]
http://mail-archives.apache.org/mod_mbox/flink-dev/202002.mbox/%3CCA%2B5xAo3D21-T5QysQg3XOdm%3DL9ipz3rMkA%3DqMzxraJRgfuyg2A%40mail.gmail.com%3E

On Wed, Feb 5, 2020 at 8:00 PM Gary Yao  wrote:

> Hi everyone,
> Please review and vote on the release candidate #2 for the version 1.10.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.10.0-rc2" [5],
> * website pull request listing the new release and adding announcement
> blog post [6][7].
>
> The vote will be open for at least 24 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Yu & Gary
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12345845
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc2/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1332
> [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc2
> [6] https://github.com/apache/flink-web/pull/302
> [7] https://github.com/apache/flink-web/pull/301
>


Re: [VOTE] Release 1.10.0, release candidate #1

2020-02-05 Thread Gary Yao
It is indeed unfortunate that these issues are discovered only now. I think
Thomas has a valid point, and we might be risking the trust of our users
here.

What are our options?

1. Document this behavior and how to work around it copiously in the
release notes [1]
2. Try to restore the previous behavior
3. Change default value of jobmanager.scheduler to "legacy" and rollout
the feature in 1.11
4. Change default value of jobmanager.scheduler to "legacy" and rollout
the feature earliest in 1.10.1

[1]
https://github.com/apache/flink/pull/10997/files#diff-b84c5611825842e8f74301ca70d94d23R86

On Wed, Feb 5, 2020 at 7:24 PM Stephan Ewen  wrote:

> Should we make these a blocker? I am not sure - we could also clearly
> state in the release notes how to restore the old behavior, if your setup
> assumes that behavior.
>
> Release candidates for this release have been out since mid December, it
> is a bit unfortunate that these things have been raised so late.
> Having these rather open ended tickets (how to re-define the existing
> metrics in the new scheduler/failover handling) now as release blockers
> would mean that 1.10 is indefinitely delayed.
>
> Are we sure we want to do that?
>
> On Wed, Feb 5, 2020 at 6:53 PM Thomas Weise  wrote:
>
>> Hi Gary,
>>
>> Thanks for the clarification!
>>
>> When we upgrade to a new Flink release, we don't start with a default
>> flink-conf.yaml but upgrade our existing tooling and configuration.
>> Therefore we notice this issue as part of the upgrade to 1.10, and not
>> when
>> we upgraded to 1.9.
>>
>> I would expect many other users to be in the same camp, and therefore
>> consider making these regressions a blocker for 1.10?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Feb 5, 2020 at 4:53 AM Gary Yao  wrote:
>>
>> > > also notice that the exception causing a restart is no longer
>> displayed
>> > > in the UI, which is probably related?
>> >
>> > Yes, this is also related to the new scheduler. I created FLINK-15917
>> [1]
>> > to
>> > track this. Moreover, I created a ticket about the uptime metric not
>> > resetting
>> > [2]. Both issues already exist in 1.9 if
>> > "jobmanager.execution.failover-strategy" is set to "region", which is
>> the
>> > case
>> > in the default flink-conf.yaml.
>> >
>> > In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
>> > fall
>> > back to the previous behavior.
>> >
>> > In 1.10, you can still fall back to the previous behavior by setting
>> > "jobmanager.scheduler: legacy" and unsetting
>> > "jobmanager.execution.failover-strategy" in your flink-conf.yaml
>> >
>> > I would not consider these issues blockers since there is a workaround
>> for
>> > them, but of course we would like to see the new scheduler getting some
>> > production exposure. More detailed release notes about the caveats of
>> the
>> > new
>> > scheduler will be added to the user documentation.
>> >
>> >
>> > > The watermark issue was
>> > https://issues.apache.org/jira/browse/FLINK-14470
>> >
>> > This should be fixed now [3].
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/FLINK-15917
>> > [2] https://issues.apache.org/jira/browse/FLINK-15918
>> > [3] https://issues.apache.org/jira/browse/FLINK-8949
>> >
>> > On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise  wrote:
>> >
>> >> Hi Gary,
>> >>
>> >> Thanks for the reply.
>> >>
>> >> -->
>> >>
>> >> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao  wrote:
>> >>
>> >> > Hi Thomas,
>> >> >
>> >> > > 2) Was there a change in how job recovery reflects in the uptime
>> >> metric?
>> >> > > Didn't uptime previously reset to 0 on recovery (now it just keeps
>> >> > > increasing)?
>> >> >
>> >> > The uptime is the difference between the current time and the time
>> when
>> >> the
>> >> > job transitioned to RUNNING state. By default we no longer transition
>> >> the
>> >> > job
>> >> > out of the RUNNING state when restarting. This has something to do
>> with
>> >> the
>> >> > new scheduler which enables pipelined region failover by default [1].
>> >> > Ac

[VOTE] Release 1.10.0, release candidate #2

2020-02-05 Thread Gary Yao
Hi everyone,
Please review and vote on the release candidate #2 for the version 1.10.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-1.10.0-rc2" [5],
* website pull request listing the new release and adding announcement blog
post [6][7].

The vote will be open for at least 24 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Yu & Gary

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12345845
[2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc2/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4] https://repository.apache.org/content/repositories/orgapacheflink-1332
[5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc2
[6] https://github.com/apache/flink-web/pull/302
[7] https://github.com/apache/flink-web/pull/301


Re: [VOTE] Release 1.10.0, release candidate #1

2020-02-05 Thread Gary Yao
> also notice that the exception causing a restart is no longer displayed
> in the UI, which is probably related?

Yes, this is also related to the new scheduler. I created FLINK-15917 [1] to
track this. Moreover, I created a ticket about the uptime metric not
resetting
[2]. Both issues already exist in 1.9 if
"jobmanager.execution.failover-strategy" is set to "region", which is the
case
in the default flink-conf.yaml.

In 1.9, unsetting "jobmanager.execution.failover-strategy" was enough to
fall
back to the previous behavior.

In 1.10, you can still fall back to the previous behavior by setting
"jobmanager.scheduler: legacy" and unsetting
"jobmanager.execution.failover-strategy" in your flink-conf.yaml

I would not consider these issues blockers since there is a workaround for
them, but of course we would like to see the new scheduler getting some
production exposure. More detailed release notes about the caveats of the
new
scheduler will be added to the user documentation.


> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470

This should be fixed now [3].


[1] https://issues.apache.org/jira/browse/FLINK-15917
[2] https://issues.apache.org/jira/browse/FLINK-15918
[3] https://issues.apache.org/jira/browse/FLINK-8949

On Wed, Feb 5, 2020 at 7:04 AM Thomas Weise  wrote:

> Hi Gary,
>
> Thanks for the reply.
>
> -->
>
> On Tue, Feb 4, 2020 at 5:20 AM Gary Yao  wrote:
>
> > Hi Thomas,
> >
> > > 2) Was there a change in how job recovery reflects in the uptime
> metric?
> > > Didn't uptime previously reset to 0 on recovery (now it just keeps
> > > increasing)?
> >
> > The uptime is the difference between the current time and the time when
> the
> > job transitioned to RUNNING state. By default we no longer transition the
> > job
> > out of the RUNNING state when restarting. This has something to do with
> the
> > new scheduler which enables pipelined region failover by default [1].
> > Actually
> > we enabled pipelined region failover already in the binary distribution
> of
> > Flink 1.9 by setting:
> >
> > jobmanager.execution.failover-strategy: region
> >
> > in the default flink-conf.yaml. Unless you have removed this config
> option
> > or
> > you are using a custom yaml, you should be seeing this behavior in Flink
> > 1.9.
> > If you do not want region failover, set
> >
> > jobmanager.execution.failover-strategy: full
> >
> >
> We are using the default (the jobmanager.execution.failover-strategy
> setting is not present in our flink config).
>
> The change in behavior I see is between the 1.9 based deployment and the
> 1.10 RC.
>
> Our 1.9 branch is here:
> https://github.com/lyft/flink/tree/release-1.9-lyft
>
> I also notice that the exception causing a restart is no longer displayed
> in the UI, which is probably related?
>
>
> >
> > > 1) Is the low watermark display in the UI still broken?
> >
> > I was not aware that this is broken. Is there an issue tracking this bug?
> >
>
> The watermark issue was https://issues.apache.org/jira/browse/FLINK-14470
>
> (I don't have a good way to verify it is fixed at the moment.)
>
> Another problem with this 1.10 RC is that the checkpointAlignmentTime
> metric is missing. (I have not been able to investigate this further yet.)
>
>
> >
> > Best,
> > Gary
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-14651
> >
> > On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise  wrote:
> >
> >> I opened a PR for FLINK-15868
> >> <https://issues.apache.org/jira/browse/FLINK-15868>:
> >> https://github.com/apache/flink/pull/11006
> >>
> >> With that change, I was able to run an application that consumes from
> >> Kinesis.
> >>
> >> I should have data tomorrow regarding the performance.
> >>
> >> Two questions/observations:
> >>
> >> 1) Is the low watermark display in the UI still broken?
> >> 2) Was there a change in how job recovery reflects in the uptime metric?
> >> Didn't uptime previously reset to 0 on recovery (now it just keeps
> >> increasing)?
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >>
> >> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise  wrote:
> >>
> >> > I found another issue with the Kinesis connector:
> >> >
> >> > https://issues.apache.org/jira/browse/FLINK-15868
> >> >
> >> >
> >> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao  wrote:

[jira] [Created] (FLINK-15918) Uptime Metric not reset on Job Restart

2020-02-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-15918:


 Summary: Uptime Metric not reset on Job Restart
 Key: FLINK-15918
 URL: https://issues.apache.org/jira/browse/FLINK-15918
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.2, 1.10.0
Reporter: Gary Yao
 Fix For: 1.11.0, 1.10.1


*Description*
The {{uptime}} metric is not reset when the job restarts, which is a change in 
behavior compared to Flink 1.8.
This change of behavior exists since 1.9.0 if 
{{jobmanager.execution.failover-strategy: region}} is configured,
which we do in the default flink-conf.yaml.


*Workarounds*
Users that find this behavior problematic can set {{jobmanager.scheduler: legacy}} in their {{flink-conf.yaml}}.


*How to reproduce*
trivial

*Expected behavior*
This is up for discussion. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15917) Root Exception not shown in Web UI

2020-02-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-15917:


 Summary: Root Exception not shown in Web UI
 Key: FLINK-15917
 URL: https://issues.apache.org/jira/browse/FLINK-15917
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.2, 1.10.0
Reporter: Gary Yao
 Fix For: 1.11.0, 1.10.1


*Description*
 On the job details page in the Exceptions → Root Exception tab, exceptions 
that cause the job to restart are not displayed.
 This is already a problem since 1.9.0 if 
{{jobmanager.execution.failover-strategy: region}} is configured,
 which we do in the default flink-conf.yaml.

*Workarounds*
 Users that run into this problem can set {{jobmanager.scheduler: legacy}} in 
their {{flink-conf.yaml}}

*How to reproduce*
 In {{flink-conf.yaml}} set {{restart-strategy: fixed-delay}} to enable job restarts.
{noformat}
$ bin/start-cluster.sh
$ bin/flink run -d examples/streaming/TopSpeedWindowing.jar
$ bin/taskmanager.sh stop
{noformat}
Assert that no exception is displayed in the Web UI.

*Expected behavior*
 The stacktrace of the exception should be displayed. Whether the exception 
should be also shown if only a partial region of the job failed is up for 
discussion.





Re: [VOTE] Release 1.10.0, release candidate #1

2020-02-04 Thread Gary Yao
Hi Thomas,

> 2) Was there a change in how job recovery reflects in the uptime metric?
> Didn't uptime previously reset to 0 on recovery (now it just keeps
> increasing)?

The uptime is the difference between the current time and the time when the
job transitioned to RUNNING state. By default we no longer transition the
job
out of the RUNNING state when restarting. This has something to do with the
new scheduler which enables pipelined region failover by default [1].
Actually
we enabled pipelined region failover already in the binary distribution of
Flink 1.9 by setting:

jobmanager.execution.failover-strategy: region

in the default flink-conf.yaml. Unless you have removed this config option
or
you are using a custom yaml, you should be seeing this behavior in Flink
1.9.
If you do not want region failover, set

jobmanager.execution.failover-strategy: full
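To make the uptime semantics above concrete, here is a rough sketch (class and method names are illustrative only, not Flink's actual metric classes): uptime is simply the delta between now and the instant the job entered RUNNING, so a region failover that never moves the job out of RUNNING leaves the gauge growing.

```java
// Rough sketch of the uptime semantics described above. Names are
// illustrative; this is not Flink's actual implementation.
class UptimeGauge {
    private final long runningSinceMillis; // when the job entered RUNNING

    UptimeGauge(long runningSinceMillis) {
        this.runningSinceMillis = runningSinceMillis;
    }

    // uptime = now - time of transition to RUNNING; a restart that keeps
    // the job in RUNNING does not reset runningSinceMillis.
    long getValue(long nowMillis) {
        return nowMillis - runningSinceMillis;
    }
}
```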


> 1) Is the low watermark display in the UI still broken?

I was not aware that this is broken. Is there an issue tracking this bug?

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-14651

On Tue, Feb 4, 2020 at 2:56 AM Thomas Weise  wrote:

> I opened a PR for FLINK-15868
> <https://issues.apache.org/jira/browse/FLINK-15868>:
> https://github.com/apache/flink/pull/11006
>
> With that change, I was able to run an application that consumes from
> Kinesis.
>
> I should have data tomorrow regarding the performance.
>
> Two questions/observations:
>
> 1) Is the low watermark display in the UI still broken?
> 2) Was there a change in how job recovery reflects in the uptime metric?
> Didn't uptime previously reset to 0 on recovery (now it just keeps
> increasing)?
>
> Thanks,
> Thomas
>
>
>
>
> On Mon, Feb 3, 2020 at 10:55 AM Thomas Weise  wrote:
>
> > I found another issue with the Kinesis connector:
> >
> > https://issues.apache.org/jira/browse/FLINK-15868
> >
> >
> > On Mon, Feb 3, 2020 at 3:35 AM Gary Yao  wrote:
> >
> >> Hi everyone,
> >>
> >> I am hereby canceling the vote due to:
> >>
> >> FLINK-15837
> >> FLINK-15840
> >>
> >> Another RC will be created later today.
> >>
> >> Best,
> >> Gary
> >>
> >> On Mon, Jan 27, 2020 at 10:06 PM Gary Yao  wrote:
> >>
> >> > Hi everyone,
> >> > Please review and vote on the release candidate #1 for the version
> >> 1.10.0,
> >> > as follows:
> >> > [ ] +1, Approve the release
> >> > [ ] -1, Do not approve the release (please provide specific comments)
> >> >
> >> >
> >> > The complete staging area is available for your review, which
> includes:
> >> > * JIRA release notes [1],
> >> > * the official Apache source release and binary convenience releases
> to
> >> be
> >> > deployed to dist.apache.org [2], which are signed with the key with
> >> > fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> >> > * all artifacts to be deployed to the Maven Central Repository [4],
> >> > * source code tag "release-1.10.0-rc1" [5],
> >> >
> >> > The announcement blog post is in the works. I will update this voting
> >> > thread with a link to the pull request soon.
> >> >
> >> > The vote will be open for at least 72 hours. It is adopted by majority
> >> > approval, with at least 3 PMC affirmative votes.
> >> >
> >> > Thanks,
> >> > Yu & Gary
> >> >
> >> > [1]
> >> >
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> >> > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> >> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> >> > [4]
> >> https://repository.apache.org/content/repositories/orgapacheflink-1325
> >> > [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
> >> >
> >>
> >
>


[jira] [Created] (FLINK-15900) JoinITCase#testRightJoinWithPk failed on Travis

2020-02-04 Thread Gary Yao (Jira)
Gary Yao created FLINK-15900:


 Summary: JoinITCase#testRightJoinWithPk failed on Travis
 Key: FLINK-15900
 URL: https://issues.apache.org/jira/browse/FLINK-15900
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Gary Yao


{noformat}
org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at 
org.apache.flink.table.planner.runtime.stream.sql.JoinITCase.testRightJoinWithPk(JoinITCase.scala:672)
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, 
backoffTimeMS=0)
Caused by: java.lang.Exception: Exception while creating 
StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state 
backend for KeyedProcessOperator_17aecc34cf8aa256be6fe4836cbdf29a_(2/4) from 
any of the 1 provided restore options.
Caused by: java.io.IOException: Failed to acquire shared cache resource for 
RocksDB
Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could not 
created the shared memory resource of size 20971520. Not enough memory left to 
reserve from the slot's managed memory.
Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could 
not allocate 20971520 bytes. Only 0 bytes are remaining.

{noformat}

https://api.travis-ci.org/v3/job/645466432/log.txt
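The last two causes in the log amount to a plain budget-style reservation failing: the slot's managed memory pool has 0 bytes remaining, so reserving the 20971520-byte shared RocksDB cache cannot succeed. A minimal sketch of that reservation logic (hypothetical; not Flink's actual MemoryManager code):

```java
// Hypothetical sketch of a budget-style reservation that fails the way
// the log above shows; not Flink's actual MemoryManager implementation.
class ManagedMemoryBudget {
    private long remainingBytes;

    ManagedMemoryBudget(long totalBytes) {
        this.remainingBytes = totalBytes;
    }

    // Reserving more than what is left fails instead of overcommitting.
    void reserve(long bytes) {
        if (bytes > remainingBytes) {
            throw new IllegalStateException("Could not allocate " + bytes
                    + " bytes. Only " + remainingBytes + " bytes are remaining.");
        }
        remainingBytes -= bytes;
    }
}
```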





Re: [VOTE] Release 1.10.0, release candidate #1

2020-02-03 Thread Gary Yao
Hi everyone,

I am hereby canceling the vote due to:

FLINK-15837
FLINK-15840

Another RC will be created later today.

Best,
Gary

On Mon, Jan 27, 2020 at 10:06 PM Gary Yao  wrote:

> Hi everyone,
> Please review and vote on the release candidate #1 for the version 1.10.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint BB137807CEFBE7DD2616556710B12A1F89C115E8 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-1.10.0-rc1" [5],
>
> The announcement blog post is in the works. I will update this voting
> thread with a link to the pull request soon.
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Yu & Gary
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc1/
> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> [4] https://repository.apache.org/content/repositories/orgapacheflink-1325
> [5] https://github.com/apache/flink/releases/tag/release-1.10.0-rc1
>


[jira] [Created] (FLINK-15781) Manual Smoke Test with Batch Job

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15781:


 Summary: Manual Smoke Test with Batch Job
 Key: FLINK-15781
 URL: https://issues.apache.org/jira/browse/FLINK-15781
 Project: Flink
  Issue Type: Task
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


Try out a larger-scale batch job with {{jobmanager.scheduler: ng}} enabled in {{flink-conf.yaml}}.





[jira] [Created] (FLINK-15780) Manual Smoke Test of Native Kubernetes Integration

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15780:


 Summary: Manual Smoke Test of Native Kubernetes Integration
 Key: FLINK-15780
 URL: https://issues.apache.org/jira/browse/FLINK-15780
 Project: Flink
  Issue Type: Task
  Components: Deployment / Kubernetes
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Try out the [native Kubernetes 
integration|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html],
 and report any usability issues.





[jira] [Created] (FLINK-15779) Manual Smoke Test of Python API

2020-01-28 Thread Gary Yao (Jira)
Gary Yao created FLINK-15779:


 Summary: Manual Smoke Test of Python API
 Key: FLINK-15779
 URL: https://issues.apache.org/jira/browse/FLINK-15779
 Project: Flink
  Issue Type: Task
  Components: API / Python
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Try out the [Python 
API|https://ci.apache.org/projects/flink/flink-docs-release-1.10/tutorials/python_table_api.html],
 and report any usability issues.






[jira] [Created] (FLINK-15754) Remove config options table.exec.resource.*memory from Documentation

2020-01-24 Thread Gary Yao (Jira)
Gary Yao created FLINK-15754:


 Summary: Remove config options table.exec.resource.*memory from 
Documentation
 Key: FLINK-15754
 URL: https://issues.apache.org/jira/browse/FLINK-15754
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


{noformat}
table.exec.resource.external-buffer-memory
table.exec.resource.hash-agg.memory
table.exec.resource.hash-join.memory
table.exec.resource.sort.memory
{noformat}





[jira] [Created] (FLINK-15751) RocksDB Memory Management end-to-end test fails

2020-01-24 Thread Gary Yao (Jira)
Gary Yao created FLINK-15751:


 Summary: RocksDB Memory Management end-to-end test fails
 Key: FLINK-15751
 URL: https://issues.apache.org/jira/browse/FLINK-15751
 Project: Flink
  Issue Type: Task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


RocksDB Memory Management end-to-end test fails due to non-whitelisted 
exceptions in the TaskManager log.

{noformat}
org.apache.flink.runtime.execution.CancelTaskException: Consumed partition 
PipelinedSubpartitionView(index: 1) of ResultPartition 
a451ee900d8b1e282d877868cdb69dbe@b9cbd0aed214bb91ce4c4fb0fa6473a6 has been 
released.
at 
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:190)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:509)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:487)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:475)
at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75)
at 
org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
at java.lang.Thread.run(Thread.java:748)
{noformat}





[jira] [Created] (FLINK-15743) Add Flink 1.10 release notes to documentation

2020-01-23 Thread Gary Yao (Jira)
Gary Yao created FLINK-15743:


 Summary: Add Flink 1.10 release notes to documentation
 Key: FLINK-15743
 URL: https://issues.apache.org/jira/browse/FLINK-15743
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


Gather, edit, and add Flink 1.10 release notes to documentation.





[ANNOUNCE] Apache Flink 1.10.0, release candidate #0

2020-01-17 Thread Gary Yao
Hi all,

RC0 for Apache Flink 1.10.0 has been created. This has all the artifacts
that
we would typically have for a release, except for a source code tag and a PR
for the release announcement.

This preview-only RC is created only to drive the current testing efforts,
and
no official vote will take place. It includes the following:

* the preview source release and binary convenience releases [1], which are
signed with the key with fingerprint
BB137807CEFBE7DD2616556710B12A1F89C115E8
[2],
* all artifacts that would normally be deployed to the Maven Central
Repository [3]

To test with these artifacts, you can create a settings.xml file with the
content shown below [4]. This settings file can be referenced in your maven
commands via --settings /path/to/settings.xml. This is useful for creating a
quickstart project based on the staged release and also for building against
the staged jars.

Happy testing!

Best,
Gary

[1] https://dist.apache.org/repos/dist/dev/flink/flink-1.10.0-rc0/
[2] https://dist.apache.org/repos/dist/release/flink/KEYS
[3] https://repository.apache.org/content/repositories/orgapacheflink-1318/
[4]

<settings>
  <activeProfiles>
    <activeProfile>flink-1.10.0</activeProfile>
  </activeProfiles>
  <profiles>
    <profile>
      <id>flink-1.10.0</id>
      <repositories>
        <repository>
          <id>flink-1.10.0</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1318/</url>
        </repository>
        <repository>
          <id>archetype</id>
          <url>https://repository.apache.org/content/repositories/orgapacheflink-1318/</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
</settings>



[jira] [Created] (FLINK-15638) releasing/create_release_branch.sh does not set version in flink-python/pyflink/version.py

2020-01-17 Thread Gary Yao (Jira)
Gary Yao created FLINK-15638:


 Summary: releasing/create_release_branch.sh does not set version 
in flink-python/pyflink/version.py
 Key: FLINK-15638
 URL: https://issues.apache.org/jira/browse/FLINK-15638
 Project: Flink
  Issue Type: Bug
  Components: Release System
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


{{releasing/create_release_branch.sh}} does not set the version in 
{{flink-python/pyflink/version.py}}. Currently the version.py contains:

{noformat}
__version__ = "1.10.dev0" 
{noformat}

{{setup.py}} replaces .dev0 with -SNAPSHOT and tries to find the respective 
flink distribution in flink-dist/target, which will not exist.
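The version mapping performed during the build can be sketched as follows (illustrative only; the real logic lives in flink-python/setup.py and is written in Python):

```java
// Illustrative sketch of the ".dev0" -> "-SNAPSHOT" version mapping
// described above; not the actual flink-python/setup.py logic.
class PyFlinkVersions {
    static String devToSnapshot(String version) {
        String devSuffix = ".dev0";
        if (version.endsWith(devSuffix)) {
            // strip the PEP 440 dev suffix and append the Maven snapshot suffix
            return version.substring(0, version.length() - devSuffix.length()) + "-SNAPSHOT";
        }
        return version; // release versions pass through unchanged
    }
}
```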






[jira] [Created] (FLINK-15623) Building flink-python with maven profile docs-and-source fails

2020-01-16 Thread Gary Yao (Jira)
Gary Yao created FLINK-15623:


 Summary: Building flink-python with maven profile docs-and-source fails
 Key: FLINK-15623
 URL: https://issues.apache.org/jira/browse/FLINK-15623
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.10.0
 Environment: rev: 91d96abe5f42bd088a326870b4885d79611fccb5
Reporter: Gary Yao
 Fix For: 1.10.0


*Description*
Building flink-python with maven profile docs-and-source fails due to 
checkstyle violations. 

*How to reproduce*

Running

{noformat}
mvn clean install -pl flink-python -Pdocs-and-source -DskipTests 
-DretryFailedDeploymentCount=10
{noformat}

should fail with the following error

{noformat}
[...]
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8343] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8344] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8345] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8346] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8347] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8348] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8349] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[ERROR] 
generated-sources/org/apache/flink/fnexecution/v1/FlinkFnApi.java:[8350] 
(regexp) RegexpSinglelineJava: Line has leading space characters; indentation 
should be performed with tabs only.
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 18.046 s
[INFO] Finished at: 2020-01-16T16:44:01+00:00
[INFO] Final Memory: 158M/2826M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on 
project flink-python_2.11: You have 7603 Checkstyle violations. -> [Help 1]
[ERROR]
{noformat}
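The violated rule flags any line whose indentation contains space characters instead of tabs only. A simplified version of that check (an assumption about the rule's intent; the real rule is a checkstyle RegexpSinglelineJava regex configured in Flink's build):

```java
// Simplified sketch of the "leading space characters" check reported
// above; the actual rule is a checkstyle RegexpSinglelineJava regex.
class IndentCheck {
    static boolean violates(String line) {
        // flags indentation that contains a space (tabs-only is allowed)
        return line.matches("^\\t* +.*");
    }
}
```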





[jira] [Created] (FLINK-15317) State TTL Heap backend end-to-end test fails on Travis

2019-12-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-15317:


 Summary: State TTL Heap backend end-to-end test fails on Travis
 Key: FLINK-15317
 URL: https://issues.apache.org/jira/browse/FLINK-15317
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


https://api.travis-ci.org/v3/job/626286529/log.txt

{noformat}
Checking for errors...
Found error in log files:
...
java.lang.IllegalStateException: Released
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:483)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:474)
at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75)
at 
org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
at java.lang.Thread.run(Thread.java:748)
{noformat}





[jira] [Created] (FLINK-15316) SQL Client end-to-end test (Old planner) failed on Travis

2019-12-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-15316:


 Summary: SQL Client end-to-end test (Old planner) failed on Travis
 Key: FLINK-15316
 URL: https://issues.apache.org/jira/browse/FLINK-15316
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


https://api.travis-ci.org/v3/job/626286501/log.txt

{noformat}
Checking for errors...
Found error in log files:
...
org.apache.flink.table.client.gateway.SqlExecutionException: Invalid SQL update 
statement.
at 
org.apache.flink.table.client.gateway.local.LocalExecutor.applyUpdate(LocalExecutor.java:697)
at 
org.apache.flink.table.client.gateway.local.LocalExecutor.executeUpdateInternal(LocalExecutor.java:576)
at 
org.apache.flink.table.client.gateway.local.LocalExecutor.executeUpdate(LocalExecutor.java:527)
at 
org.apache.flink.table.client.cli.CliClient.callInsertInto(CliClient.java:535)
at 
org.apache.flink.table.client.cli.CliClient.lambda$submitUpdate$0(CliClient.java:231)
at java.util.Optional.map(Optional.java:215)
at 
org.apache.flink.table.client.cli.CliClient.submitUpdate(CliClient.java:228)
at org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:129)
at org.apache.flink.table.client.SqlClient.start(SqlClient.java:104)
at org.apache.flink.table.client.SqlClient.main(SqlClient.java:178)
Caused by: java.lang.RuntimeException: Error while applying rule 
PushProjectIntoTableSourceScanRule, args 
[rel#88:FlinkLogicalCalc.LOGICAL(input=RelSubset#87,expr#0..2={inputs},expr#3=IS
 NOT NULL($t1),user=$t1,rowtime=$t0,$condition=$t3), 
Scan(table:[default_catalog, default_database, JsonSourceTable], 
fields:(rowtime, user, event), source:KafkaTableSource(rowtime, user, event))]
at 
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:235)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:631)
at 
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:327)
at 
org.apache.flink.table.plan.Optimizer.runVolcanoPlanner(Optimizer.scala:280)
at 
org.apache.flink.table.plan.Optimizer.optimizeLogicalPlan(Optimizer.scala:199)
at 
org.apache.flink.table.plan.StreamOptimizer.optimize(StreamOptimizer.scala:66)
at 
org.apache.flink.table.planner.StreamPlanner.writeToUpsertSink(StreamPlanner.scala:353)
at 
org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$writeToSink(StreamPlanner.scala:281)
at 
org.apache.flink.table.planner.StreamPlanner$$anonfun$2.apply(StreamPlanner.scala:166)
at 
org.apache.flink.table.planner.StreamPlanner$$anonfun$2.apply(StreamPlanner.scala:145)
at scala.Option.map(Option.scala:146)
at 
org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$translate(StreamPlanner.scala:145)
at 
org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117)
at 
org.apache.flink.table.planner.StreamPlanner$$anonfun$translate$1.apply(StreamPlanner.scala:117)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.flink.table.planner.StreamPlanner.translate(StreamPlanner.scala:117)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.translate(TableEnvironmentImpl.java:680)
at 
org.apache.flink.table.api.internal.TableEnvironmentImpl.sqlUpdate(TableEnvironmentImpl.java:493)
at 
org.apache.flink.table.api.java.internal.StreamTableEnvironmentImpl.sqlUpdate(StreamTableEnvironmentImpl.java:331)
at 
org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$applyUpdate$15(LocalExecutor.java:690)
at 
org.apache.flink.table.client.gateway.local.ExecutionContext.wrapClassLoader(ExecutionContext.java:240)
at 
org.apache.flink.table.client.gateway.local.LocalExecutor.applyUpdate(LocalExecutor.java:688)
... 9 more
Caused by: org.apache.flink.table.api.ValidationException: Rowtime field 
'rowtime' has invalid type LocalDateTime. Rowtime attributes must be of type 
Timestamp.
at 
org.apache.flink.table.sources.TableSourceUtil$$anonfun$3.apply

Re: [ANNOUNCE] Zhu Zhu becomes a Flink committer

2019-12-17 Thread Gary Yao
Congratulations, well deserved.

On Tue, Dec 17, 2019 at 10:09 AM lining jing  wrote:

> Congratulations Zhu Zhu~
>
> Andrey Zagrebin  于2019年12月16日周一 下午5:01写道:
>
> > Congrats Zhu Zhu!
> >
> > On Mon, Dec 16, 2019 at 8:10 AM Xintong Song 
> > wrote:
> >
> > > Congratulations Zhu Zhu~
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Mon, Dec 16, 2019 at 12:34 PM Danny Chan 
> > wrote:
> > >
> > > > Congrats Zhu Zhu!
> > > >
> > > > Best,
> > > > Danny Chan
> > > > 在 2019年12月14日 +0800 AM12:51,dev@flink.apache.org,写道:
> > > > >
> > > > > Congrats Zhu Zhu and welcome on board!
> > > >
> > >
> >
>


[jira] [Created] (FLINK-15285) Resuming Externalized Checkpoint (file, async, no parallelism change) E2E test fails on Travis

2019-12-16 Thread Gary Yao (Jira)
Gary Yao created FLINK-15285:


 Summary: Resuming Externalized Checkpoint (file, async, no 
parallelism change) E2E test fails on Travis
 Key: FLINK-15285
 URL: https://issues.apache.org/jira/browse/FLINK-15285
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


{noformat}
==
Running 'Resuming Externalized Checkpoint (file, async, no parallelism change) 
end-to-end test'
==
TEST_DATA_DIR: 
/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-08680387784
Flink dist directory: /home/travis/build/apache/flink/build-target
Starting cluster.
Starting standalonesession daemon on host 
travis-job-9b06ae82-c259-4d51-895c-c3439137e999.
Starting taskexecutor daemon on host 
travis-job-9b06ae82-c259-4d51-895c-c3439137e999.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
Running externalized checkpoints test, with ORIGINAL_DOP=2 NEW_DOP=2 and 
STATE_BACKEND_TYPE=file STATE_BACKEND_FILE_ASYNC=true 
STATE_BACKEND_ROCKSDB_INCREMENTAL=true SIMULATE_FAILURE=false ...
Job (6eb020e02667feb4fd69b2b29be2a9ec) is running.
Waiting for job (6eb020e02667feb4fd69b2b29be2a9ec) to have at least 1 completed 
checkpoints ...
Waiting for job to process up to 200 records, current progress: 93 records ...
Waiting for job to process up to 200 records, current progress: 146 records ...
Cancelling job 6eb020e02667feb4fd69b2b29be2a9ec.
Cancelled job 6eb020e02667feb4fd69b2b29be2a9ec.
Restoring job with externalized checkpoint at 
/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-08680387784/externalized-chckpt-e2e-backend-dir/6eb020e02667feb4fd69b2b29be2a9ec/chk-9
 ...
Job (7604e7773d1f7b54b8b1c8655704aa86) is running.
Waiting for job to process up to 200 records, current progress: 195 records ...
Checking for errors...
Found error in log files:
{noformat}
{noformat}
2019-12-16 11:51:30,518 WARN  
org.apache.flink.streaming.runtime.tasks.StreamTask   - Error while 
canceling task.
java.lang.IllegalStateException: Released
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:483)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.pollNext(SingleInputGate.java:474)
at 
org.apache.flink.runtime.taskmanager.InputGateWithMetrics.pollNext(InputGateWithMetrics.java:75)
at 
org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:125)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:133)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:69)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:311)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:187)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:488)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
at java.lang.Thread.run(Thread.java:748)
{noformat}

https://api.travis-ci.org/v3/job/625630520/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15247) Closing (Testing)MiniCluster may cause ConcurrentModificationException

2019-12-13 Thread Gary Yao (Jira)
Gary Yao created FLINK-15247:


 Summary: Closing (Testing)MiniCluster may cause 
ConcurrentModificationException
 Key: FLINK-15247
 URL: https://issues.apache.org/jira/browse/FLINK-15247
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


{noformat}
at 
org.apache.flink.test.streaming.runtime.BackPressureITCase.tearDown(BackPressureITCase.java:165)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
at 
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
at 
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107)
at 
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83)
at 
org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:158)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: org.apache.flink.util.FlinkException: Error while shutting the 
TaskExecutor down.
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.handleOnStopException(TaskExecutor.java:397)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$onStop$0(TaskExecutor.java:382)
at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$runAfterwardsAsync$16(FutureUtils.java:518)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at 
org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
at 
java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:529)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete

Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-12 Thread Gary Yao
Thanks for your explanation. I think the proposal is reasonable.

On Thu, Dec 12, 2019 at 3:32 AM Yangze Guo  wrote:

> Thanks for the feedback, Gary.
>
> Regarding the WordCount test:
> - True. It adds no test coverage compared to the others.
> However, I think each test case had better not have multiple purposes, so
> that we can find the root cause quickly. As discussed in
> FLINK-15135[1], I prefer to include only the WordCount test as a first
> step. If the time overhead of the E2E tests becomes severe in the future, I
> agree to remove it. WDYT?
> - I think the main overhead comes from building the image. The
> subsequent tests will run fast since they will not build it again.
>
> Regarding the RocksDB test, I think it is a typical scenario using
> off-heap memory. The main purpose is to verify the memory usage and
> memory configuration in Mesos mode. The two typical use cases are off-heap
> and on-heap. Thus, I think the following two test cases are worth
> including:
> - A streaming task using the heap backend. It should explicitly set
> “taskmanager.memory.managed.size” to zero to check for potential
> unexpected usage of off-heap memory.
> - A streaming task using the RocksDB backend. It covers the scenario using
> off-heap memory.
>
> Look forward to your kind feedback.
>
> [1]https://issues.apache.org/jira/browse/FLINK-15135
>
> Best,
> Yangze Guo
>
>
>
> On Wed, Dec 11, 2019 at 6:14 PM Gary Yao  wrote:
> >
> > Thanks for driving this effort. Also +1 from my side. I have left a few
> > questions below.
> >
> > > - Wordcount end-to-end test. For verifying the basic process of Mesos
> > > deployment.
> >
> > Would this add additional test coverage compared to the
> > "multiple submissions" test case? I am asking because the E2E tests are
> > already
> > expensive to run, and adding new tests should be carefully considered.
> >
> > > - State TTL RocksDb backend end-to-end test. For verifying memory
> > > > configuration behaviors, since Mesos has its own config options and
> > > > logic.
> >
> > Can you elaborate more on this? Which config options are relevant here?
> >
> > On Wed, Dec 11, 2019 at 9:58 AM Till Rohrmann 
> wrote:
> >
> > > +1 for building the image locally. If need should arise, then we could
> > > change it always later.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Dec 11, 2019 at 4:05 AM Xintong Song 
> > > wrote:
> > >
> > > > Thanks, Yangze.
> > > >
> > > > +1 for building the image locally.
> > > > The time consumption for both building image locally and pulling it
> from
> > > > DockerHub sounds reasonable and affordable. Therefore, I'm also in
> favor
> > > of
> > > > avoiding the cost of maintaining a custom image.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Wed, Dec 11, 2019 at 10:11 AM Yangze Guo 
> wrote:
> > > >
> > > > > Thanks for the feedback, Yang.
> > > > >
> > > > > Some updates I want to share in this thread.
> > > > > I have built a PoC version of the Mesos e2e test with a WordCount
> > > > > workflow.[1] Then, I ran it in the testing environment. As the
> result
> > > > > shown here[2]:
> > > > > - For pulling image from DockerHub, it took 1 minute and 21 seconds
> > > > > - For building it locally, it took 2 minutes and 54 seconds.
> > > > >
> > > > > I prefer building it locally. Although it is slower, I think the
> time
> > > > > overhead, compared to the cost of maintaining the image in
> DockerHub
> > > > > and the whole test process, is trivial for building or pulling the
> > > > > image.
> > > > >
> > > > > I look forward to hearing from you. ;)
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> https://github.com/KarmaGYZ/flink/commit/0406d942446a1b17f81d93235b21a829bf88ccf0
> > > > > [2]https://travis-ci.org/KarmaGYZ/flink/jobs/623207957
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Mon, Dec 9, 2019 at 2:39 PM Yang Wang 
> > > wrote:
> > > > > >
> > > > > > Thanks Yangze for s

Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-11 Thread Gary Yao
Thanks for driving this effort. Also +1 from my side. I have left a few
questions below.

> - Wordcount end-to-end test. For verifying the basic process of Mesos
> deployment.

Would this add additional test coverage compared to the
"multiple submissions" test case? I am asking because the E2E tests are
already
expensive to run, and adding new tests should be carefully considered.

> - State TTL RocksDb backend end-to-end test. For verifying memory
> configuration behaviors, since Mesos has its own config options and
> logic.

Can you elaborate more on this? Which config options are relevant here?

On Wed, Dec 11, 2019 at 9:58 AM Till Rohrmann  wrote:

> +1 for building the image locally. If need should arise, then we could
> change it always later.
>
> Cheers,
> Till
>
> On Wed, Dec 11, 2019 at 4:05 AM Xintong Song 
> wrote:
>
> > Thanks, Yangze.
> >
> > +1 for building the image locally.
> > The time consumption for both building image locally and pulling it from
> > DockerHub sounds reasonable and affordable. Therefore, I'm also in favor
> of
> > avoiding the cost of maintaining a custom image.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Wed, Dec 11, 2019 at 10:11 AM Yangze Guo  wrote:
> >
> > > Thanks for the feedback, Yang.
> > >
> > > Some updates I want to share in this thread.
> > > I have built a PoC version of the Mesos e2e test with a WordCount
> > > workflow.[1] Then, I ran it in the testing environment. As the result
> > > shown here[2]:
> > > - For pulling image from DockerHub, it took 1 minute and 21 seconds
> > > - For building it locally, it took 2 minutes and 54 seconds.
> > >
> > > I prefer building it locally. Although it is slower, I think the time
> > > overhead, compared to the cost of maintaining the image in DockerHub
> > > and the whole test process, is trivial for building or pulling the
> > > image.
> > >
> > > I look forward to hearing from you. ;)
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > [1]
> > >
> >
> https://github.com/KarmaGYZ/flink/commit/0406d942446a1b17f81d93235b21a829bf88ccf0
> > > [2]https://travis-ci.org/KarmaGYZ/flink/jobs/623207957
> > > Best,
> > > Yangze Guo
> > >
> > > On Mon, Dec 9, 2019 at 2:39 PM Yang Wang 
> wrote:
> > > >
> > > > Thanks Yangze for starting this discussion.
> > > >
> > > > Just share my thoughts.
> > > >
> > > > If the official Mesos Docker image cannot meet our requirements, I
> > > suggest building the image locally.
> > > > We have done the same thing for the YARN e2e tests. This way is more
> > > flexible and easier to maintain. However,
> > > > I have no idea how long building the Mesos image locally will take.
> > > Based on previous experience with YARN, I
> > > > think it should not take too much time.
> > > >
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > On Sat, Dec 7, 2019 at 4:25 PM, Yangze Guo  wrote:
> > > >>
> > > >> Thanks for your feedback!
> > > >>
> > > >> @Till
> > > >> Regarding the time overhead, I think it mainly comes from the network
> > > >> transmission. When building the image locally, it will in total
> download
> > > >> 260 MB of files, including the base image and packages. When pulling from
> > > >> DockerHub, the compressed size of the image is 347 MB. Thus, I agree
> > > >> that it is OK to build the image locally.
> > > >>
> > > >> @Piyush
> > > >> Thank you for offering to help and sharing your usage scenario. At
> > > >> the current stage, it would be really helpful if you could compress
> > > >> the custom image[1] or reduce the time overhead of building it locally.
> > > >> Any ideas for improving test coverage will also be appreciated.
> > > >>
> > > >> [1]
> > >
> >
> https://hub.docker.com/layers/karmagyz/mesos-flink/latest/images/sha256-4e1caefea107818aa11374d6ac8a6e889922c81806f5cd791ead141f18ec7e64
> > > >>
> > > >> Best,
> > > >> Yangze Guo
> > > >>
> > > >> On Sat, Dec 7, 2019 at 3:17 AM Piyush Narang 
> > > wrote:
> > > >> >
> > > >> > +1 from our end as well. At Criteo, we are running some Flink jobs
> > on
> > > Mesos in production to compute short-term features for machine
> learning.
> > > We’d love to help out and contribute to this initiative.
> > > >> >
> > > >> > Thanks,
> > > >> > -- Piyush
> > > >> >
> > > >> >
> > > >> > From: Till Rohrmann 
> > > >> > Date: Friday, December 6, 2019 at 8:10 AM
> > > >> > To: dev 
> > > >> > Cc: user 
> > > >> > Subject: Re: [DISCUSS] Adding e2e tests for Flink's Mesos
> > integration
> > > >> >
> > > >> > Big +1 for adding a fully working e2e test for Flink's Mesos
> > > integration. Ideally we would have it ready for the 1.10 release. The
> > lack
> > > of such a test has bitten us already multiple times.
> > > >> >
> > > >> > In general I would prefer to use the official image if possible
> > since
> > > it frees us from maintaining our own custom image. Since Java 9 is no
> > > longer officially supported, as we opted for supporting Java 11 (LTS), it
> > > might not be feasible, though.
> custom
> > > image vs. 

Re: [ANNOUNCE] Feature freeze for Apache Flink 1.10.0 release

2019-12-09 Thread Gary Yao
Hi all,

We have just created the release-1.10 branch. Please remember to merge bug
fixes to both release-1.10 and master branches from now on if you want the
fix
to be included in the Flink 1.10 release.

Best,
Yu & Gary

On Tue, Nov 19, 2019 at 4:44 PM Yu Li  wrote:

> Hi devs,
>
> Per the feature discussions and progress updates for 1.10.0 [1] [2] [3], we
> hereby
> announce the official feature freeze for Flink 1.10.0 to be on December 8.
> A release feature branch for 1.10 will be cut following that date.
>
> We’re roughly two and a half weeks away from this date, but please keep in
> mind
> that we still shouldn’t rush things. If you feel that there may be problems
> with
> this schedule for the things you are working on, please let us know here.
>
> Cheers,
> Gary and Yu
>
> [1]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Features-for-Apache-Flink-1-10-td32824.html
> [2]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Progress-of-Apache-Flink-1-10-1-td33570.html
> [3]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Progress-of-Apache-Flink-1-10-2-td34585.html
>


[jira] [Created] (FLINK-15163) japicmp should use 1.9 as the old version

2019-12-09 Thread Gary Yao (Jira)
Gary Yao created FLINK-15163:


 Summary: japicmp should use 1.9 as the old version
 Key: FLINK-15163
 URL: https://issues.apache.org/jira/browse/FLINK-15163
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


We should configure the japicmp-maven-plugin to use the latest Flink 1.9 
release as the reference version to compare against. Currently 1.8.0 is used.
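The change could be sketched as follows in the japicmp-maven-plugin configuration (illustrative only; the actual plugin setup in Flink's parent pom.xml and the exact 1.9 patch version may differ):

```xml
<plugin>
  <groupId>com.github.siom79.japicmp</groupId>
  <artifactId>japicmp-maven-plugin</artifactId>
  <configuration>
    <oldVersion>
      <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>${project.artifactId}</artifactId>
        <!-- reference version: latest 1.9 release instead of 1.8.0 -->
        <version>1.9.1</version>
      </dependency>
    </oldVersion>
  </configuration>
</plugin>
```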





[jira] [Created] (FLINK-15082) Mesos App Master does not respect taskmanager.memory.total-process.size

2019-12-05 Thread Gary Yao (Jira)
Gary Yao created FLINK-15082:


 Summary: Mesos App Master does not respect 
taskmanager.memory.total-process.size
 Key: FLINK-15082
 URL: https://issues.apache.org/jira/browse/FLINK-15082
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


*Description*
 When the Mesos App Master is started with 
{{taskmanager.memory.total-process.size}}, [the value is not 
respected|https://github.com/apache/flink/blob/d08beaa3255b3df96afe35f17e257df31a0d71ed/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosTaskManagerParameters.java#L339].
 

One can reproduce this when starting the App Master with the command below:
{noformat}
/bin/mesos-appmaster.sh \ 
-Dtaskmanager.memory.total-process.size=2048m \
-Djobmanager.heap.size=2048m \
...
{noformat}
The ClusterEntrypoint will fail with an exception (see below). The reason is 
that the default value of {{mesos.resourcemanager.tasks.mem}} is taken as 
the total process memory size (1024 MB).
{noformat}
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint MesosSessionClusterEntrypoint.
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:518)
at 
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:126)
Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
at 
org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:261)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
... 2 more
Caused by: org.apache.flink.configuration.IllegalConfigurationException: Sum of 
configured Framework Heap Memory (134217728 bytes), Framework Off-Heap Memory 
(134217728 bytes), Task Off-Heap Memory (0 bytes), Managed Memory (719407031 
bytes) and Shuffle Memory (80530638 bytes) exceed configured Total Flink Memory 
(805306368 bytes).
at 
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveInternalMemoryFromTotalFlinkMemory(TaskExecutorResourceUtils.java:273)
at 
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithTotalProcessMemory(TaskExecutorResourceUtils.java:210)
at 
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:108)
at 
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:94)
at 
org.apache.flink.mesos.runtime.clusterframework.MesosTaskManagerParameters.create(MesosTaskManagerParameters.java:341)
at 
org.apache.flink.mesos.util.MesosUtils.createTmParameters(MesosUtils.java:109)
at 
org.apache.flink.mesos.runtime.clusterframework.MesosResourceManagerFactory.createActiveResourceManager(MesosResourceManagerFactory.java:80)
at 
org.apache.flink.runtime.resourcemanager.ActiveResourceManagerFactory.createResourceManager(ActiveResourceManagerFactory.java:58)
at 
org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:170)
... 9 more
{noformat}
*Expected Behavior*
 * If taskmanager.memory.total-process.size and mesos.resourcemanager.tasks.mem 
are both set and differ in their values, an exception should be thrown
 * If only taskmanager.memory.total-process.size is set and 
mesos.resourcemanager.tasks.mem is not set, then the value configured by the 
former should be respected
 * If only mesos.resourcemanager.tasks.mem is set and 
taskmanager.memory.total-process.size is not set, then the value configured by 
the former should be respected
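The three rules above can be sketched as a small reconciliation function. This is a hypothetical illustration only, not Flink's actual API; the class, method, and parameter names are invented here:

```java
import java.util.OptionalInt;

public class MesosMemoryConfig {

    /**
     * Hypothetical sketch of the expected behavior: returns the effective
     * total process memory in MB. Conflicting values throw; otherwise the
     * single configured value is respected.
     */
    public static int effectiveTotalProcessMb(OptionalInt totalProcessSizeMb,
                                              OptionalInt mesosTasksMemMb) {
        if (totalProcessSizeMb.isPresent() && mesosTasksMemMb.isPresent()) {
            // Rule 1: both set and differing is a configuration error.
            if (totalProcessSizeMb.getAsInt() != mesosTasksMemMb.getAsInt()) {
                throw new IllegalArgumentException(
                        "taskmanager.memory.total-process.size and "
                                + "mesos.resourcemanager.tasks.mem are both set but differ");
            }
            return totalProcessSizeMb.getAsInt();
        }
        // Rules 2 and 3: whichever option is set alone is respected.
        if (totalProcessSizeMb.isPresent()) {
            return totalProcessSizeMb.getAsInt();
        }
        if (mesosTasksMemMb.isPresent()) {
            return mesosTasksMemMb.getAsInt();
        }
        throw new IllegalArgumentException("No total process memory size configured");
    }

    public static void main(String[] args) {
        // Only total-process.size set: its value is respected.
        System.out.println(effectiveTotalProcessMb(OptionalInt.of(2048), OptionalInt.empty()));
    }
}
```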





[jira] [Created] (FLINK-15058) Log required config keys if TaskManager memory configuration is invalid

2019-12-04 Thread Gary Yao (Jira)
Gary Yao created FLINK-15058:


 Summary: Log required config keys if TaskManager memory 
configuration is invalid
 Key: FLINK-15058
 URL: https://issues.apache.org/jira/browse/FLINK-15058
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


Currently the error message is
{noformat}
Either Task Heap Memory size and Managed Memory size, or Total Flink Memory 
size, or Total Process Memory size need to be configured explicitly
{noformat}
However, it would be good to immediately see which config keys are expected to 
be configured.





[jira] [Created] (FLINK-15057) Set taskmanager.memory.total-process.size in jepsen tests

2019-12-04 Thread Gary Yao (Jira)
Gary Yao created FLINK-15057:


 Summary: Set taskmanager.memory.total-process.size in jepsen tests
 Key: FLINK-15057
 URL: https://issues.apache.org/jira/browse/FLINK-15057
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao


Set {{taskmanager.memory.total-process.size}} in the {{flink-conf.yaml}} used by 
the tests. Currently, the TaskManager process fails due to
{noformat}
org.apache.flink.configuration.IllegalConfigurationException: Either Task Heap 
Memory size and Managed Memory size, or Total Flink Memory size, or Total 
Process Memory size need to be configured explicitly.
at 
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:110)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerServicesConfiguration.fromConfiguration(TaskManagerServicesConfiguration.java:219)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:357)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.(TaskManagerRunner.java:153)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:327)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner$1.call(TaskManagerRunner.java:298)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner$1.call(TaskManagerRunner.java:295)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:295)
{noformat}
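A minimal fix would be a one-line addition to the flink-conf.yaml deployed by the Jepsen test harness (the value below is illustrative; the actual size chosen for the tests may differ):

```yaml
# Configure total process memory explicitly, as required since Flink 1.10.
taskmanager.memory.total-process.size: 2048m
```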





[jira] [Created] (FLINK-15045) SchedulerBase should only log the RestartStrategy in legacy scheduling mode

2019-12-03 Thread Gary Yao (Jira)
Gary Yao created FLINK-15045:


 Summary: SchedulerBase should only log the RestartStrategy in 
legacy scheduling mode
 Key: FLINK-15045
 URL: https://issues.apache.org/jira/browse/FLINK-15045
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


With the new (NG) scheduler, we configure {{ThrowingRestartStrategy}} in the 
execution graph to assert that legacy code paths are not executed. Hence, the 
restart strategy should not be logged in NG scheduling mode. Currently, the 
following obsolete log message is logged:

{noformat}
2019-12-03 20:22:14,426 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
- Using restart strategy 
org.apache.flink.runtime.executiongraph.restart.ThrowingRestartStrategy@44003e98
 for General purpose test job (9af8d0845d449f3c3447f817b2150bc8).
{noformat}





[jira] [Created] (FLINK-15008) Tests in flink-yarn-tests fail with ClassNotFoundException (JDK11)

2019-12-02 Thread Gary Yao (Jira)
Gary Yao created FLINK-15008:


 Summary: Tests in flink-yarn-tests fail with 
ClassNotFoundException (JDK11)
 Key: FLINK-15008
 URL: https://issues.apache.org/jira/browse/FLINK-15008
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


{noformat}
1) Error injecting constructor, java.lang.NoClassDefFoundError: 
javax/activation/DataSource
  at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver.(JAXBContextResolver.java:41)
  at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebApp.setup(RMWebApp.java:51)
  while locating 
org.apache.hadoop.yarn.server.resourcemanager.webapp.JAXBContextResolver

1 error
at com.google.inject.internal.InjectorImpl$4.get(InjectorImpl.java:987)
at 
com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1013)
at 
com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory$GuiceInstantiatedComponentProvider.getInstance(GuiceComponentProviderFactory.java:332)
at 
com.sun.jersey.core.spi.component.ioc.IoCProviderFactory$ManagedSingleton.(IoCProviderFactory.java:179)
at 
com.sun.jersey.core.spi.component.ioc.IoCProviderFactory.wrap(IoCProviderFactory.java:100)
at 
com.sun.jersey.core.spi.component.ioc.IoCProviderFactory._getComponentProvider(IoCProviderFactory.java:93)
at 
com.sun.jersey.core.spi.component.ProviderFactory.getComponentProvider(ProviderFactory.java:153)
at 
com.sun.jersey.core.spi.component.ProviderServices.getComponent(ProviderServices.java:251)
at 
com.sun.jersey.core.spi.component.ProviderServices.getProviders(ProviderServices.java:148)
at 
com.sun.jersey.core.spi.factory.ContextResolverFactory.init(ContextResolverFactory.java:83)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl._initiate(WebApplicationImpl.java:1271)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.access$700(WebApplicationImpl.java:169)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl$13.f(WebApplicationImpl.java:775)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl$13.f(WebApplicationImpl.java:771)
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)
at 
com.sun.jersey.server.impl.application.WebApplicationImpl.initiate(WebApplicationImpl.java:771)
at 
com.sun.jersey.guice.spi.container.servlet.GuiceContainer.initiate(GuiceContainer.java:121)
at 
com.sun.jersey.spi.container.servlet.ServletContainer$InternalWebComponent.initiate(ServletContainer.java:318)
at 
com.sun.jersey.spi.container.servlet.WebComponent.load(WebComponent.java:609)
at 
com.sun.jersey.spi.container.servlet.WebComponent.init(WebComponent.java:210)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:373)
at 
com.sun.jersey.spi.container.servlet.ServletContainer.init(ServletContainer.java:710)
at 
com.google.inject.servlet.FilterDefinition.init(FilterDefinition.java:114)
at 
com.google.inject.servlet.ManagedFilterPipeline.initPipeline(ManagedFilterPipeline.java:98)
at com.google.inject.servlet.GuiceFilter.init(GuiceFilter.java:172)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:936)
... 50 more
Caused by: java.lang.NoClassDefFoundError: javax/activation/DataSource
at 
com.sun.xml.bind.v2.model.impl.RuntimeBuiltinLeafInfoImpl.(RuntimeBuiltinLeafInfoImpl.java:457)
at 
com.sun.xml.bind.v2.model.impl.RuntimeTypeInfoSetImpl.(RuntimeTypeInfoSetImpl.java:65)
at 
com.sun.xml.bind.v2

[jira] [Created] (FLINK-14940) Travis build passes despite Test failures

2019-11-25 Thread Gary Yao (Jira)
Gary Yao created FLINK-14940:


 Summary: Travis build passes despite Test failures
 Key: FLINK-14940
 URL: https://issues.apache.org/jira/browse/FLINK-14940
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Affects Versions: 1.10.0
Reporter: Gary Yao


Build https://travis-ci.org/apache/flink/jobs/616462870 is green despite the 
presence of test failures.





[jira] [Created] (FLINK-14939) StreamingKafkaITCase fails due to distDir property not being set

2019-11-25 Thread Gary Yao (Jira)
Gary Yao created FLINK-14939:


 Summary: StreamingKafkaITCase fails due to distDir property not 
being set
 Key: FLINK-14939
 URL: https://issues.apache.org/jira/browse/FLINK-14939
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


https://api.travis-ci.org/v3/job/616462870/log.txt

{noformat}
08:12:34.965 [INFO] ---
08:12:34.965 [INFO]  T E S T S
08:12:34.965 [INFO] ---
08:12:35.868 [INFO] Running 
org.apache.flink.tests.util.kafka.StreamingKafkaITCase
08:12:35.893 [ERROR] Tests run: 3, Failures: 3, Errors: 0, Skipped: 0, Time 
elapsed: 0.02 s <<< FAILURE! - in 
org.apache.flink.tests.util.kafka.StreamingKafkaITCase
08:12:35.893 [ERROR] testKafka[0: 
kafka-version:0.10.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) 
 Time elapsed: 0.009 s  <<< FAILURE!
java.lang.AssertionError: The distDir property was not set. You can set it when 
running maven via -DdistDir= .
at 
org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71)

08:12:35.893 [ERROR] testKafka[1: 
kafka-version:0.11.0.2](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) 
 Time elapsed: 0.001 s  <<< FAILURE!
java.lang.AssertionError: The distDir property was not set. You can set it when 
running maven via -DdistDir= .
at 
org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71)

08:12:35.893 [ERROR] testKafka[2: 
kafka-version:2.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase)  
Time elapsed: 0.001 s  <<< FAILURE!
java.lang.AssertionError: The distDir property was not set. You can set it when 
running maven via -DdistDir= .
at 
org.apache.flink.tests.util.kafka.StreamingKafkaITCase.(StreamingKafkaITCase.java:71)

08:12:36.233 [INFO] 
08:12:36.233 [INFO] Results:
08:12:36.233 [INFO] 
08:12:36.233 [ERROR] Failures: 
08:12:36.233 [ERROR]   StreamingKafkaITCase.:71 The distDir property was 
not set. You can set it when running maven via -DdistDir= .
08:12:36.233 [ERROR]   StreamingKafkaITCase.:71 The distDir property was 
not set. You can set it when running maven via -DdistDir= .
08:12:36.233 [ERROR]   StreamingKafkaITCase.:71 The distDir property was 
not set. You can set it when running maven via -DdistDir= .
08:12:36.233 [INFO] 
08:12:36.233 [ERROR] Tests run: 3, Failures: 3, Errors: 0, Skipped: 0
08:12:36.233 [INFO] 
{noformat}





[jira] [Created] (FLINK-14929) ContinuousFileProcessingCheckpointITCase sporadically fails due to FileNotFoundException

2019-11-22 Thread Gary Yao (Jira)
Gary Yao created FLINK-14929:


 Summary: ContinuousFileProcessingCheckpointITCase sporadically 
fails due to FileNotFoundException
 Key: FLINK-14929
 URL: https://issues.apache.org/jira/browse/FLINK-14929
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Common, Tests
Affects Versions: 1.10.0
 Environment: rev: c7ae2b8559c0fe4cf17613249db0c2857bb91d94
Reporter: Gary Yao


*Description*
The test fails locally approximately 1 in 200 runs.

*Stacktrace*
{noformat}
org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 
73b2ee613731d76fb9ab20142c3ba52f)
 at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:144)
 at 
org.apache.flink.test.checkpointing.StreamFaultToleranceTestBase.runCheckpointedProgram(StreamFaultToleranceTestBase.java:132)
 at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
 at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
 at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
 at org.junit.runners.Suite.runChild(Suite.java:128)
 at org.junit.runners.Suite.runChild(Suite.java:27)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
 at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
 at 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
 at 
com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:54)
 at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
 at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
 at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
 at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:142)
 ... 34 more
Caused by: java.lang.IllegalStateException: Cannot process mail Report 
throwable java.io.FileNotFoundException: File 
file:/home/gary/code/flink/flink-tests/target/localfs/fs_tests/..file3.crc does 
not exist or the user running Flink ('gary') has insufficient permissions to 
access it.
 at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:68)
 at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:213)
 at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:154)
 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:445)
 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:702)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:527)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.WrappingRuntimeException: 
java.io.FileNotFoundException: File 
file:/home/gary/code/flink/flink-tests/target/localfs/fs_tests/..file3.crc does 
not exist or the user running Flink ('gary') has insufficient permissions to 
access it.
 at 
org.apache.flink.util.WrappingRuntimeException.wrapIfNecessary(WrappingRuntimeException.java:65)
 at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.lambda$reportThrowable$0(MailboxProcessor.java:166)
{noformat}

Re: [VOTE] FLIP-83: Flink End-to-end Performance Testing Framework

2019-11-21 Thread Gary Yao
+1 (binding)

Best,
Gary

On Thu, Nov 21, 2019 at 7:55 AM Zhu Zhu  wrote:

> Thanks Yu for this proposal.
> +1(non-binding) for the overall design.
>
> I have 2 questions about the result check though. Posted in the discussion
> ML thread.
>
> Thanks,
> Zhu Zhu
>
> Becket Qin  wrote on Thursday, Nov 21, 2019 at 11:57 AM:
>
> > Hi Yu,
> >
> > Thanks for updating the FLIP wiki. +1 from me. I do not have further
> > questions.
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Wed, Nov 20, 2019 at 8:58 PM Yu Li  wrote:
> >
> > > Thanks all for the vote and further discussions.
> > >
> > > I have updated the FLIP document according to aihua's reply (with some
> > > minor refinement/supplements). For your convenience, below are the
> > contents
> > > I added:
> > > **
> > >
> > > *There are also dimensions other than Flink characteristics,
> > > including:*
> > >
> > >- *Record size: to check both the processing (records/s) and data
> > >(bytes/s) throughput, we will test the 10B, 100B and 1KB record size
> > for
> > >each test job.*
> > >- *Resource for each task: we will use the Flink default settings to
> > >cover the most used cases.*
> > >- *Job Parallelism: we will increase the parallelism to saturate the
> > >system until back-pressure.*
> > >- *Source and Sink: to focus on Flink performance, we generate the
> > >source data randomly and use a blackhole consumer as the sink.*
> > >
> > > **
> > >
> > > @becket.qin  please let us know if the updates
> > look
> > > good to you to turn your voting to a complete binding +1, or if any
> more
> > > concerns/comments. Thanks!
> > >
> > > @voters: please also take a look at the updates and let us know if any
> > new
> > > comments. Thanks.
> > >
> > > Best Regards,
> > > Yu
> > >
> > >
> > > On Wed, 20 Nov 2019 at 11:10, aihua li  wrote:
> > >
> > >> Hi Becket,
> > >>
> > >> Thanks for the comments!
> > >>
> > >> > 1. How do the testing records look like? The size and key
> > distributions.
> > >>
> > >> The records look like a long string whose default size is 1 KB. The
> > >> key is randomly generated within the specified range, plus a fixed
> > >> string, to ensure that the data is evenly distributed across tasks.
> > >>
> > >> > 2. The resources for each task.
> > >>
> > >> The resources for each task use the default values.
> > >>
> > >> > 3. The intended configuration for the jobs.
> > >>
> > >> The parallelism of the test job will be adjusted according to resource
> > >> conditions to fill the cluster as much as possible. Other
> > >> configurations are not supported at this time.
> > >>
> > >> > 4. What exact source and sink it would use.
> > >>
> > >> To reduce dependence on external systems, the source data is
> > >> generated randomly, and the sink supports only HDFS or no sink at all.
> > >>
> > >> We will add these details to the FLIP later.
> > >>
> > >>
> > >> > On Nov 18, 2019, at 7:59 PM, Becket Qin  wrote:
> > >> >
> > >> > +1 (binding) on having the test suite.
> > >> >
> > >> > BTW, it would be good to have a few more details about the
> performance
> > >> > tests. For example:
> > >> > 1. How do the testing records look like? The size and key
> > distributions.
> > >> > 2. The resources for each task.
> > >> > 3. The intended configuration for the jobs.
> > >> > 4. What exact source and sink it would use.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jiangjie (Becket) Qin
> > >> >
> > >> > On Mon, Nov 18, 2019 at 7:25 PM Zhijiang <
> wangzhijiang...@aliyun.com
> > >> .invalid>
> > >> > wrote:
> > >> >
> > >> >> +1 (binding)!
> > >> >>
> > >> >> It is a good thing to enhance our testing work.
> > >> >>
> > >> >> Best,
> > >> >> Zhijiang
> > >> >>
> > >> >>
> > >> >> --
> > >> >> From:Hequn Cheng 
> > >> >> Send Time:2019 Nov. 18 (Mon.) 18:22
> > >> >> To:dev 
> > >> >> Subject:Re: [VOTE] FLIP-83: Flink End-to-end Performance Testing
> > >> Framework
> > >> >>
> > >> >> +1 (binding)!
> > >> >> I think this would be very helpful to detect regression problems.
> > >> >>
> > >> >> Best, Hequn
> > >> >>
> > >> >> On Mon, Nov 18, 2019 at 4:28 PM vino yang 
> > >> wrote:
> > >> >>
> > >> >>> +1 (non-binding)
> > >> >>>
> > >> >>> Best,
> > >> >>> Vino
> > >> >>>
> > >> >>> jincheng sun  wrote on Monday, Nov 18, 2019 at 2:31 PM:
> > >> >>>
> > >>  +1  (binding)
> > >> 
> > >>  OpenInx  wrote on Monday, Nov 18, 2019 at 12:09 PM:
> > >> 
> > >> > +1  (non-binding)
> > >> >
> > >> > On Mon, Nov 18, 2019 at 11:54 AM aihua li <
> liaihua1...@gmail.com>
> > >> >>> wrote:
> > >> >
> > >> >> +1  (non-binding)
> > >> >>
> > >> >> Thanks Yu Li for driving on this.
> > >> >>
> > >> >>> On Nov 15, 2019, at 8:10 PM, Yu Li  wrote:
> > >> >>>
> > >> >>> Hi All,
> > >> >>>
> > >> >>> I would like to start the vote for FLIP-83 [1] which is
> > discussed
> > >> >>> and
> > >> >>> reached 

[jira] [Created] (FLINK-14894) HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap failed on Travis

2019-11-21 Thread Gary Yao (Jira)
Gary Yao created FLINK-14894:


 Summary: HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap 
failed on Travis
 Key: FLINK-14894
 URL: https://issues.apache.org/jira/browse/FLINK-14894
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


{noformat}
HybridOffHeapUnsafeMemorySegmentTest>MemorySegmentTestBase.testByteBufferWrapping:2465
 expected:<992288337> but was:<196608>
{noformat}

https://api.travis-ci.com/v3/job/258950527/log.txt





[jira] [Created] (FLINK-14859) Avoid leaking unassigned Slot in DefaultScheduler when Deployment is outdated

2019-11-19 Thread Gary Yao (Jira)
Gary Yao created FLINK-14859:


 Summary: Avoid leaking unassigned Slot in DefaultScheduler when 
Deployment is outdated
 Key: FLINK-14859
 URL: https://issues.apache.org/jira/browse/FLINK-14859
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


In {{DefaultScheduler#assignResourceOrHandleError()}}, if the deployment is 
outdated, we should release the possibly acquired {{LogicalSlot}} so that we do 
not leak resources.

Below is an example to illustrate how a slot leak is currently possible:
# Vertices A1, A2, A3 are scheduled in a batch.
# A2 acquires a slot. A1, A3 do not.
# A1 fails due to slot allocation timeout and triggers failover 
({{DefaultScheduler#cancelTasksAsync}})
# A2 is canceled first and its returned slot is assigned to A3, which triggers 
{{DefaultScheduler#assignResourceOrHandleError}} of A3.
  However, A3 is not canceled yet, but its deployment is outdated because 
{{executionVertexVersioner#recordVertexModifications}} was already invoked.
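The proposed fix can be sketched as follows. This is a minimal, self-contained simulation with hypothetical, simplified names (the real logic lives in {{DefaultScheduler#assignResourceOrHandleError}}); it only illustrates the idea of releasing the slot when the deployment's recorded vertex version is outdated:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch (hypothetical names, not Flink's actual API) of releasing an
 * acquired slot when the deployment it was meant for is already outdated.
 */
class SlotAssignmentSketch {

    static final class LogicalSlot {
        boolean released;
        void release() { released = true; }
    }

    /** Slots currently assigned to deployments. */
    final List<LogicalSlot> assignedSlots = new ArrayList<>();

    /** Version bumped on failover, standing in for the vertex versioner. */
    int currentVertexVersion;

    void assignResourceOrHandleError(int deploymentVersion, LogicalSlot slot) {
        if (deploymentVersion != currentVertexVersion) {
            // Deployment is outdated (e.g. A3 in the scenario above):
            // without this release, the acquired slot would leak.
            slot.release();
            return;
        }
        assignedSlots.add(slot); // continue with the normal deployment path
    }

    public static void main(String[] args) {
        SlotAssignmentSketch scheduler = new SlotAssignmentSketch();
        LogicalSlot slot = new LogicalSlot();
        scheduler.currentVertexVersion = 1;             // failover already happened
        scheduler.assignResourceOrHandleError(0, slot); // assignment scheduled earlier
        System.out.println(slot.released);              // prints: true
    }
}
```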





[jira] [Created] (FLINK-14843) Streaming bucketing end-to-end test can fail with Output hash mismatch

2019-11-18 Thread Gary Yao (Jira)
Gary Yao created FLINK-14843:


 Summary: Streaming bucketing end-to-end test can fail with Output 
hash mismatch
 Key: FLINK-14843
 URL: https://issues.apache.org/jira/browse/FLINK-14843
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem, Tests
Affects Versions: 1.10.0
 Environment: rev: dcc1330375826b779e4902176bb2473704dabb11
Reporter: Gary Yao


*Description*
Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can fail 
with Output hash mismatch.

{noformat}
Number of running task managers has reached 4.
Job (67212178694f8b2a9bc9d9572567a53f) is running.
Waiting until all values have been produced
Truncating buckets
Number of produced values 26325/6
Truncating buckets
Number of produced values 31315/6
Truncating buckets
Number of produced values 36735/6
Truncating buckets
Number of produced values 40705/6
Truncating buckets
Number of produced values 46125/6
Truncating buckets
Number of produced values 51135/6
Truncating buckets
Number of produced values 56555/6
Truncating buckets
Number of produced values 61935/6
Cancelling job 67212178694f8b2a9bc9d9572567a53f.
Cancelled job 67212178694f8b2a9bc9d9572567a53f.
Waiting for job (67212178694f8b2a9bc9d9572567a53f) to reach terminal state 
CANCELED ...
Job (67212178694f8b2a9bc9d9572567a53f) reached terminal state CANCELED
Job 67212178694f8b2a9bc9d9572567a53f was cancelled, time to verify
FAIL Bucketing Sink: Output hash mismatch.  Got 
4e2d1859e41184a38e5bc95090fe9941, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
head hexdump of actual:
000   (   2   ,   1   0   ,   0   ,   S   o   m   e   p   a   y
010   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   1
020   ,   S   o   m   e   p   a   y   l   o   a   d   .   .   .
030   )  \n   (   2   ,   1   0   ,   2   ,   S   o   m   e   p
040   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0
050   ,   3   ,   S   o   m   e   p   a   y   l   o   a   d   .
060   .   .   )  \n   (   2   ,   1   0   ,   4   ,   S   o   m   e
070   p   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,
080   1   0   ,   5   ,   S   o   m   e   p   a   y   l   o   a
090   d   .   .   .   )  \n   (   2   ,   1   0   ,   6   ,   S   o
0a0   m   e   p   a   y   l   o   a   d   .   .   .   )  \n   (
0b0   2   ,   1   0   ,   7   ,   S   o   m   e   p   a   y   l
0c0   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   8   ,
0d0   S   o   m   e   p   a   y   l   o   a   d   .   .   .   )
0e0  \n   (   2   ,   1   0   ,   9   ,   S   o   m   e   p   a
0f0   y   l   o   a   d   .   .   .   )  \n
0fa
Stopping taskexecutor daemon (pid: 654547) on host gyao-desktop.
Stopping standalonesession daemon (pid: 650368) on host gyao-desktop.
Stopping taskexecutor daemon (pid: 650812) on host gyao-desktop.
Skipping taskexecutor daemon (pid: 651347), because it is not running anymore 
on gyao-desktop.
Skipping taskexecutor daemon (pid: 651795), because it is not running anymore 
on gyao-desktop.
Skipping taskexecutor daemon (pid: 652249), because it is not running anymore 
on gyao-desktop.
Stopping taskexecutor daemon (pid: 653481) on host gyao-desktop.
Stopping taskexecutor daemon (pid: 654099) on host gyao-desktop.
[FAIL] Test script contains errors.
Checking of logs skipped.

[FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh' failed 
after 2 minutes and 3 seconds! Test exited with exit code 1
{noformat}


*How to reproduce*
Comment out the delay of 10s after the 1st TM is restarted to provoke the issue:

{code:bash}
echo "Restarting 1 TM"
$FLINK_DIR/bin/taskmanager.sh start
wait_for_number_of_running_tms 4

#sleep 10

echo "Killing 2 TMs"
kill_random_taskmanager
kill_random_taskmanager
wait_for_number_of_running_tms 2
{code}

Command to run the test:
{noformat}
FLINK_DIR=build-target/ flink-end-to-end-tests/run-single-test.sh skip 
flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh
{noformat}








[jira] [Created] (FLINK-14834) Running Kerberized YARN on Docker test (custom fs plugin) fails on Travis

2019-11-17 Thread Gary Yao (Jira)
Gary Yao created FLINK-14834:


 Summary: Running Kerberized YARN on Docker test (custom fs plugin) 
fails on Travis
 Key: FLINK-14834
 URL: https://issues.apache.org/jira/browse/FLINK-14834
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


https://api.travis-ci.org/v3/job/612782888/log.txt





[jira] [Created] (FLINK-14823) Enable 'Streaming bucketing end-to-end test' to pass with new DefaultScheduler

2019-11-15 Thread Gary Yao (Jira)
Gary Yao created FLINK-14823:


 Summary: Enable 'Streaming bucketing end-to-end test' to pass with 
new DefaultScheduler
 Key: FLINK-14823
 URL: https://issues.apache.org/jira/browse/FLINK-14823
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0








[jira] [Created] (FLINK-14822) Enable 'Streaming File Sink end-to-end test' to pass with new DefaultScheduler

2019-11-15 Thread Gary Yao (Jira)
Gary Yao created FLINK-14822:


 Summary: Enable 'Streaming File Sink end-to-end test' to pass with 
new DefaultScheduler
 Key: FLINK-14822
 URL: https://issues.apache.org/jira/browse/FLINK-14822
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


The test fails because we exhaust the number of restarts (3). 





[jira] [Created] (FLINK-14821) Enable 'Queryable state (rocksdb) with TM restart' E2E test to pass with new DefaultScheduler

2019-11-15 Thread Gary Yao (Jira)
Gary Yao created FLINK-14821:


 Summary: Enable 'Queryable state (rocksdb) with TM restart' E2E 
test to pass with new DefaultScheduler
 Key: FLINK-14821
 URL: https://issues.apache.org/jira/browse/FLINK-14821
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


The test expects that the job transitions from {{RESTARTING}} to {{CREATED}}. 
However, the {{DefaultScheduler}} will not transition the job to {{CREATED}}. 
Moreover, waiting for the transition is not needed because the loss of the TM 
is already verified by
{code}
wait_for_number_of_running_tms 0
{code}





[jira] [Created] (FLINK-14809) DataStreamAllroundTestProgram does not run because return types cannot be determined

2019-11-15 Thread Gary Yao (Jira)
Gary Yao created FLINK-14809:


 Summary: DataStreamAllroundTestProgram does not run because return 
types cannot be determined
 Key: FLINK-14809
 URL: https://issues.apache.org/jira/browse/FLINK-14809
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


{noformat}
2019-11-14 19:34:55,185 ERROR org.apache.flink.client.cli.CliFrontend   
- Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: The return type of function 
'main(DataStreamAllroundTestProgram.java:182)' could not be determined 
automatically, due to type erasure. You can give type information hints by 
using the returns(...) method on the result of the transformation call, or by 
letting your function implement the 'ResultTypeQueryable' interface.
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:336)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:206)
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:173)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:747)
at 
org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:282)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:219)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1011)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1084)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1084)
Caused by: org.apache.flink.api.common.functions.InvalidTypesException: The 
return type of function 'main(DataStreamAllroundTestProgram.java:182)' could 
not be determined automatically, due to type erasure. You can give type 
information hints by using the returns(...) method on the result of the 
transformation call, or by letting your function implement the 
'ResultTypeQueryable' interface.
at 
org.apache.flink.api.dag.Transformation.getOutputType(Transformation.java:412)
at 
org.apache.flink.streaming.api.datastream.DataStream.addSink(DataStream.java:1296)
at 
org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:322)
... 12 more
Caused by: org.apache.flink.api.common.functions.InvalidTypesException: Input 
mismatch: Generic type 'org.apache.flink.streaming.tests.Event' or a subclass 
of it expected but was 'org.apache.flink.streaming.tests.Event'.
at 
org.apache.flink.api.java.typeutils.TypeExtractor.validateInputType(TypeExtractor.java:1298)
at 
org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:585)
at 
org.apache.flink.api.java.typeutils.TypeExtractor.getFlatMapReturnTypes(TypeExtractor.java:196)
at 
org.apache.flink.streaming.api.datastream.DataStream.flatMap(DataStream.java:634)
at 
org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:182)
... 17 more
Caused by: org.apache.flink.api.common.functions.InvalidTypesException: Generic 
type 'org.apache.flink.streaming.tests.Event' or a subclass of it expected but 
was 'org.apache.flink.streaming.tests.Event'.
at 
org.apache.flink.api.java.typeutils.TypeExtractor.validateInfo(TypeExtractor.java:1481)
at 
org.apache.flink.api.java.typeutils.TypeExtractor.validateInfo(TypeExtractor.java:1491)
at 
org.apache.flink.api.java.typeutils.TypeExtractor.validateInputType(TypeExtractor.java:1295)
... 21 more
{noformat}

This happens in multiple nightlies and jepsen runs. Example
https://api.travis-ci.org/v3/job/611848582/log.txt
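The {{returns(...)}} hint that the error message suggests works around Java type erasure. The snippet below is a plain-Java illustration, independent of Flink, of why a framework inspecting a lambda at runtime cannot recover its generic return type:

```java
import java.lang.reflect.Method;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration (independent of Flink) of the type erasure the
// error message refers to: at runtime, the lambda's generic signature is
// erased, so a framework inspecting the functional method only sees Object.
class ErasureDemo {
    static final Function<String, List<String>> toList = s -> List.of(s);

    public static void main(String[] args) throws Exception {
        Method apply = toList.getClass().getMethod("apply", Object.class);
        // The declared return type List<String> is gone at this point:
        System.out.println(apply.getReturnType().getSimpleName()); // prints: Object
    }
}
```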





[jira] [Created] (FLINK-14780) Avoid leaking instance of DefaultScheduler before object is constructed

2019-11-14 Thread Gary Yao (Jira)
Gary Yao created FLINK-14780:


 Summary: Avoid leaking instance of DefaultScheduler before object 
is constructed
 Key: FLINK-14780
 URL: https://issues.apache.org/jira/browse/FLINK-14780
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


An instance of {{DefaultScheduler}} may leak to a metric reporter thread before 
the object is fully constructed. This can lead to an NPE (see below).

https://api.travis-ci.org/v3/job/611597698/log.txt

{noformat}
java.lang.NullPointerException
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.getNumberOfRestarts(DefaultScheduler.java:156)
at 
org.apache.flink.metrics.slf4j.Slf4jReporter.tryReport(Slf4jReporter.java:114)
at 
org.apache.flink.metrics.slf4j.Slf4jReporter.report(Slf4jReporter.java:80)
at 
org.apache.flink.runtime.metrics.MetricRegistryImpl$ReporterTask.run(MetricRegistryImpl.java:436)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
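The underlying pitfall is the classic "this escapes the constructor" bug. The following minimal simulation uses hypothetical names and a synchronous call (for determinism) in place of the real reporter thread; it shows how a gauge registered mid-construction can observe a not-yet-assigned field:

```java
// Simulation (hypothetical names) of the "this escapes the constructor" bug:
// a gauge capturing 'this' is registered before all fields are assigned, so
// the metric reporter can observe a null field and trigger an NPE.
class ThisEscapeDemo {

    interface Gauge { long getValue(); }

    static String lastReport; // records what the "reporter" observed

    // Stands in for the metric reporter thread firing mid-construction.
    static void report(Gauge gauge) {
        try {
            lastReport = "restarts = " + gauge.getValue();
        } catch (NullPointerException e) {
            lastReport = "NPE: gauge read before construction finished";
        }
        System.out.println(lastReport);
    }

    static class Scheduler {
        private final String executionGraph; // assigned last in the constructor

        Scheduler() {
            report(this::getNumberOfRestarts); // 'this' escapes here
            this.executionGraph = "graph";     // too late for the report above
        }

        long getNumberOfRestarts() {
            return executionGraph.length(); // NPE if called before assignment
        }
    }

    public static void main(String[] args) {
        new Scheduler(); // prints: NPE: gauge read before construction finished
    }
}
```

The usual remedy is to register such callbacks only after the constructor has finished, e.g. in a separate start method.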





[jira] [Created] (FLINK-14682) Enable AbstractTaskManagerProcessFailureRecoveryTest to pass with new DefaultScheduler

2019-11-08 Thread Gary Yao (Jira)
Gary Yao created FLINK-14682:


 Summary: Enable AbstractTaskManagerProcessFailureRecoveryTest to 
pass with new DefaultScheduler
 Key: FLINK-14682
 URL: https://issues.apache.org/jira/browse/FLINK-14682
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Investigate why {{testTaskManagerProcessFailure()}} fails, and fix the 
issues.





[jira] [Created] (FLINK-14681) Enable YARNHighAvailabilityITCase to pass with new DefaultScheduler

2019-11-08 Thread Gary Yao (Jira)
Gary Yao created FLINK-14681:


 Summary: Enable YARNHighAvailabilityITCase to pass with new 
DefaultScheduler
 Key: FLINK-14681
 URL: https://issues.apache.org/jira/browse/FLINK-14681
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


Investigate why {{testKillYarnSessionClusterEntrypoint}} and 
{{testJobRecoversAfterKillingTaskManager}} fail and fix issues.





[jira] [Created] (FLINK-14680) Enable KafkaConsumerTestBase#runFailOnNoBrokerTest to pass with new DefaultScheduler

2019-11-08 Thread Gary Yao (Jira)
Gary Yao created FLINK-14680:


 Summary: Enable KafkaConsumerTestBase#runFailOnNoBrokerTest to 
pass with new DefaultScheduler
 Key: FLINK-14680
 URL: https://issues.apache.org/jira/browse/FLINK-14680
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


{{KafkaConsumerTestBase#runFailOnNoBrokerTest}} makes assumptions about the causal 
chain of the {{JobExecutionException}}. In particular, it assumes that the 
exception caused by user code is the direct cause of the {{JobExecutionException}}. 
However, this is no longer true when using the {{DefaultScheduler}}, which 
wraps the exception in a {{JobException}} that additionally specifies the 
reason for suppressing job recovery.

The code in question is listed below:
{code:java}
} catch (JobExecutionException jee) {
	if (kafkaServer.getVersion().equals("0.9") ||
			kafkaServer.getVersion().equals("0.10") ||
			kafkaServer.getVersion().equals("0.11") ||
			kafkaServer.getVersion().equals("2.0")) {
		assertTrue(jee.getCause() instanceof TimeoutException);

		TimeoutException te = (TimeoutException) jee.getCause();

		assertEquals("Timeout expired while fetching topic metadata", te.getMessage());
	} else {
		assertTrue(jee.getCause() instanceof RuntimeException);

		RuntimeException re = (RuntimeException) jee.getCause();

		assertTrue(re.getMessage().contains("Unable to retrieve any partitions"));
	}
}
{code}
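A more robust assertion walks the entire cause chain rather than checking only the direct cause. Flink ships a helper for this ({{org.apache.flink.util.ExceptionUtils#findThrowable}}); the sketch below re-implements the idea in plain Java so it stands alone:

```java
import java.util.Optional;

// Plain-Java sketch of walking the whole cause chain instead of asserting
// on the direct cause only. Flink's org.apache.flink.util.ExceptionUtils
// provides a similar findThrowable helper.
class CauseChainDemo {

    static Optional<Throwable> findThrowable(Throwable t, Class<? extends Throwable> type) {
        while (t != null) {
            if (type.isInstance(t)) {
                return Optional.of(t);
            }
            t = t.getCause();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // With the DefaultScheduler, the user exception is wrapped in an
        // intermediate exception, so it is no longer the direct cause.
        Throwable user = new RuntimeException("Unable to retrieve any partitions");
        Throwable jobException = new Exception("recovery suppressed", user);
        Throwable jee = new Exception("JobExecutionException", jobException);

        System.out.println(jee.getCause() instanceof RuntimeException);              // prints: false
        System.out.println(findThrowable(jee, RuntimeException.class).isPresent());  // prints: true
    }
}
```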





[jira] [Created] (FLINK-14651) Set default value of config option jobmanager.scheduler to "ng"

2019-11-07 Thread Gary Yao (Jira)
Gary Yao created FLINK-14651:


 Summary: Set default value of config option jobmanager.scheduler 
to "ng"
 Key: FLINK-14651
 URL: https://issues.apache.org/jira/browse/FLINK-14651
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0








[jira] [Created] (FLINK-14636) Handle schedule mode LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST correctly in DefaultScheduler

2019-11-06 Thread Gary Yao (Jira)
Gary Yao created FLINK-14636:


 Summary: Handle schedule mode 
LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST correctly in DefaultScheduler
 Key: FLINK-14636
 URL: https://issues.apache.org/jira/browse/FLINK-14636
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.10.0
Reporter: Gary Yao
Assignee: Gary Yao
 Fix For: 1.10.0


It should be possible to schedule a job with 
{{ScheduleMode.LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST}}.





[jira] [Created] (FLINK-14632) ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation deadlocks

2019-11-06 Thread Gary Yao (Jira)
Gary Yao created FLINK-14632:


 Summary: 
ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation deadlocks
 Key: FLINK-14632
 URL: https://issues.apache.org/jira/browse/FLINK-14632
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


https://api.travis-ci.com/v3/job/253364947/log.txt

{noformat}
"main" #1 prio=5 os_prio=0 tid=0x7fac3c00b800 nid=0x956 waiting on 
condition [0x7fac449f1000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x8eb74600> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.runtime.executiongraph.ExecutionGraphSchedulingTest.testSlotReleasingFailsSchedulingOperation(ExecutionGraphSchedulingTest.java:501)
{noformat}
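Not the actual fix, but a sketch of why bounded waits matter in tests like this: replacing a bare {{CompletableFuture.get()}} with a timed {{get()}} turns a hang into a diagnosable {{TimeoutException}} instead of a parked main thread. Class name and timeout value below are illustrative only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedGetExample {
    public static void main(String[] args) throws Exception {
        // A future that is never completed, standing in for the stuck scheduling operation.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        try {
            // A bare get() would park the caller forever, as in the stack trace above;
            // the timed variant fails fast with a TimeoutException.
            neverCompleted.get(100, TimeUnit.MILLISECONDS);
            throw new AssertionError("expected a timeout");
        } catch (TimeoutException expected) {
            System.out.println("timed out instead of deadlocking");
        }
    }
}
```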





[jira] [Created] (FLINK-14628) Wordcount on Docker test (custom fs plugin) fails on Travis

2019-11-06 Thread Gary Yao (Jira)
Gary Yao created FLINK-14628:


 Summary: Wordcount on Docker test (custom fs plugin) fails on 
Travis
 Key: FLINK-14628
 URL: https://issues.apache.org/jira/browse/FLINK-14628
 Project: Flink
  Issue Type: Bug
  Components: FileSystems, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


https://api.travis-ci.org/v3/job/607616429/log.txt

{noformat}
Successfully tagged test_docker_embedded_job:latest
~/build/apache/flink
sort: cannot read: 
'/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53405131685/out/docker_wc_out*':
 No such file or directory
FAIL WordCount: Output hash mismatch.  Got d41d8cd98f00b204e9800998ecf8427e, 
expected 72a690412be8928ba239c2da967328a5.
head hexdump of actual:
head: cannot open 
'/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-53405131685/out/docker_wc_out*'
 for reading: No such file or directory
[FAIL] Test script contains errors.
Checking for errors...
No errors in log files.
Checking for exceptions...
No exceptions in log files.
Checking for non-empty .out files...
grep: 
/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/log/*.out:
 No such file or directory
No non-empty .out files.

{noformat}





[ANNOUNCE] Progress of Apache Flink 1.10 #2

2019-11-01 Thread Gary Yao
Hi community,

Because we have approximately one month of development time left until the
targeted Flink 1.10 feature freeze, we thought now would be a good time to
give another progress update. Below we have included a list of the ongoing
efforts that have made progress since our last release progress update [1].
As always, if you are working on something that is not included here, feel
free to use this thread to share your progress.

- Support Java 11 [2]
- Implementation is in progress (18/21 subtasks resolved)

- Table API improvements
- Full Data Type Support in Planner [3]
- Implementing (1/8 subtasks resolved)
- FLIP-66 Support Time Attribute in SQL DDL [4]
- Implementation is in progress (1/7 subtasks resolved).
- FLIP-70 Support Computed Column [5]
- FLIP voting [6]
- FLIP-63 Rework Table Partition Support [7]
- Implementation is in progress (3/15 subtasks resolved).
- FLIP-51 Rework of Expression Design [8]
- Implementation is in progress (2/12 subtasks resolved).
- FLIP-64 Support for Temporary Objects in Table Module [9]
- Implementation is in progress

- Hive compatibility completion (DDL/UDF) to support full Hive integration
- FLIP-57 Rework FunctionCatalog [10]
- Implementation is in progress (6/9 subtasks resolved)
- FLIP-68 Extend Core Table System with Modular Plugins [11]
- Implementation is in progress (2/8 subtasks resolved)

- Finer grained resource management
- FLIP-49: Unified Memory Configuration for TaskExecutors [12]
- Implementation is in progress (6/10 subtasks resolved)
- FLIP-53: Fine Grained Operator Resource Management [13]
- Implementation is in progress (1/9 subtasks resolved)

- Finish scheduler re-architecture [14]
- Integration tests are being enabled for new scheduler

- Executor/Client refactoring [15]
- FLIP-81: Executor-related new ConfigOptions [16]
- done
- FLIP-73: Introducing Executors for job submission [17]
- Implementation is in progress

- FLIP-36 Support Interactive Programming [18]
- Is built on top of FLIP-67 [19], which has been accepted
- Implementation in progress

- FLIP-58: Flink Python User-Defined Stateless Function for Table [20]
- Implementation is in progress (12/22 subtasks resolved)
- FLIP-50: Spill-able Heap Keyed State Backend [21]
- Implementation is in progress (2/11 subtasks resolved)

- RocksDB Backend Memory Control [22]
- FLIP for resource management on state backend will be opened soon
- Write Buffer Manager will be backported to FRocksDB due to
performance regression [23] in new RocksDB versions

- Unaligned Checkpoints
- FLIP-76 [24] was published and received positive feedback
- Implementation is in progress

- Separate framework and user class loader in per-job mode [25]
- First PR is almost done. Remaining PRs will be ready next week

- Active Kubernetes Integration [26]
- Implementation is in progress (6/11 in review, 3/11 in progress, 2/11
todo)

- FLIP-39 Flink ML pipeline and ML libs [27]
- A few abstract ML classes have been merged (FLINK-13339, FLINK-13513)
- Starting review of algorithms

Again, the feature freeze is targeted for the end of November. Please
make sure that all important work threads can be completed by that date.
Feel free to use this thread to communicate any concerns about features that
might not be finished by then. We will send another announcement later in
the release cycle to make the date of the feature freeze official.

Best,
Yu & Gary

[1] https://s.apache.org/wc0dc
[2] https://issues.apache.org/jira/browse/FLINK-10725
[3] https://issues.apache.org/jira/browse/FLINK-14079
[4]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-66%3A+Support+time+attribute+in+SQL+DDL
[5]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-70%3A+Flink+SQL+Computed+Column+Design
[6]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-FLIP-70-Flink-SQL-Computed-Column-Design-td34385.html
[7]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-63%3A+Rework+table+partition+support
[8]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
[9]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-64%3A+Support+for+Temporary+Objects+in+Table+module
[10]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
[11]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Pluggable+Modules
[12]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
[13]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-53%3A+Fine+Grained+Operator+Resource+Management
[14] https://issues.apache.org/jira/browse/FLINK-10429
[15]
https://lists.apache.org/thread.html/ce99cba4a10b9dc40eb729d39910f315ae41d80ec74f09a356c73938@%3Cdev.flink.apache.org%3E
[16]

[jira] [Created] (FLINK-14592) FlinkKafkaInternalProducerITCase fails with BindException

2019-11-01 Thread Gary Yao (Jira)
Gary Yao created FLINK-14592:


 Summary: FlinkKafkaInternalProducerITCase fails with BindException
 Key: FLINK-14592
 URL: https://issues.apache.org/jira/browse/FLINK-14592
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


FlinkKafkaInternalProducerITCase fails with java.net.BindException: Address 
already in use.

Logs: https://api.travis-ci.org/v3/job/605478801/log.txt

{noformat}
02:04:04.878 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 8.822 s <<< FAILURE! - in 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase
02:04:04.882 [ERROR] 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase  
Time elapsed: 8.822 s  <<< ERROR!
org.apache.kafka.common.KafkaException: Socket server failed to bind to 
0.0.0.0:38437: Address already in use.
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase.prepare(FlinkKafkaInternalProducerITCase.java:59)
Caused by: java.net.BindException: Address already in use
at 
org.apache.flink.streaming.connectors.kafka.FlinkKafkaInternalProducerITCase.prepare(FlinkKafkaInternalProducerITCase.java:59)

{noformat}
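As background (an assumption about a common remedy, not the actual patch): tests that bind a fixed port can race with other processes on a shared CI worker. Asking the OS for an ephemeral port by binding to port 0 sidesteps the BindException; the class name below is illustrative.

```java
import java.net.ServerSocket;

public class EphemeralPortExample {
    public static void main(String[] args) throws Exception {
        // Port 0 asks the kernel for any currently free port, so parallel
        // builds on the same machine cannot collide on a hardcoded port.
        try (ServerSocket socket = new ServerSocket(0)) {
            int port = socket.getLocalPort();
            if (port <= 0) {
                throw new AssertionError("expected a positive ephemeral port");
            }
            System.out.println("bound to free port " + port);
        }
    }
}
```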





[jira] [Created] (FLINK-14587) HiveTableSourceTest timeouts on Travis

2019-10-31 Thread Gary Yao (Jira)
Gary Yao created FLINK-14587:


 Summary: HiveTableSourceTest timeouts on Travis
 Key: FLINK-14587
 URL: https://issues.apache.org/jira/browse/FLINK-14587
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


{{HiveTableSourceTest}} failed in nightly tests running on Travis (stage: misc 
- scala 2.12)

https://travis-ci.org/apache/flink/builds/604945292?utm_source=slack&utm_medium=notification

https://api.travis-ci.org/v3/job/604945309/log.txt

{noformat}
15:54:17.188 [INFO] Results:
15:54:17.188 [INFO] 
15:54:17.188 [ERROR] Errors: 
15:54:17.188 [ERROR]   
HiveTableSourceTest.org.apache.flink.connectors.hive.HiveTableSourceTest » 
Timeout
15:54:17.188 [INFO] 
15:54:17.188 [ERROR] Tests run: 232, Failures: 0, Errors: 1, Skipped: 1
15:54:17.188 [INFO] 
{noformat}





[jira] [Created] (FLINK-14576) Elasticsearch (v6.3.1) sink end-to-end test instable

2019-10-30 Thread Gary Yao (Jira)
Gary Yao created FLINK-14576:


 Summary: Elasticsearch (v6.3.1) sink end-to-end test instable
 Key: FLINK-14576
 URL: https://issues.apache.org/jira/browse/FLINK-14576
 Project: Flink
  Issue Type: Bug
  Components: Connectors / ElasticSearch, Tests
Reporter: Gary Yao


https://api.travis-ci.org/v3/job/604417300/log.txt

{noformat}
2019-10-29 17:12:13,668 ERROR 
org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase  - 
Failed Elasticsearch item request: [:intentional invalid index:] 
ElasticsearchException[Elasticsearch exception 
[type=invalid_index_name_exception, reason=Invalid index name [:intentional 
invalid index:], must not contain the following characters [ , ", *, \, <, |, 
,, >, /, ?]]]
[:intentional invalid index:] ElasticsearchException[Elasticsearch exception 
[type=invalid_index_name_exception, reason=Invalid index name [:intentional 
invalid index:], must not contain the following characters [ , ", *, \, <, |, 
,, >, /, ?]]]
at 
org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:510)
at 
org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:421)
at 
org.elasticsearch.action.bulk.BulkItemResponse.fromXContent(BulkItemResponse.java:135)
at 
org.elasticsearch.action.bulk.BulkResponse.fromXContent(BulkResponse.java:198)
at 
org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:653)
at 
org.elasticsearch.client.RestHighLevelClient.lambda$performRequestAsyncAndParseEntity$3(RestHighLevelClient.java:549)
at 
org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:580)
at 
org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:621)
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:375)
at org.elasticsearch.client.RestClient$1.completed(RestClient.java:366)
at 
org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
at 
org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
at 
org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
at 
org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
at 
org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
at 
org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
at 
org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
at 
org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
at 
org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
at 
org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
at 
org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
at 
org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
at 
org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at 
org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
at java.lang.Thread.run(Thread.java:748)
2019-10-29 17:12:13,753 ERROR 
org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase  - 
Failed Elasticsearch item request: [:intentional invalid index:] 
ElasticsearchException[Elasticsearch exception 
[type=invalid_index_name_exception, reason=Invalid index name [:intentional 
invalid index:], must not contain the following characters [ , ", *, \, <, |, 
,, >, /, ?]]]
[:intentional invalid index:] ElasticsearchException[Elasticsearch exception 
[type=invalid_index_name_exception, reason=Invalid index name [:intentional 
invalid index:], must not contain the following characters [ , ", *, \, <, |, 
,, >, /, ?]]]
at 
org.elasticsearch.ElasticsearchException.innerFromXContent(ElasticsearchException.java:510)
at 
org.elasticsearch.ElasticsearchException.fromXContent(ElasticsearchException.java:421)
at 
org.elasticsearch.action.bulk.BulkItemResponse.fromXContent(BulkItemResponse.java:135)
at 
org.elasticsearch.action.bulk.BulkResponse.fromXContent(BulkResponse.java:198)
at 
org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:653)
at 
org.elasticsearch.client.RestHighLevelClient.lambda$performRequestAsyncAndParseEntity$3(RestHighLevelClient.java:549)
at 
org.elasticsearch.client.RestHighLevelCli

[jira] [Created] (FLINK-14572) BlobsCleanupITCase failed in Travis stage core - scheduler_ng

2019-10-30 Thread Gary Yao (Jira)
Gary Yao created FLINK-14572:


 Summary: BlobsCleanupITCase failed in Travis stage core - 
scheduler_ng
 Key: FLINK-14572
 URL: https://issues.apache.org/jira/browse/FLINK-14572
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao
 Fix For: 1.10.0


{noformat}
java.lang.AssertionError: 

Expected: is 
 but: was 
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanup(BlobsCleanupITCase.java:220)
at 
org.apache.flink.runtime.jobmanager.BlobsCleanupITCase.testBlobServerCleanupFinishedJob(BlobsCleanupITCase.java:133)
{noformat}





[jira] [Created] (FLINK-14555) Streaming File Sink s3 end-to-end test stalls

2019-10-29 Thread Gary Yao (Jira)
Gary Yao created FLINK-14555:


 Summary: Streaming File Sink s3 end-to-end test stalls
 Key: FLINK-14555
 URL: https://issues.apache.org/jira/browse/FLINK-14555
 Project: Flink
  Issue Type: Bug
  Components: Connectors / FileSystem, Tests
Affects Versions: 1.10.0
Reporter: Gary Yao


[https://api.travis-ci.org/v3/job/603882577/log.txt]

{noformat}
==
Running 'Streaming File Sink s3 end-to-end test'
==
TEST_DATA_DIR: 
/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-36388677539
Flink dist directory: 
/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT
Found AWS bucket [secure], running the e2e test.
Found AWS access key, running the e2e test.
Found AWS secret key, running the e2e test.
Executing test with dynamic openSSL linkage (random selection between 'dynamic' 
and 'static')
Setting up SSL with: internal OPENSSL dynamic
Using SAN 
dns:travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866,ip:10.20.1.215,ip:172.17.0.1
Certificate was added to keystore
Certificate was added to keystore
Certificate reply was installed in keystore
MAC verified OK
Setting up SSL with: rest OPENSSL dynamic
Using SAN 
dns:travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866,ip:10.20.1.215,ip:172.17.0.1
Certificate was added to keystore
Certificate was added to keystore
Certificate reply was installed in keystore
MAC verified OK
Mutual ssl auth: true
Use s3 output
Starting cluster.
Starting standalonesession daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Waiting for dispatcher REST endpoint to come up...
Dispatcher REST endpoint is up.
[INFO] 1 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 2 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 3 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Submitting job.
Job (c3a9bb7d3f47d63ebccbec5acb1342cb) is running.
Waiting for job (c3a9bb7d3f47d63ebccbec5acb1342cb) to have at least 3 completed 
checkpoints ...
Killing TM
TaskManager 9227 killed.
Starting TM
[INFO] 3 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for restart to happen
Killing 2 TMs
TaskManager 8618 killed.
TaskManager 9658 killed.
Starting 2 TMs
[INFO] 2 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
[INFO] 3 instance(s) of taskexecutor are already running on 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Starting taskexecutor daemon on host 
travis-job-4624d10f-b0e3-45e1-8615-972ff1d44866.
Waiting for restart to happen
Waiting until all values have been produced
Number of produced values 18080/6


No output has been received in the last 10m0s, this potentially indicates a 
stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: 
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

The build has been terminated
{noformat}
 





[REMINDER] Ensuring build stability

2019-10-18 Thread Gary Yao
Hi community,

We created the bui...@flink.apache.org mailing list as an effort to be more
transparent about build instabilities, especially about issues that were
only
surfacing in CRON builds at the time, such as, Scala 2.12 and Java 9
compilation errors.

While the situation has gotten better, there are still occasional issues
concerning the CRON builds that are reported late (e.g., [1][2]). The mailing
list was created almost 2 months ago. However, last time Chesnay checked,
there were only 21 subscribers. To maintain the high quality of our releases,
we should aim to keep the master and the release branches stable at all
times. Therefore, I encourage everyone who is developing Flink to monitor the
builds mailing list. To subscribe, all you need to do is send an empty
email to:

builds-subscr...@flink.apache.org

Best,
Gary

[1] https://issues.apache.org/jira/browse/FLINK-14186
[2] https://issues.apache.org/jira/browse/FLINK-14226


[jira] [Created] (FLINK-14389) Restore task state in new DefaultScheduler

2019-10-14 Thread Gary Yao (Jira)
Gary Yao created FLINK-14389:


 Summary: Restore task state in new DefaultScheduler
 Key: FLINK-14389
 URL: https://issues.apache.org/jira/browse/FLINK-14389
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Gary Yao
Assignee: Gary Yao


The new {{DefaultScheduler}} should restore the state of restarted tasks.





[jira] [Created] (FLINK-14318) JDK11 build stalls during shading

2019-10-03 Thread Gary Yao (Jira)
Gary Yao created FLINK-14318:


 Summary: JDK11 build stalls during shading
 Key: FLINK-14318
 URL: https://issues.apache.org/jira/browse/FLINK-14318
 Project: Flink
  Issue Type: Bug
  Components: Build System
Reporter: Gary Yao


JDK11 build stalls during shading.

Travis stage: e2e - misc - jdk11

https://travis-ci.org/apache/flink/builds/593022581?utm_source=slack&utm_medium=notification

https://api.travis-ci.org/v3/job/593022629/log.txt

Relevant excerpt from logs:
{noformat}
01:53:43.889 [INFO] 

01:53:43.889 [INFO] Building flink-metrics-reporter-prometheus-test 
1.10-SNAPSHOT
01:53:43.889 [INFO] 

...
01:53:44.508 [INFO] Including org.apache.flink:force-shading:jar:1.10-SNAPSHOT 
in the shaded jar.
01:53:44.508 [INFO] Excluding org.slf4j:slf4j-api:jar:1.7.15 from the shaded 
jar.
01:53:44.508 [INFO] Excluding com.google.code.findbugs:jsr305:jar:1.3.9 from 
the shaded jar.
01:53:44.508 [INFO] No artifact matching filter io.netty:netty
01:53:44.522 [INFO] Replacing original artifact with shaded artifact.
01:53:44.523 [INFO] Replacing 
/home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT.jar
 with 
/home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded.jar
01:53:44.524 [INFO] Replacing original test artifact with shaded test artifact.
01:53:44.524 [INFO] Replacing 
/home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-tests.jar
 with 
/home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/flink-metrics-reporter-prometheus-test-1.10-SNAPSHOT-shaded-tests.jar
01:53:44.524 [INFO] Dependency-reduced POM written at: 
/home/travis/build/apache/flink/flink-end-to-end-tests/flink-metrics-reporter-prometheus-test/target/dependency-reduced-pom.xml
{noformat}

No output has been received in the last 10m0s, this potentially indicates a 
stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: 
https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

The build has been terminated





