[jira] [Created] (FLINK-22520) KafkaSourceLegacyITCase.testMultipleSourcesOnePartition hangs

2021-04-28 Thread Guowei Ma (Jira)
Guowei Ma created FLINK-22520:
-

 Summary: KafkaSourceLegacyITCase.testMultipleSourcesOnePartition 
hangs
 Key: FLINK-22520
 URL: https://issues.apache.org/jira/browse/FLINK-22520
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.14.0
Reporter: Guowei Ma


There are no error messages.
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17363=logs=c5f0071e-1851-543e-9a45-9ac140befc32=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5=42023


{code:java}
"main" #1 prio=5 os_prio=0 tid=0x7f4d3400b000 nid=0x203f waiting on 
condition [0x7f4d3be2e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xa68f3b68> (a 
java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:49)
at 
org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.runMultipleSourcesOnePartitionExactlyOnceTest(KafkaConsumerTestBase.java:1112)
at 
org.apache.flink.connector.kafka.source.KafkaSourceLegacyITCase.testMultipleSourcesOnePartition(KafkaSourceLegacyITCase.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)

{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Zhu Zhu
+1 (binding)
 - checked signatures and checksums
 - built from source
 - ran example jobs with parallelism=8000 on a YARN cluster. checked output
and logs.
 - the website PR looks good

Thanks,
Zhu

Matthias Pohl wrote on Thu, Apr 29, 2021 at 2:34 AM:

> Thanks Dawid and Guowei for managing this release.
>
> - downloaded the sources and binaries and checked the checksums
> - built Flink from the downloaded sources
> - executed example jobs with standalone deployments - I didn't find
> anything suspicious in the logs
> - reviewed release announcement pull request
>
> - I did a pass over dependency updates: git diff release-1.12.2
> release-1.13.0-rc2 */*.xml
> There's one thing someone should double-check, whether that's supposed to be
> like that: We added org.conscrypt:conscrypt-openjdk-uber:2.5.1 as a
> dependency but I don't see it being reflected in the NOTICE file of the
> flink-python module. Or is this automatically added later on?
>
> +1 (non-binding; please see remark on dependency above)
>
> Matthias
>
> On Wed, Apr 28, 2021 at 1:52 PM Stephan Ewen  wrote:
>
> > Glad to hear that outcome. And no worries about the false alarm.
> > Thank you for doing thorough testing, this is very helpful!
> >
> > On Wed, Apr 28, 2021 at 1:04 PM Caizhi Weng 
> wrote:
> >
> > > After the investigation we found that this issue is caused by the
> > > implementation of connector, not by the Flink framework.
> > >
> > > Sorry for the false alarm.
> > >
> > > Stephan Ewen wrote on Wed, Apr 28, 2021 at 3:23 PM:
> > >
> > > > @Caizhi and @Becket - let me reach out to you to jointly debug this
> > > issue.
> > > >
> > > > I am wondering if there is some incorrect reporting of failed events?
> > > >
> > > > On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng 
> > > wrote:
> > > >
> > > > > -1
> > > > >
> > > > > We're testing this version on batch jobs with large (600~1000)
> > > > parallelisms
> > > > > and the following exception messages appear with high frequency:
> > > > >
> > > > > 2021-04-27 21:27:26
> > > > > org.apache.flink.util.FlinkException: An OperatorEvent from an
> > > > > OperatorCoordinator to a task was lost. Triggering task failover to
> > > > ensure
> > > > > consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> > > > > execution #0
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> > > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> > > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> > > > > at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> > > > > at akka.japi.pf
> > .UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > > at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> > > > > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> > > > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> > > > > at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> > > > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> > > > > at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> > > > > at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> > > > > at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > > > at
> > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > >
> > > > > Becket Qin is investigating this issue.
> > > > >
> > > >
> > >
>


[jira] [Created] (FLINK-22519) Have python-archives also take tar.gz

2021-04-28 Thread Yik San Chan (Jira)
Yik San Chan created FLINK-22519:


 Summary: Have python-archives also take tar.gz
 Key: FLINK-22519
 URL: https://issues.apache.org/jira/browse/FLINK-22519
 Project: Flink
  Issue Type: New Feature
  Components: API / Python
Reporter: Yik San Chan


[python-archives|https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/python_config.html#python-archives]
 currently only takes zip.

 

In our use case, we want to package the whole conda environment into 
python-archives, similar to what the 
[docs|https://ci.apache.org/projects/flink/flink-docs-stable/dev/python/faq.html#cluster]
 suggest for a venv (Python virtual environment). As we use PyFlink for 
ML, there are inevitably a few large dependencies (tensorflow, torch, pyarrow), 
as well as a lot of small dependencies.

 

This pattern is not zip-friendly. According to this 
[post|https://superuser.com/a/173825], zip compresses each file independently, 
and it does not perform well when dealing with a lot of small files. On the 
other hand, tar simply bundles all files into a tarball, and we can then apply gzip 
to the whole tarball to achieve a smaller size. This may explain why the official 
packaging tool, [conda pack|https://conda.github.io/conda-pack/], 
produces tar.gz by default, even though zip is available as an option.

 

To further prove the idea, I ran an experiment on my laptop with a conda env. 
My OS: macOS 10.15.7
 # Create an environment.yaml as well as a requirements.txt
 # Run `conda env create -f environment.yaml` to create the conda env
 # Run conda pack to produce a tar.gz
 # Run conda pack again with featflow-ml-env.zip as the output to produce a zip

 

# environment.yaml

 
name: featflow-ml-env
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- python=3.7
- pytorch=1.8.0
- scikit-learn=0.23.2
- pip
- pip:
- -r file:requirements.txt
 
#requirements.txt
apache-flink==1.12.0
deepctr-torch==0.2.6
black==20.8b1
confluent-kafka==1.6.0
pytest==6.2.2
testcontainers==3.4.0
kafka-python==2.0.2
 
End result: the tar.gz is 854M, the zip is 1.6G

 

So, long story short, python-archives only supports zip, while zip is not a good 
choice for packaging ML libs. Let's change this by adding tar.gz support to 
python-archives.

 

The change would work like this: in ProcessPythonEnvironmentManager.java, check 
the archive suffix. If it is tar.gz, unarchive it using a gzip decompressor (see the sketch below).
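
A minimal sketch of what that suffix check could look like, assuming the Apache Commons Compress tar/gzip classes are available on the classpath (an assumption; the class and helper names are illustrative, not the actual patch):
{code:java}
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class ArchiveExtractionSketch {

    // Hypothetical helper: dispatch on the archive suffix and unpack tar.gz via a gzip decompressor.
    static void extractArchive(File archive, File targetDir) throws IOException {
        String name = archive.getName();
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            try (TarArchiveInputStream tar = new TarArchiveInputStream(
                    new GzipCompressorInputStream(
                            new BufferedInputStream(new FileInputStream(archive))))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    File out = new File(targetDir, entry.getName());
                    if (entry.isDirectory()) {
                        out.mkdirs();
                    } else {
                        out.getParentFile().mkdirs();
                        // The stream is positioned at the current entry's data.
                        Files.copy(tar, out.toPath(), StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        } else {
            extractZip(archive, targetDir); // the existing zip path (placeholder)
        }
    }

    static void extractZip(File archive, File targetDir) {
        // placeholder for the existing zip handling
    }
}
{code}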



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22518) Translate the page of "High Availability (HA)" into Chinese

2021-04-28 Thread movesan (Jira)
movesan created FLINK-22518:
---

 Summary: Translate the page of "High Availability (HA)" into 
Chinese
 Key: FLINK-22518
 URL: https://issues.apache.org/jira/browse/FLINK-22518
 Project: Flink
  Issue Type: New Feature
  Components: chinese-translation
Reporter: movesan


 

The "High Availability (HA)" module contains the following three pages: 

[https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/deployment/ha]

[https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/deployment/ha/zookeeper_ha.html]

[https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/deployment/ha/kubernetes_ha.html]

The English markdown files can be found at 
[https://github.com/apache/flink/tree/master/docs/content.zh/docs/deployment/ha].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Guowei Ma
Hi, Matthias

Thank you very much for your careful inspection.
I checked the flink-python_2.11-1.13.0.jar and we do not bundle
org.conscrypt:conscrypt-openjdk-uber:2.5.1 in it.
So I think we may not need to add this to the NOTICE file. (BTW, the jar's
scope is runtime.)

Best,
Guowei


On Thu, Apr 29, 2021 at 2:33 AM Matthias Pohl 
wrote:

> Thanks Dawid and Guowei for managing this release.
>
> - downloaded the sources and binaries and checked the checksums
> - built Flink from the downloaded sources
> - executed example jobs with standalone deployments - I didn't find
> anything suspicious in the logs
> - reviewed release announcement pull request
>
> - I did a pass over dependency updates: git diff release-1.12.2
> release-1.13.0-rc2 */*.xml
> There's one thing someone should double-check, whether that's supposed to be
> like that: We added org.conscrypt:conscrypt-openjdk-uber:2.5.1 as a
> dependency but I don't see it being reflected in the NOTICE file of the
> flink-python module. Or is this automatically added later on?
>
> +1 (non-binding; please see remark on dependency above)
>
> Matthias
>
> On Wed, Apr 28, 2021 at 1:52 PM Stephan Ewen  wrote:
>
> > Glad to hear that outcome. And no worries about the false alarm.
> > Thank you for doing thorough testing, this is very helpful!
> >
> > On Wed, Apr 28, 2021 at 1:04 PM Caizhi Weng 
> wrote:
> >
> > > After the investigation we found that this issue is caused by the
> > > implementation of connector, not by the Flink framework.
> > >
> > > Sorry for the false alarm.
> > >
> > > Stephan Ewen wrote on Wed, Apr 28, 2021 at 3:23 PM:
> > >
> > > > @Caizhi and @Becket - let me reach out to you to jointly debug this
> > > issue.
> > > >
> > > > I am wondering if there is some incorrect reporting of failed events?
> > > >
> > > > On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng 
> > > wrote:
> > > >
> > > > > -1
> > > > >
> > > > > We're testing this version on batch jobs with large (600~1000)
> > > > parallelisms
> > > > > and the following exception messages appear with high frequency:
> > > > >
> > > > > 2021-04-27 21:27:26
> > > > > org.apache.flink.util.FlinkException: An OperatorEvent from an
> > > > > OperatorCoordinator to a task was lost. Triggering task failover to
> > > > ensure
> > > > > consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> > > > > execution #0
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> > > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> > > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> > > > > at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> > > > > at akka.japi.pf
> > .UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > > at
> > scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > > at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> > > > > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> > > > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> > > > > at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> > > > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> > > > > at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> > > > > at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> > > > > at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > > > at
> > > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > > at
> > > > >
> > > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > > >
> > > > > Becket Qin is investigating this 

[jira] [Created] (FLINK-22517) Fix pickle compatibility problem in different Python versions

2021-04-28 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-22517:


 Summary: Fix pickle compatibility problem in different Python 
versions
 Key: FLINK-22517
 URL: https://issues.apache.org/jira/browse/FLINK-22517
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.12.3, 1.13.0
Reporter: Huang Xingbo
Assignee: Huang Xingbo


Since release-1.12, PyFlink has supported Python 3.8. Starting from Python 3.8, 
the default protocol version used by pickle is 
pickle protocol 5 (https://www.python.org/dev/peps/pep-0574/), which will raise the 
following exception if the client uses Python 3.8 to compile the program and the 
cluster node uses Python 3.7 or 3.6 to run the Python UDF:

{code:python}
ValueError: unsupported pickle protocol: 5
{code}

The workaround is to make sure the Python version used by the client is 3.6 or 
3.7. For how to specify the client-side Python execution environment, please 
refer to the 
doc (https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/python/python_config.html#python-client-executable).




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22516) ResourceManager cannot establish leadership

2021-04-28 Thread Ricky Burnett (Jira)
Ricky Burnett created FLINK-22516:
-

 Summary: ResourceManager cannot establish leadership
 Key: FLINK-22516
 URL: https://issues.apache.org/jira/browse/FLINK-22516
 Project: Flink
  Issue Type: Bug
Reporter: Ricky Burnett
 Attachments: jobmanager_leadership.log

We are running Flink clusters with two JobManagers in HA mode. After a ZooKeeper 
restart, the two JMs begin leadership election and end up in a state where they are 
both trying to start their ResourceManager, until one of them writes to 
`leader//resource_manager_lock` and the JobMaster proceeds to execute 
`notifyOfNewResourceManagerLeader`, which restarts the ResourceManager. This in 
turn writes to `leader//resource_manager_lock`, which triggers the other 
JobMaster to restart its ResourceManager. We can see this in the logs from 
the "ResourceManager leader changed to new address" message, which goes back and 
forth between the two JMs and the two IP addresses. This cycle appears to 
continue indefinitely without outside intervention.

I've attached combined logs from two JMs in our environment that got into this 
state.  The logs start with the loss of connection and end with a couple of 
cycles of back and forth.   The two relevant hosts are 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7" and 
"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-mpf9x".

*-tsxb7 appears to be the last host that was granted leadership. 
{code:java}
{"thread":"Curator-Framework-0-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.jobmaster.JobManagerRunner","message":"JobManager
 runner for job tenant: ssademo, pipeline: 
828d4aa2-d4d4-457b-995d-feb56d08c1fb, name: integration-test-detection 
(33e12948df69077ab3b33316eacbb5e4) was granted leadership with session id 
97992805-9c60-40ba-8260-aaf036694cde at 
akka.tcp://flink@100.97.92.73:6123/user/jobmanager_3.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","instant":{"epochSecond":1617129712,"nanoOfSecond":44700},"contextMap":{},"threadId":152,"threadPriority":5,"source":{"class":"org.apache.flink.runtime.jobmaster.JobManagerRunner","method":"startJobMaster","file":"JobManagerRunner.java","line":313},"service":"streams","time":"2021-03-30T18:41:52.447UTC","hostname":"flink-jm-828d4aa2-d4d4-457b-995d-feb56d08c1fb-784cdb9c57-tsxb7"}
{code}
But  *-mpf9x continues to try to wrestle control back.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Matthias Pohl
Thanks Dawid and Guowei for managing this release.

- downloaded the sources and binaries and checked the checksums
- built Flink from the downloaded sources
- executed example jobs with standalone deployments - I didn't find
anything suspicious in the logs
- reviewed release announcement pull request

- I did a pass over dependency updates: git diff release-1.12.2
release-1.13.0-rc2 */*.xml
There's one thing someone should double-check, whether that's supposed to be
like that: We added org.conscrypt:conscrypt-openjdk-uber:2.5.1 as a
dependency but I don't see it being reflected in the NOTICE file of the
flink-python module. Or is this automatically added later on?

+1 (non-binding; please see remark on dependency above)

Matthias

On Wed, Apr 28, 2021 at 1:52 PM Stephan Ewen  wrote:

> Glad to hear that outcome. And no worries about the false alarm.
> Thank you for doing thorough testing, this is very helpful!
>
> On Wed, Apr 28, 2021 at 1:04 PM Caizhi Weng  wrote:
>
> > After the investigation we found that this issue is caused by the
> > implementation of connector, not by the Flink framework.
> >
> > Sorry for the false alarm.
> >
> > Stephan Ewen wrote on Wed, Apr 28, 2021 at 3:23 PM:
> >
> > > @Caizhi and @Becket - let me reach out to you to jointly debug this
> > issue.
> > >
> > > I am wondering if there is some incorrect reporting of failed events?
> > >
> > > On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng 
> > wrote:
> > >
> > > > -1
> > > >
> > > > We're testing this version on batch jobs with large (600~1000)
> > > parallelisms
> > > > and the following exception messages appear with high frequency:
> > > >
> > > > 2021-04-27 21:27:26
> > > > org.apache.flink.util.FlinkException: An OperatorEvent from an
> > > > OperatorCoordinator to a task was lost. Triggering task failover to
> > > ensure
> > > > consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> > > > execution #0
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> > > > at
> > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> > > > at
> > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> > > > at
> > > >
> > > >
> > >
> >
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> > > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> > > > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> > > > at akka.japi.pf
> .UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> > > > at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> > > > at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > > at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> > > > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> > > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> > > > at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> > > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> > > > at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> > > > at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> > > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > > at
> > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > > at
> > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > > at
> > > >
> > > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > >
> > > > Becket Qin is investigating this issue.
> > > >
> > >
> >


[jira] [Created] (FLINK-22515) Add Documentation for Flink Glue Schema Registry Integration

2021-04-28 Thread Linyu Yao (Jira)
Linyu Yao created FLINK-22515:
-

 Summary: Add Documentation for Flink Glue Schema Registry 
Integration
 Key: FLINK-22515
 URL: https://issues.apache.org/jira/browse/FLINK-22515
 Project: Flink
  Issue Type: New Feature
  Components: Documentation
Reporter: Linyu Yao


Add documentation for the Flink Glue Schema Registry integration to the Kafka 
Connector and Kinesis Connector pages:

[https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html#apache-kafka-connector]

[https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kinesis.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22514) TypeExtractor - Improving log message

2021-04-28 Thread Avishay Balderman (Jira)
Avishay Balderman created FLINK-22514:
-

 Summary: TypeExtractor - Improving log message 
 Key: FLINK-22514
 URL: https://issues.apache.org/jira/browse/FLINK-22514
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Reporter: Avishay Balderman


org.apache.flink.api.java.typeutils.TypeExtractor checks whether a field in a 
class is a "valid POJO field".

The method responsible for this is:
{code:java}
isValidPojoField
{code}
When isValidPojoField finds an invalid field, a log message is written (see 
below), but the log message does not say which field is invalid.

So the developer then has to figure out the "bad" field on their own.

Adding the field info to the log message is easy and saves the developer time.

 

 
{code:java}
for (Field field : fields) {
    Type fieldType = field.getGenericType();
    if (!isValidPojoField(field, clazz, typeHierarchy) && clazz != Row.class) {
        LOG.info("Class " + clazz + " cannot be used as a POJO type because not all fields are valid POJO fields, "
                + "and must be processed as GenericType. Please read the Flink documentation "
                + "on \"Data Types & Serialization\" for details of the effect on performance.");
        return null;
    }
}
{code}
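
For illustration, a minimal sketch of what the improved message could look like (not the actual patch; it only adds the field name and type from variables already in scope in the snippet above):
{code:java}
if (!isValidPojoField(field, clazz, typeHierarchy) && clazz != Row.class) {
    LOG.info("Class " + clazz + " cannot be used as a POJO type because the field '"
            + field.getName() + "' (of type " + fieldType + ") is not a valid POJO field, "
            + "and must be processed as GenericType. Please read the Flink documentation "
            + "on \"Data Types & Serialization\" for details of the effect on performance.");
    return null;
}
{code}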
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22513) Failed to upload compile artifacts

2021-04-28 Thread Dawid Wysakowicz (Jira)
Dawid Wysakowicz created FLINK-22513:


 Summary: Failed to upload compile artifacts
 Key: FLINK-22513
 URL: https://issues.apache.org/jira/browse/FLINK-22513
 Project: Flink
  Issue Type: Bug
  Components: Build System / Azure Pipelines
Affects Versions: 1.13.0
Reporter: Dawid Wysakowicz


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17343=logs=52b61abe-a3cc-5bde-cc35-1bbe89bb7df5=5e2d176c-a6d4-5b54-7808-b11714e363ad



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Does flink support multi cluster deployment with HA on multi k8s clusters

2021-04-28 Thread bhagi@R
Hi Team,

Our production requirement is to deploy Flink in multi-cluster mode,
i.e. deploying Flink cluster1 with HA on Kubernetes cluster1 in data center 1
and another Flink cluster2 with HA on Kubernetes cluster2 in data center 2.
If Flink cluster1 goes down on k8s cluster1 in DC1, it has to fail over to
Flink cluster2 on k8s cluster2 in DC2.

The failover has to happen through an automatic HA mechanism.
Please let us know whether this is possible,
or whether any solution exists to run Flink in multi-cluster mode with HA.

If there is any solution, please share the information.

Note: We currently deploy a Flink session cluster on standalone Kubernetes with
HA (Kubernetes HA).



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/


[jira] [Created] (FLINK-22512) Can't call current_timestamp with hive dialect for hive-3.1

2021-04-28 Thread Rui Li (Jira)
Rui Li created FLINK-22512:
--

 Summary: Can't call current_timestamp with hive dialect for 
hive-3.1
 Key: FLINK-22512
 URL: https://issues.apache.org/jira/browse/FLINK-22512
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Reporter: Rui Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: flink1.12.1 read hive orc table exception. could not initialize class org.apache.orc.impl.ZlibCodec

2021-04-28 Thread Rui Li
Hi WSJian,

The pictures can't be displayed. Please just copy/paste the stack trace.

And btw, the user mailing list (u...@flink.apache.org) is usually more
appropriate for such issues.

On Wed, Apr 28, 2021 at 11:36 AM WSJian <526165...@qq.com> wrote:

> flink lib jar:
>
> flink taskmanager log:
>
> hive orc table create ddl:
> create table xxx ... stored as orc
>
> flink java class, Join the hive Orc table and Kafka stream data:
>
> bsTableEnv.executeSql("my sql is join");
>
> flink pom.xml in the attachment.
>
>
> I converted the file type to textfile type to run, but orc throws this exception.
>
> Am I missing any dependency in the flink lib? Or is there a class file conflict?
>

-- 
Best regards!
Rui Li


[RESULT][VOTE] Release 1.12.3, release candidate #1

2021-04-28 Thread Arvid Heise
Dear devs,

I'm happy to announce that we have unanimously approved this release.

There are 3 approving votes, 3 of which are binding:
* Dian
* Dawid
* Robert

There are no disapproving votes.

Thanks everyone!

Your friendly release manager Arvid

On Wed, Apr 28, 2021 at 2:25 PM Arvid Heise  wrote:

> According to the latest exchange on "[VOTE] Release 1.13.0, release
> candidate #2", it was a false alarm, so I'm going ahead with the release.
>
> On Wed, Apr 28, 2021 at 9:26 AM Stephan Ewen  wrote:
>
>> A quick heads-up:
>>
>> A fix from 1.13.0 that I backported to 1.12.3 is apparently causing some
>> issues at larger batch scale.
>> We are investigating this, but it would affect this release as well.
>>
>> Please see the mail thread "[VOTE] Release 1.13.0, release candidate #2"
>> for details.
>>
>> If we want to make sure this release isn't affected, we revert that issue.
>> The tickets are:
>>   - https://issues.apache.org/jira/browse/FLINK-21996
>>   - https://issues.apache.org/jira/browse/FLINK-18071
>>
>> On Wed, Apr 28, 2021 at 7:59 AM Robert Metzger 
>> wrote:
>>
>> > +1 (binding)
>> >
>> > - started cluster, ran example job on macos
>> > - sources look fine
>> > - Eyeballed the diff:
>> >
>> https://github.com/apache/flink/compare/release-1.12.2...release-1.12.3-rc1
>> > .
>> > According to "git diff release-1.12.2...release-1.12.3-rc1 '*.xml'",
>> there
>> > was only one external dependency change (snappy-java, which seems to be
>> > properly reflected in the NOTICE file)
>> > - the last CI run of the "release-1.12" branch is okay-ish:
>> >
>> >
>> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17315=results
>> > (this one failure occurred:
>> > https://issues.apache.org/jira/browse/FLINK-20950)
>> >
>> >
>> > On Tue, Apr 27, 2021 at 5:03 PM Dawid Wysakowicz <
>> dwysakow...@apache.org>
>> > wrote:
>> >
>> > > +1 (binding)
>> > >
>> > > - Verified checksums and signatures
>> > > - Reviewed the website PR
>> > > - Built from sources
>> > > - verified dependency version upgrades and updates in NOTICE files
>> > > compared to 1.12.2
>> > > - started cluster and run WordCount example in BATCH mode and
>> everything
>> > > looked good
>> > >
>> > > On 23/04/2021 23:52, Arvid Heise wrote:
>> > > > Hi everyone,
>> > > > Please review and vote on the release candidate #1 for the version
>> > > 1.12.3,
>> > > > as follows:
>> > > > [ ] +1, Approve the release
>> > > > [ ] -1, Do not approve the release (please provide specific
>> comments)
>> > > >
>> > > >
>> > > > The complete staging area is available for your review, which
>> includes:
>> > > > * JIRA release notes [1],
>> > > > * the official Apache source release and binary convenience
>> releases to
>> > > be
>> > > > deployed to dist.apache.org [2], which are signed with the key with
>> > > > fingerprint 476DAA5D1FF08189 [3],
>> > > > * all artifacts to be deployed to the Maven Central Repository [4],
>> > > > * source code tag "release-1.12.3-rc1" [5],
>> > > > * website pull request listing the new release and adding
>> announcement
>> > > blog
>> > > > post [6].
>> > > >
>> > > > The vote will be open for at least 72 hours. It is adopted by
>> majority
>> > > > approval, with at least 3 PMC affirmative votes.
>> > > >
>> > > > Thanks,
>> > > > Your friendly release manager Arvid
>> > > >
>> > > > [1]
>> > > >
>> > >
>> >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12349691
>> > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.3-rc1/
>> > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> > > > [4]
>> > >
>> https://repository.apache.org/content/repositories/orgapacheflink-1419
>> > > > [5] https://github.com/apache/flink/releases/tag/release-1.12.3-rc1
>> > > > [6] https://github.com/apache/flink-web/pull/437
>> > > >
>> > >
>> > >
>> >
>>
>


Re: [VOTE] Release 1.12.3, release candidate #1

2021-04-28 Thread Arvid Heise
According to the latest exchange on "[VOTE] Release 1.13.0, release
candidate #2", it was a false alarm, so I'm going ahead with the release.

On Wed, Apr 28, 2021 at 9:26 AM Stephan Ewen  wrote:

> A quick heads-up:
>
> A fix from 1.13.0 that I backported to 1.12.3 is apparently causing some
> issues at larger batch scale.
> We are investigating this, but it would affect this release as well.
>
> Please see the mail thread "[VOTE] Release 1.13.0, release candidate #2"
> for details.
>
> If we want to make sure this release isn't affected, we revert that issue.
> The tickets are:
>   - https://issues.apache.org/jira/browse/FLINK-21996
>   - https://issues.apache.org/jira/browse/FLINK-18071
>
> On Wed, Apr 28, 2021 at 7:59 AM Robert Metzger 
> wrote:
>
> > +1 (binding)
> >
> > - started cluster, ran example job on macos
> > - sources look fine
> > - Eyeballed the diff:
> >
> https://github.com/apache/flink/compare/release-1.12.2...release-1.12.3-rc1
> > .
> > According to "git diff release-1.12.2...release-1.12.3-rc1 '*.xml'",
> there
> > was only one external dependency change (snappy-java, which seems to be
> > properly reflected in the NOTICE file)
> > - the last CI run of the "release-1.12" branch is okay-ish:
> >
> >
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17315=results
> > (this one failure occurred:
> > https://issues.apache.org/jira/browse/FLINK-20950)
> >
> >
> > On Tue, Apr 27, 2021 at 5:03 PM Dawid Wysakowicz  >
> > wrote:
> >
> > > +1 (binding)
> > >
> > > - Verified checksums and signatures
> > > - Reviewed the website PR
> > > - Built from sources
> > > - verified dependency version upgrades and updates in NOTICE files
> > > compared to 1.12.2
> > > - started cluster and run WordCount example in BATCH mode and
> everything
> > > looked good
> > >
> > > On 23/04/2021 23:52, Arvid Heise wrote:
> > > > Hi everyone,
> > > > Please review and vote on the release candidate #1 for the version
> > > 1.12.3,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint 476DAA5D1FF08189 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-1.12.3-rc1" [5],
> > > > * website pull request listing the new release and adding
> announcement
> > > blog
> > > > post [6].
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Your friendly release manager Arvid
> > > >
> > > > [1]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12349691
> > > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.3-rc1/
> > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapacheflink-1419
> > > > [5] https://github.com/apache/flink/releases/tag/release-1.12.3-rc1
> > > > [6] https://github.com/apache/flink-web/pull/437
> > > >
> > >
> > >
> >
>


[jira] [Created] (FLINK-22511) Fix the bug of non-composite result type in Python TableAggregateFunction

2021-04-28 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-22511:


 Summary: Fix the bug of non-composite result type in Python 
TableAggregateFunction
 Key: FLINK-22511
 URL: https://issues.apache.org/jira/browse/FLINK-22511
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.13.0
Reporter: Huang Xingbo
Assignee: Huang Xingbo






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22510) ConfigurationUtils always serializes Duration to nanoseconds

2021-04-28 Thread Jira
Ingo Bürk created FLINK-22510:
-

 Summary: ConfigurationUtils always serializes Duration to 
nanoseconds
 Key: FLINK-22510
 URL: https://issues.apache.org/jira/browse/FLINK-22510
 Project: Flink
  Issue Type: Improvement
Affects Versions: 1.12.2, 1.13.0
Reporter: Ingo Bürk


ConfigurationUtils#convertToString is hard-coded to serialize instances of 
Duration into nanoseconds. This often produces output which isn't really 
human-understandable. It would be good to be a bit smarter here in how this is 
serialized to make it more natural to humans.

A possible solution could be choosing the largest unit with which the value can 
still be expressed as an integer, e.g. "1 min", "35 s", "3256 ms", …
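
For illustration, one possible sketch of such a formatter (the class name, helper name, and unit labels are assumptions, not the actual ConfigurationUtils change):
{code:java}
import java.time.Duration;
import java.time.temporal.ChronoUnit;

public class DurationFormatterSketch {

    // Pick the coarsest unit that still represents the duration as an integer.
    static String formatDuration(Duration duration) {
        long nanos = duration.toNanos();
        if (nanos == 0) {
            return "0 ms";
        }
        ChronoUnit[] units = {ChronoUnit.DAYS, ChronoUnit.HOURS, ChronoUnit.MINUTES,
                ChronoUnit.SECONDS, ChronoUnit.MILLIS, ChronoUnit.MICROS, ChronoUnit.NANOS};
        String[] labels = {"d", "h", "min", "s", "ms", "µs", "ns"};
        for (int i = 0; i < units.length; i++) {
            long unitNanos = units[i].getDuration().toNanos();
            if (nanos % unitNanos == 0) {
                return (nanos / unitNanos) + " " + labels[i];
            }
        }
        return nanos + " ns"; // unreachable: the nanos unit always divides evenly
    }

    public static void main(String[] args) {
        System.out.println(formatDuration(Duration.ofSeconds(60)));  // 1 min
        System.out.println(formatDuration(Duration.ofSeconds(35)));  // 35 s
        System.out.println(formatDuration(Duration.ofMillis(3256))); // 3256 ms
    }
}
{code}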



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Stephan Ewen
Glad to hear that outcome. And no worries about the false alarm.
Thank you for doing thorough testing, this is very helpful!

On Wed, Apr 28, 2021 at 1:04 PM Caizhi Weng  wrote:

> After the investigation we found that this issue is caused by the
> implementation of connector, not by the Flink framework.
>
> Sorry for the false alarm.
>
> Stephan Ewen wrote on Wed, Apr 28, 2021 at 3:23 PM:
>
> > @Caizhi and @Becket - let me reach out to you to jointly debug this
> issue.
> >
> > I am wondering if there is some incorrect reporting of failed events?
> >
> > On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng 
> wrote:
> >
> > > -1
> > >
> > > We're testing this version on batch jobs with large (600~1000)
> > parallelisms
> > > and the following exception messages appear with high frequency:
> > >
> > > 2021-04-27 21:27:26
> > > org.apache.flink.util.FlinkException: An OperatorEvent from an
> > > OperatorCoordinator to a task was lost. Triggering task failover to
> > ensure
> > > consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> > > execution #0
> > > at
> > >
> > >
> >
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> > > at
> > >
> > >
> >
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> > > at
> > >
> > >
> >
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> > > at
> > >
> > >
> >
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> > > at
> > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> > > at
> > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> > > at
> > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> > > at
> > >
> > >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> > > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> > > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> > > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> > > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> > > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > > at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> > > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> > > at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> > > at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> > > at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> > > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > at
> > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > at
> > >
> > >
> >
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > >
> > > Becket Qin is investigating this issue.
> > >
> >
>


[jira] [Created] (FLINK-22509) ./bin/flink run -m yarn-cluster -d submission leads to IllegalStateException

2021-04-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22509:
--

 Summary: ./bin/flink run -m yarn-cluster -d submission leads to 
IllegalStateException
 Key: FLINK-22509
 URL: https://issues.apache.org/jira/browse/FLINK-22509
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.13.0, 1.14.0
Reporter: Robert Metzger


Submitting a detached, per-job YARN cluster in Flink (like this: {{./bin/flink 
run -m yarn-cluster -d  ./examples/streaming/TopSpeedWindowing.jar}}) leads to 
the following exception:

{code}
2021-04-28 11:39:00,786 INFO  org.apache.flink.yarn.YarnClusterDescriptor   
   [] - Found Web Interface 
ip-172-31-27-232.eu-central-1.compute.internal:45689 of application 
'application_1619607372651_0005'.
Job has been submitted with JobID 5543e81db9c2de78b646088891f23bfc
Exception in thread "Thread-4" java.lang.IllegalStateException: Trying to 
access closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164)
at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResource(FlinkUserCodeClassLoaders.java:183)
at 
org.apache.hadoop.conf.Configuration.getResource(Configuration.java:2570)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2783)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2758)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2638)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1100)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1707)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1688)
at 
org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
at 
org.apache.hadoop.util.ShutdownHookManager.shutdownExecutor(ShutdownHookManager.java:145)
at 
org.apache.hadoop.util.ShutdownHookManager.access$300(ShutdownHookManager.java:65)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:102)
{code}

The job is still running as expected.
Detached submission with {{./bin/flink run-application -t yarn-application -d}} 
works as expected. This is also the documented approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Caizhi Weng
After the investigation we found that this issue is caused by the
implementation of connector, not by the Flink framework.

Sorry for the false alarm.

Stephan Ewen wrote on Wed, Apr 28, 2021 at 3:23 PM:

> @Caizhi and @Becket - let me reach out to you to jointly debug this issue.
>
> I am wondering if there is some incorrect reporting of failed events?
>
> On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng  wrote:
>
> > -1
> >
> > We're testing this version on batch jobs with large (600~1000)
> parallelisms
> > and the following exception messages appear with high frequency:
> >
> > 2021-04-27 21:27:26
> > org.apache.flink.util.FlinkException: An OperatorEvent from an
> > OperatorCoordinator to a task was lost. Triggering task failover to
> ensure
> > consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> > execution #0
> > at
> >
> >
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> > at
> >
> >
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> > at
> >
> >
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> > at
> >
> >
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> > at
> >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> > at
> >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> > at
> >
> >
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> > at
> >
> >
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> > at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> > at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> > at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> > at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> > at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > at
> >
> >
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > at
> >
> >
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >
> > Becket Qin is investigating this issue.
> >
>


[jira] [Created] (FLINK-22508) add serialVersionUID for all Serializable classes in non-table modules

2021-04-28 Thread lincoln lee (Jira)
lincoln lee created FLINK-22508:
---

 Summary: add serialVersionUID for all Serializable classes in 
non-table modules
 Key: FLINK-22508
 URL: https://issues.apache.org/jira/browse/FLINK-22508
 Project: Flink
  Issue Type: Sub-task
Reporter: lincoln lee






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22507) add serialVersionUID for all Serializable classes in the table module

2021-04-28 Thread lincoln lee (Jira)
lincoln lee created FLINK-22507:
---

 Summary: add serialVersionUID for all Serializable classes in the table 
module
 Key: FLINK-22507
 URL: https://issues.apache.org/jira/browse/FLINK-22507
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Runtime
Reporter: lincoln lee






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22506) YARN job cluster stuck in retrying creating JobManager if savepoint is corrupted

2021-04-28 Thread Paul Lin (Jira)
Paul Lin created FLINK-22506:


 Summary: YARN job cluster stuck in retrying creating JobManager if 
savepoint is corrupted
 Key: FLINK-22506
 URL: https://issues.apache.org/jira/browse/FLINK-22506
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.11.3
Reporter: Paul Lin


If a non-retryable error (e.g. the savepoint is corrupted or inaccessible) 
occurs during the initialization of the job manager, the job cluster exits with an 
error code. But since it does not mark the attempt as failed, it won't be counted 
as a failed attempt, and YARN will keep retrying forever.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22505) Limit the precision of Resource

2021-04-28 Thread Yangze Guo (Jira)
Yangze Guo created FLINK-22505:
--

 Summary: Limit the precision of Resource
 Key: FLINK-22505
 URL: https://issues.apache.org/jira/browse/FLINK-22505
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Yangze Guo


In our internal deployment, we found that a high-precision {{CPUResource}} may 
cause the required resource to never be fulfilled. Think about the following 
scenario:
- The {{SlotManager}} receives a slot request with 1.001 CPU and 
decides to allocate a pending task manager with that resource spec.
- The resource manager starts a task manager and sets the CPU via dynamic 
config. In this step, we cast the {{CPUResource}} to a double value, which is where 
the precision loss happens.
- The task manager finally registers with 1.0 CPU and thus cannot deduct any 
pending task manager or fulfill the slot request.

To solve this issue, we propose to limit the precision of Resource to a safe 
value, e.g. a scale of 8, to prevent precision loss when casting to double (a small 
illustration follows the list below).
- For {{CPUResource}}, the supported scale for the CPU is 3 in k8s, while in 
Yarn the CPU must be an integer.
- For {{ExternalResource}}, the value will always be treated as an integer.
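
A minimal sketch of the kind of normalization this proposes, assuming the resource value is backed by a BigDecimal (the class and constant names are illustrative only, not Flink's actual code):
{code:java}
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ResourcePrecisionExample {
    // Cap the fractional precision so a double round trip cannot change the value
    // for typical CPU magnitudes.
    static final int MAX_SCALE = 8;

    static BigDecimal normalize(BigDecimal raw) {
        return raw.setScale(MAX_SCALE, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        // A value with more fractional digits than a double can carry does not survive the cast:
        BigDecimal requested = new BigDecimal("1.000000000000000001");
        System.out.println(BigDecimal.valueOf(requested.doubleValue()));   // 1.0 -> mismatch

        // After capping the scale, the double round trip is lossless:
        BigDecimal normalized = normalize(requested);                      // 1.00000000
        System.out.println(
                BigDecimal.valueOf(normalized.doubleValue()).compareTo(normalized) == 0); // true
    }
}
{code}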



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22503) getting error on select query in flink sql

2021-04-28 Thread Bhagi (Jira)
Bhagi created FLINK-22503:
-

 Summary: getting error on select query in flink sql 
 Key: FLINK-22503
 URL: https://issues.apache.org/jira/browse/FLINK-22503
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.12.2
 Environment: Hi Team,

I added a source table *orders* with the filesystem connector using a csv file. When I 
run select * from orders, it throws an error. Even after I changed the mode, it is 
still not working.

 !image-2021-04-28-13-33-36-784.png! 
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.client.gateway.SqlExecutionException: Results of batch 
queries can only be served in table or tableau mode

2) I used Elasticsearch as the sink, but I am not able to see the data in the 
Elasticsearch index. Can you please check and help me with these two issues?
 !image-2021-04-28-13-40-42-375.png! 
!image-2021-04-28-13-43-00-455.png! 
Reporter: Bhagi
 Attachments: image-2021-04-28-13-33-36-784.png, 
image-2021-04-28-13-40-42-375.png, image-2021-04-28-13-43-00-455.png





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22502) DefaultCompletedCheckpointStore drops unrecoverable checkpoints silently

2021-04-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-22502:
-

 Summary: DefaultCompletedCheckpointStore drops unrecoverable 
checkpoints silently
 Key: FLINK-22502
 URL: https://issues.apache.org/jira/browse/FLINK-22502
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Runtime / Coordination
Affects Versions: 1.12.2, 1.11.3, 1.13.0, 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0, 1.13.1, 1.12.4


The {{DefaultCompletedCheckpointStore.recover()}} tries to be resilient if it 
cannot recover a checkpoint (e.g. due to a transient storage outage or a 
checkpoint being corrupted). This behaviour was introduced with FLINK-7783.

The problem is that this behaviour might cause us to ignore the latest valid 
checkpoint if there is a transient problem when restoring it. This might be ok 
for at least once processing guarantees, but it clearly violates exactly once 
processing guarantees. On top of it, it is very hard to spot.

I propose to change this behaviour so that 
{{DefaultCompletedCheckpointStore.recover()}} fails if it cannot read the 
checkpoints it is supposed to read. If the {{recover}} method fails during a 
recovery, it will kill the process. This will usually restart the process which 
will retry the checkpoint recover operation. If the problem is of transient 
nature, then it should eventually succeed. In case that this problem occurs 
during an initial job submission, then the job will directly transition to a 
{{FAILED}} state.

The proposed behaviour entails that if there is a permanent problem with the 
checkpoint (e.g. corrupted checkpoint), then Flink won't be able to recover 
without the intervention of the user. I believe that this is the right decision 
because Flink can no longer give exactly once guarantees in this situation and 
a user needs to explicitly resolve this situation.
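
For illustration, a minimal sketch contrasting the two behaviours, using generic placeholder types rather than Flink's actual checkpoint store classes:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class RecoverySketch {

    // Behaviour introduced with FLINK-7783: silently skip whatever cannot be read.
    // A transient error here can hide the latest valid checkpoint.
    static <T> List<T> recoverLenient(List<Callable<T>> handles) {
        List<T> recovered = new ArrayList<>();
        for (Callable<T> handle : handles) {
            try {
                recovered.add(handle.call());
            } catch (Exception e) {
                // dropped silently
            }
        }
        return recovered;
    }

    // Proposed behaviour: propagate the failure so the process is restarted
    // and the recovery is retried, instead of continuing from an older checkpoint.
    static <T> List<T> recoverStrict(List<Callable<T>> handles) throws IOException {
        List<T> recovered = new ArrayList<>();
        for (Callable<T> handle : handles) {
            try {
                recovered.add(handle.call());
            } catch (Exception e) {
                throw new IOException("Could not recover a completed checkpoint", e);
            }
        }
        return recovered;
    }
}
{code}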



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22501) Flink HiveCatalog ParquetTable join Dim table error

2021-04-28 Thread zhanglw (Jira)
zhanglw created FLINK-22501:
---

 Summary: Flink HiveCatalog ParquetTable join Dim table error
 Key: FLINK-22501
 URL: https://issues.apache.org/jira/browse/FLINK-22501
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.12.0
Reporter: zhanglw


Caused by: java.lang.IllegalArgumentException
 at java.nio.Buffer.position(Buffer.java:244)
 at 
org.apache.flink.hive.shaded.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:424)
 at 
org.apache.flink.hive.shaded.formats.parquet.vector.ParquetDictionary.decodeToBinary(ParquetDictionary.java:59)
 at 
org.apache.flink.table.data.vector.heap.HeapBytesVector.getBytes(HeapBytesVector.java:124)
 at 
org.apache.flink.table.data.vector.VectorizedColumnBatch.getByteArray(VectorizedColumnBatch.java:97)
 at 
org.apache.flink.table.data.ColumnarRowData.getString(ColumnarRowData.java:113)
 at BatchCalc$12.processElement(Unknown Source)
 at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.pushToOperator(ChainingOutput.java:112)
 at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:93)
 at 
org.apache.flink.streaming.runtime.tasks.ChainingOutput.collect(ChainingOutput.java:39)
 at 
org.apache.flink.streaming.runtime.tasks.SourceOperatorStreamTask$AsyncDataOutputToOutput.emitRecord(SourceOperatorStreamTask.java:163)
 at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:110)
 at 
org.apache.flink.streaming.api.operators.source.SourceOutputWithWatermarks.collect(SourceOutputWithWatermarks.java:101)
 at 
org.apache.flink.connector.file.src.impl.FileSourceRecordEmitter.emitRecord(FileSourceRecordEmitter.java:45)
 at 
org.apache.flink.connector.file.src.impl.FileSourceRecordEmitter.emitRecord(FileSourceRecordEmitter.java:35)
 at 
org.apache.flink.connector.base.source.reader.SourceReaderBase.pollNext(SourceReaderBase.java:126)
 at 
org.apache.flink.streaming.api.operators.SourceOperator.emitNext(SourceOperator.java:251)
 at 
org.apache.flink.streaming.runtime.io.StreamTaskSourceInput.emitNext(StreamTaskSourceInput.java:65)
 at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:372)
 at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:575)
 at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:539)
 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547)
 at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22500) flink stop command cannot find 00000000000000000000000000000000

2021-04-28 Thread Jira
陈孝忠 created FLINK-22500:
---

 Summary: flink stop command cannot find 00000000000000000000000000000000
 Key: FLINK-22500
 URL: https://issues.apache.org/jira/browse/FLINK-22500
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: 陈孝忠


Flink 1.12.1, K8s application mode, using K8s for HA.
Both the stop and run commands report that the job cannot be found.

Below is the log of the STOP command:

Job submission started: 2021-04-28 06:18:37
Launch command: /opt/flink/bin/flink stop --target kubernetes-application 
-Dkubernetes.namespace=middle-flink 
-Dkubernetes.config.file=/opt/flink/conf/kubeconfig 
-Dkubernetes.cluster-id=test0001
Suspending job "" with a savepoint.
2021-04-28 06:18:39,574 INFO  
org.apache.flink.kubernetes.KubernetesClusterDescriptor      [] - Retrieve 
flink cluster market-data-full-chain-0427 successfully, JobManager Web 
Interface: http://market-data-full-chain-0427-rest.middle-flink:10243
The program finished with the following exception:
org.apache.flink.util.FlinkException: Could not stop with a savepoint job "".
 at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:585)
 at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1006)
 at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:573)
 at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1073)
 at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1136)
 at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
 at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1136)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job ()
 at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
 at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
 at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:583)
 ... 6 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job ()
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:989)
 at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
 at org.apache.flink.runtime.dispatcher.Dispatcher.performOperationOnJobMasterGateway(Dispatcher.java:910)
 at org.apache.flink.runtime.dispatcher.Dispatcher.stopWithSavepoint(Dispatcher.java:709)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:306)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213)
 at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
 at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
 at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
 at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
 at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
 at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
 at akka.actor.Actor.aroundReceive(Actor.scala:517)
 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
 at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
 at akka.actor.ActorCell.invoke(ActorCell.scala:561)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
 at akka.dispatch.Mailbox.run(Mailbox.scala:225)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: 

Re: [VOTE] Release 1.12.3, release candidate #1

2021-04-28 Thread Stephan Ewen
A quick heads-up:

A fix from 1.13.0 that I backported to 1.12.3 is apparently causing some
issues at larger batch scale.
We are investigating this, but it would affect this release as well.

Please see the mail thread "[VOTE] Release 1.13.0, release candidate #2"
for details.

If we want to make sure this release isn't affected, we should revert that change.
The tickets are:
  - https://issues.apache.org/jira/browse/FLINK-21996
  - https://issues.apache.org/jira/browse/FLINK-18071

On Wed, Apr 28, 2021 at 7:59 AM Robert Metzger  wrote:

> +1 (binding)
>
> - started cluster, ran example job on macos
> - sources look fine
> - Eyeballed the diff:
> https://github.com/apache/flink/compare/release-1.12.2...release-1.12.3-rc1
> .
> According to "git diff release-1.12.2...release-1.12.3-rc1 '*.xml'", there
> was only one external dependency change (snappy-java, which seems to be
> properly reflected in the NOTICE file)
> - the last CI run of the "release-1.12" branch is okay-ish:
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17315=results
> (this one failure occurred:
> https://issues.apache.org/jira/browse/FLINK-20950)
>
>
> On Tue, Apr 27, 2021 at 5:03 PM Dawid Wysakowicz 
> wrote:
>
> > +1 (binding)
> >
> > - Verified checksums and signatures
> > - Reviewed the website PR
> > - Built from sources
> > - verified dependency version upgrades and updates in NOTICE files
> > compared to 1.12.2
> > - started cluster and run WordCount example in BATCH mode and everything
> > looked good
> >
> > On 23/04/2021 23:52, Arvid Heise wrote:
> > > Hi everyone,
> > > Please review and vote on the release candidate #1 for the version
> > 1.12.3,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > > The complete staging area is available for your review, which includes:
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint 476DAA5D1FF08189 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "release-1.12.3-rc1" [5],
> > > * website pull request listing the new release and adding announcement
> > blog
> > > post [6].
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Your friendly release manager Arvid
> > >
> > > [1]
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12349691
> > > [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.3-rc1/
> > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > [4]
> > https://repository.apache.org/content/repositories/orgapacheflink-1419
> > > [5] https://github.com/apache/flink/releases/tag/release-1.12.3-rc1
> > > [6] https://github.com/apache/flink-web/pull/437
> > >
> >
> >
>


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Stephan Ewen
@Caizhi and @Becket - let me reach out to you to jointly debug this issue.

I am wondering if there is some incorrect reporting of failed events?

On Wed, Apr 28, 2021 at 8:53 AM Caizhi Weng  wrote:

> -1
>
> We're testing this version on batch jobs with large (600~1000) parallelisms
> and the following exception messages appear with high frequency:
>
> 2021-04-27 21:27:26
> org.apache.flink.util.FlinkException: An OperatorEvent from an
> OperatorCoordinator to a task was lost. Triggering task failover to ensure
> consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
> execution #0
> at
>
> org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
> at
>
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
> at
>
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
> at
>
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> at
>
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> at
>
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
>
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
>
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Becket Qin is investigating this issue.
>


Re: [VOTE] Release 1.13.0, release candidate #2

2021-04-28 Thread Caizhi Weng
-1

We're testing this version on batch jobs with large (600~1000) parallelisms
and the following exception messages appear with high frequency:

2021-04-27 21:27:26
org.apache.flink.util.FlinkException: An OperatorEvent from an
OperatorCoordinator to a task was lost. Triggering task failover to ensure
consistency. Event: '[NoMoreSplitEvent]', targetTask:  -
execution #0
at
org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$0(SubtaskGatewayImpl.java:81)
at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Becket Qin is investigating this issue.