[
https://issues.apache.org/jira/browse/FLINK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577887#comment-16577887
]
Gary Yao commented on FLINK-10104:
----------------------------------
Hi [~fsimond],
I assume you are using Hortonworks HDP 2.5. I was not able to reproduce your
symptoms on their VM. Then I had a deeper look at the logs, in which I see many
occurrences of:
{noformat}
No open TaskExecutor connection <CONTAINER_ID>. Ignoring close TaskExecutor
connection.
{noformat}
This is logged in {{ResourceManager#closeTaskManagerConnection}} [1], but
unfortunately we do not log the exception there. I suspect that the method is
called from {{YarnResourceManager#onContainersCompleted}} [2], which is a
callback invoked by YARN when a container completes. Because there is only a
single TaskManager log in your file (the one from the TaskManager that
eventually ran the job), I assume that the containers are stopped for reasons
outside of Flink's control (maybe a problem related to your YARN setup).
I would suggest the following things for further troubleshooting:
* Add improved logging to Flink, and build a custom Flink distribution [3]. For
example, log the {{ContainerStatus}} instances in {{onContainersCompleted}}.
The {{ContainerStatus}} has a diagnostics string that can be helpful.
* If the improved logging does not help, check YARN logs for hints on why the
containers exited.
* Try deploying using the Apache Hadoop distribution.
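To illustrate the first suggestion, here is a minimal, self-contained sketch of the kind of log line one might build for each completed container. It is not the actual Flink code: {{StatusView}} is a hypothetical stand-in for YARN's {{ContainerStatus}} (whose getters {{getContainerId()}}, {{getState()}}, {{getExitStatus()}} and {{getDiagnostics()}} carry the same information), and the sample values in {{main}} are made up.

```java
import java.util.Arrays;
import java.util.List;

class ContainerStatusLogging {

    /**
     * Hypothetical stand-in for org.apache.hadoop.yarn.api.records.ContainerStatus,
     * reduced to the fields that are useful to log. Not part of Flink or Hadoop.
     */
    static final class StatusView {
        final String containerId;
        final String state;
        final int exitStatus;
        final String diagnostics;

        StatusView(String containerId, String state, int exitStatus, String diagnostics) {
            this.containerId = containerId;
            this.state = state;
            this.exitStatus = exitStatus;
            this.diagnostics = diagnostics;
        }
    }

    /** Builds one log line per completed container, including the diagnostics string. */
    static String describe(StatusView status) {
        String diagnostics = status.diagnostics == null ? "" : status.diagnostics.trim();
        if (diagnostics.isEmpty()) {
            // Make a missing diagnostics string visible instead of printing nothing.
            diagnostics = "<none>";
        }
        return String.format(
                "Container %s completed: state=%s, exitStatus=%d, diagnostics=%s",
                status.containerId, status.state, status.exitStatus, diagnostics);
    }

    public static void main(String[] args) {
        // Made-up sample data; in a custom Flink build the values would come from
        // the ContainerStatus instances passed to onContainersCompleted.
        List<StatusView> completed = Arrays.asList(
                new StatusView("container_A", "COMPLETE", 1, "example diagnostics text"),
                new StatusView("container_B", "COMPLETE", 0, null));

        for (StatusView status : completed) {
            // In Flink this would go through the class's logger rather than stdout.
            System.out.println(describe(status));
        }
    }
}
```

In a custom build, the same formatting would be applied inside {{onContainersCompleted}} before the TaskManager connection is closed, so that the exit status and diagnostics end up in the JobManager log.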
Best,
Gary
[1]
https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L797
[2]
https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L339
[3] https://ci.apache.org/projects/flink/flink-docs-master/start/building.html
> Job super slow to start
> -----------------------
>
> Key: FLINK-10104
> URL: https://issues.apache.org/jira/browse/FLINK-10104
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.2
> Reporter: Florian
> Priority: Major
> Attachments: flink2.log
>
>
> Following a discussion on another topic with [~GJL]
> ([http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Could-not-build-the-program-from-JAR-file-td22102.html]),
> it seems that there is a bug, as my job is very slow to start.
> I am using Flink to process messages from an input topic, and to redirect
> them to two output topics, and when I start the job, I have to wait between 5
> and 10 minutes before I get anything into the output topic. With version
> 1.4.2, it was much faster.
> I run the job on YARN and, as asked by Gary, I attached the output of
> {{yarn logs -applicationId <appId>}}.
>
> Also, as you can see in the logs, the reported version is 0.1
> Rev:1a9b648. I have no clue why, as I downloaded the official Flink 1.5.2
> distribution.
>
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)