[ 
https://issues.apache.org/jira/browse/FLINK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577887#comment-16577887
 ] 

Gary Yao commented on FLINK-10104:
----------------------------------

Hi [~fsimond],

I assume you are using Hortonworks HDP 2.5. I could not reproduce your
symptoms on their VM. I then took a closer look at the logs, in which I see many
occurrences of:
{noformat}
No open TaskExecutor connection <CONTAINER_ID>. Ignoring close TaskExecutor connection.
{noformat}
This is logged in {{ResourceManager#closeTaskManagerConnection}} [1], but
unfortunately we do not log the exception there. I suspect that the method is
called from {{YarnResourceManager#onContainersCompleted}} [2], a callback that
YARN invokes when a container completes. Because there is only a single
TaskManager log in your file (the one that successfully ran the job), I assume
that the containers are being stopped for reasons outside of Flink's
control (maybe a problem related to your YARN setup).

I would suggest the following things for further troubleshooting: 

* Add improved logging to Flink, and build a custom Flink distribution [3]. For 
example, log the {{ContainerStatus}} instances in {{onContainersCompleted}}. 
The {{ContainerStatus}} has a diagnostics string that can be helpful. 
* If the improved logging does not help, check YARN logs for hints on why the 
containers exited.
* Try deploying using the Apache Hadoop distribution.
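
As a minimal sketch of the first suggestion, the snippet below shows the kind of per-container log line that could be added in {{onContainersCompleted}}. To stay self-contained it uses a hypothetical {{StatusView}} stand-in rather than YARN's {{ContainerStatus}} (mirroring its {{getContainerId}}, {{getState}}, {{getExitStatus}} and {{getDiagnostics}} getters), and {{java.util.logging}} rather than Flink's own logger. The class and method names are illustrative only and are not part of Flink.

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for org.apache.hadoop.yarn.api.records.ContainerStatus.
// The real class returns a ContainerId and a ContainerState enum; plain strings
// are used here so that the sketch compiles without Hadoop on the classpath.
final class StatusView {
    private final String containerId;
    private final String state;
    private final int exitStatus;
    private final String diagnostics;

    StatusView(String containerId, String state, int exitStatus, String diagnostics) {
        this.containerId = containerId;
        this.state = state;
        this.exitStatus = exitStatus;
        this.diagnostics = diagnostics;
    }

    String getContainerId() { return containerId; }
    String getState() { return state; }
    int getExitStatus() { return exitStatus; }
    String getDiagnostics() { return diagnostics; }
}

// Illustrative helper, not part of Flink.
final class ContainerStatusLogging {
    private static final Logger LOG =
            Logger.getLogger(ContainerStatusLogging.class.getName());

    // Builds one log line containing every field that may explain the exit.
    static String describe(StatusView status) {
        return String.format(
                "Container %s completed: state=%s, exitStatus=%d, diagnostics=%s",
                status.getContainerId(),
                status.getState(),
                status.getExitStatus(),
                status.getDiagnostics());
    }

    // Assumed to be called at the start of the completion callback.
    static void logCompleted(List<StatusView> statuses) {
        for (StatusView status : statuses) {
            LOG.info(describe(status));
        }
    }
}
```

In the actual callback, the same fields would be read from each {{ContainerStatus}} in the list that YARN passes in.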

Best,
Gary

[1] https://github.com/apache/flink/blob/release-1.5.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L797

[2] https://github.com/apache/flink/blob/release-1.5.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L339

[3] https://ci.apache.org/projects/flink/flink-docs-master/start/building.html



> Job super slow to start
> -----------------------
>
>                 Key: FLINK-10104
>                 URL: https://issues.apache.org/jira/browse/FLINK-10104
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.2
>            Reporter: Florian
>            Priority: Major
>         Attachments: flink2.log
>
>
> Following a discussion on another topic with [~GJL]
> (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Could-not-build-the-program-from-JAR-file-td22102.html),
> it seems that there is a bug, as my job is very slow to start.
> I am using Flink to process messages from an input topic, and to redirect 
> them to two output topics, and when I start the job, I have to wait between 5 
> and 10 minutes before I get anything into the output topic. With version 
> 1.4.2, it was much faster.
> I run the job on YARN and, as asked by Gary, I attached the output of
> {{yarn logs -applicationId <appId>}}.
>  
> Also, as you can see from the logs, the reported version is 0.1
> Rev:1a9b648. I have no clue why, as I downloaded the official Flink 1.5.2
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
