[ 
https://issues.apache.org/jira/browse/SAMZA-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092519#comment-15092519
 ] 

Navina Ramesh commented on SAMZA-783:
-------------------------------------

[~djchooy] This could be a symptom of the problem we observed and fixed in 
SAMZA-843. We cannot be sure until we see the configs and the log. 

If a socket timeout occurs while the container reads the JobModel from the Job 
Coordinator, the container will fail. However, the AM will try to restart the 
container on another machine. Unless the same container fails with a socket 
timeout more times than the allowed number of container failures (configured 
by yarn.container.retry.count, default 8), the job should not fail. 
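
To make the retry rule concrete, here is a minimal sketch of the counting 
logic. It is an illustration only, not Samza's actual AM code: the class 
ContainerRetryTracker and the method recordFailure are hypothetical names, and 
the sketch ignores any timing between failures that the real AM may take into 
account. 

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of per-container failure counting; not Samza code. */
public class ContainerRetryTracker {
    // Stands in for the value of yarn.container.retry.count (assumed here to be 8).
    private final int retryCount;
    // Number of failures seen so far, keyed by container ID.
    private final Map<Integer, Integer> failures = new HashMap<>();

    public ContainerRetryTracker(int retryCount) {
        this.retryCount = retryCount;
    }

    /** Records one failure and returns true once the container has failed more than retryCount times. */
    public boolean recordFailure(int containerId) {
        int count = failures.merge(containerId, 1, Integer::sum);
        return count > retryCount;
    }

    public static void main(String[] args) {
        ContainerRetryTracker tracker = new ContainerRetryTracker(8);
        for (int attempt = 1; attempt <= 9; attempt++) {
            boolean failJob = tracker.recordFailure(2);
            System.out.println("failure " + attempt + " of container 2, fail job: " + failJob);
        }
    }
}
```

Under these assumptions, the first eight failures of container 2 only trigger 
a restart, and the ninth is the one that fails the job. Failures of other 
containers are counted separately. 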

Unfortunately, the fix in SAMZA-843 is not available in the 0.10.0 release. If 
you are building from the master branch and still seeing a similar issue, 
please let us know. We will investigate further. 

> Unable to connect to Job coordinator server
> -------------------------------------------
>
>                 Key: SAMZA-783
>                 URL: https://issues.apache.org/jira/browse/SAMZA-783
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.10.0
>         Environment: Linux 2.6.32-504.el6.x86_64
> redhat-release-server-6Server-6.6.0.2.el6.x86_64
> Oracle Java 1.8
> Yarn 2.7.1
> Kafka 0.8.2
> Samza 0.10
>            Reporter: Edi Bice
>
> The following repeats for every container launched until the job is failed:
> LogType:samza-container-2.log
> Log Upload Time:Wed Sep 30 18:41:45 +0000 2015
> LogLength:3465
> Log Contents:
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Got container ID: 2
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Got coordinator URL: http://10.49.215.34:38484/
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Fetching configuration from: http://10.49.215.34:38484/
> 2015-09-30 18:27:41 Util$ [ERROR] Unable to connect to Job coordinator server, received exception
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:170)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1535)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
>         at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>         at org.apache.samza.util.Util$$anonfun$read$1.apply(Util.scala:131)
>         at org.apache.samza.util.Util$$anonfun$read$1.apply(Util.scala:130)
>         at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82)
>         at org.apache.samza.util.Util$.read(Util.scala:130)
>         at org.apache.samza.container.SamzaContainer$.readJobModel(SamzaContainer.scala:112)
>         at org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:86)
>         at org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:69)
>         at org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)



