[ https://issues.apache.org/jira/browse/SAMZA-783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092519#comment-15092519 ]
Navina Ramesh commented on SAMZA-783:
-------------------------------------
[~djchooy] This could be a symptom of the problem we observed and fixed in
SAMZA-843, but we cannot be sure until we see the configs and the logs.
If a socket timeout occurs while the container reads the JobModel from the Job
Coordinator, the container will fail. However, the AM will try to restart the
container on another machine. The job should not fail unless the same
container fails with a socket timeout more than the allowed number of times
(configured by yarn.container.retry.count; the default is 8).
Unfortunately, the fix in SAMZA-843 is not available in the 0.10.0 release. If
you are building from the master branch and still see a similar issue, please
let us know and we will investigate further.
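To make the retry accounting above concrete, here is a minimal sketch in Java. It is only an illustration of the rule as described in this comment, not the actual AM code: the class name ContainerRetryPolicy and its method are hypothetical, and the sketch ignores any other conditions the real AM may apply when counting failures.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the rule described above; NOT Samza's AM implementation.
class ContainerRetryPolicy {
    // Plays the role of yarn.container.retry.count (default 8 per the comment).
    private final int retryCount;
    // Number of failures observed so far, keyed by container ID.
    private final Map<Integer, Integer> failures = new HashMap<>();

    ContainerRetryPolicy(int retryCount) {
        this.retryCount = retryCount;
    }

    // Records one failure of the given container and returns true when that
    // container has now failed more than retryCount times, i.e. when the job
    // should be failed instead of restarting the container again.
    boolean recordFailureAndCheckJobFailed(int containerId) {
        int count = failures.merge(containerId, 1, Integer::sum);
        return count > retryCount;
    }
}
```

Under this sketch, with a retry count of 8, the first eight failures of container 2 each lead to a restart and the ninth fails the job; failures of other containers are counted separately.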
> Unable to connect to Job coordinator server
> -------------------------------------------
>
> Key: SAMZA-783
> URL: https://issues.apache.org/jira/browse/SAMZA-783
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.10.0
> Environment: Linux 2.6.32-504.el6.x86_64
> redhat-release-server-6Server-6.6.0.2.el6.x86_64
> Oracle Java 1.8
> Yarn 2.7.1
> Kafka 0.8.2
> Samza 0.10
> Reporter: Edi Bice
>
> The following repeats for every container launched until the job is failed:
> LogType:samza-container-2.log
> Log Upload Time:Wed Sep 30 18:41:45 +0000 2015
> LogLength:3465
> Log Contents:
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Got container ID: 2
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Got coordinator URL: http://10.49.215.34:38484/
> 2015-09-30 18:26:40 SamzaContainer$ [INFO] Fetching configuration from: http://10.49.215.34:38484/
> 2015-09-30 18:27:41 Util$ [ERROR] Unable to connect to Job coordinator server, received exception
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1535)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at org.apache.samza.util.Util$$anonfun$read$1.apply(Util.scala:131)
> at org.apache.samza.util.Util$$anonfun$read$1.apply(Util.scala:130)
> at org.apache.samza.util.ExponentialSleepStrategy.run(ExponentialSleepStrategy.scala:82)
> at org.apache.samza.util.Util$.read(Util.scala:130)
> at org.apache.samza.container.SamzaContainer$.readJobModel(SamzaContainer.scala:112)
> at org.apache.samza.container.SamzaContainer$.safeMain(SamzaContainer.scala:86)
> at org.apache.samza.container.SamzaContainer$.main(SamzaContainer.scala:69)
> at org.apache.samza.container.SamzaContainer.main(SamzaContainer.scala)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)