Hi, I am facing a problem where the AM goes done once it is finished but, the API keeps on polling to AM for JobStatus/Report and faces SocketTimeout By default ipc.client.connect.max.retries.on.timeouts is set to 45 which is very high in this case. JobClient can very well update this in its configuration but, it effects at whole IPC level.
By going through the code same sort of stuff is handled in ClientServiceDelegate class by providing a Yarn Level configuration to over-ride IPC level configuration. It would be better if the same is done for the mentioned property. The code snippet where the problem is handled in ClientServiceDelegate constructor: this.conf.setInt( CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY, this.conf.getInt(MRJobConfig.MR_CLIENT_TO_AM_IPC_MAX_RETRIES, MRJobConfig.DEFAULT_MR_CLIENT_TO_AM_IPC_MAX_RETRIES)); The stack trace for Client keep on polling on SocketTimeout: Daemon Thread [MrPlanRunner] (Suspended (breakpoint at line 682 in Client$Connection)) Client$Connection.handleConnectionFailure(int, int, IOException) line: 682 Client$Connection.setupConnection() line: 484 Client$Connection.setupIOstreams() line: 565 Client$Connection.access$2000(Client$Connection) line: 213 Client.getConnection(Client$ConnectionId, Client$Call) line: 1269 Client.call(RpcPayloadHeader$RpcKind, Writable, Client$ConnectionId) line: 1139 Client.call(Writable, Client$ConnectionId) line: 1122 ProtoOverHadoopRpcEngine$Invoker.invoke(Object, Method, Object[]) line: 148 $Proxy42.getJobReport(RpcController, MRServiceProtos$GetJobReportRequestProto) line: not available MRClientProtocolPBClientImpl.getJobReport(GetJobReportRequest) line: 111 GeneratedMethodAccessor11.invoke(Object, Object[]) line: not available DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25 Method.invoke(Object, Object...) line: 597 ClientServiceDelegate.invoke(String, Class, Object) line: 296 ClientServiceDelegate.getJobStatus(JobID) line: 373 YARNRunner.getJobStatus(JobID) line: 483 Job$1.run() line: 322 Job$1.run() line: 319 AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method] Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 396 UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1177 Job.updateStatus() line: 319 Job.isComplete() line: 598 Job.monitorAndPrintJob() line: 1280 JobClient$NetworkedJob.monitorAndPrintJob() line: 432 JobClient.monitorAndPrintJob(JobConf, RunningJob) line: 902 Any thoughts!!!!!! Cheers, Subroto Sanyal