Hi, 

I am facing a problem where the AM goes down once the job has finished, but
the API keeps polling the AM for the JobStatus/Report and hits a SocketTimeout.
By default, ipc.client.connect.max.retries.on.timeouts is set to 45, which is
very high in this case. JobClient could override this in its own
configuration, but that would affect the whole IPC level.

Looking through the code, a similar case is already handled in the
ClientServiceDelegate class, which provides a YARN-level configuration that
overrides the IPC-level configuration. It would be better if the same were
done for the property mentioned above.
This is the snippet in the ClientServiceDelegate constructor that handles the
similar case:

this.conf.setInt(
        CommonConfigurationKeysPublic.IPC_CLIENT_CONNECT_MAX_RETRIES_KEY,
        this.conf.getInt(MRJobConfig.MR_CLIENT_TO_AM_IPC_MAX_RETRIES,
            MRJobConfig.DEFAULT_MR_CLIENT_TO_AM_IPC_MAX_RETRIES));
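
To make the request concrete, here is a minimal sketch of the same override pattern applied to the timeout retries. It uses a plain Map instead of the Hadoop Configuration class so that it runs on its own. The MR-level key name and its default of 3 are assumptions made up for illustration, not existing Hadoop constants; only the IPC key is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

final class TimeoutRetryOverride {
    // IPC-level key, as quoted in this mail.
    static final String IPC_MAX_RETRIES_ON_TIMEOUTS =
        "ipc.client.connect.max.retries.on.timeouts";

    // Hypothetical MR-level key and default (assumptions, not real constants).
    static final String MR_CLIENT_TO_AM_MAX_RETRIES_ON_TIMEOUTS =
        "example.mr.client-am.ipc.max-retries-on-timeouts";
    static final int DEFAULT_MR_CLIENT_TO_AM_MAX_RETRIES_ON_TIMEOUTS = 3;

    // Returns a private copy of the job configuration in which the IPC-level
    // key is overwritten with the MR-level value (or its default). The
    // caller's configuration is left untouched, so other IPC clients keep
    // their own setting.
    static Map<String, String> forClientToAm(Map<String, String> jobConf) {
        Map<String, String> copy = new HashMap<>(jobConf);
        String value = jobConf.get(MR_CLIENT_TO_AM_MAX_RETRIES_ON_TIMEOUTS);
        int retries = (value == null)
            ? DEFAULT_MR_CLIENT_TO_AM_MAX_RETRIES_ON_TIMEOUTS
            : Integer.parseInt(value.trim());
        copy.put(IPC_MAX_RETRIES_ON_TIMEOUTS, Integer.toString(retries));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(IPC_MAX_RETRIES_ON_TIMEOUTS, "45");
        Map<String, String> amConf = forClientToAm(jobConf);
        System.out.println("client-to-AM retries on timeout: "
            + amConf.get(IPC_MAX_RETRIES_ON_TIMEOUTS));
        System.out.println("job-wide retries on timeout: "
            + jobConf.get(IPC_MAX_RETRIES_ON_TIMEOUTS));
    }
}
```

In ClientServiceDelegate itself this would presumably be one more this.conf.setInt(...) call next to the existing one, since the class already works on its own copy of the configuration.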

The stack trace of the client while it keeps polling and hitting the SocketTimeout:

Daemon Thread [MrPlanRunner] (Suspended (breakpoint at line 682 in Client$Connection))
        Client$Connection.handleConnectionFailure(int, int, IOException) line: 682
        Client$Connection.setupConnection() line: 484
        Client$Connection.setupIOstreams() line: 565
        Client$Connection.access$2000(Client$Connection) line: 213
        Client.getConnection(Client$ConnectionId, Client$Call) line: 1269
        Client.call(RpcPayloadHeader$RpcKind, Writable, Client$ConnectionId) line: 1139
        Client.call(Writable, Client$ConnectionId) line: 1122
        ProtoOverHadoopRpcEngine$Invoker.invoke(Object, Method, Object[]) line: 148
        $Proxy42.getJobReport(RpcController, MRServiceProtos$GetJobReportRequestProto) line: not available
        MRClientProtocolPBClientImpl.getJobReport(GetJobReportRequest) line: 111
        GeneratedMethodAccessor11.invoke(Object, Object[]) line: not available
        DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 25
        Method.invoke(Object, Object...) line: 597
        ClientServiceDelegate.invoke(String, Class, Object) line: 296
        ClientServiceDelegate.getJobStatus(JobID) line: 373
        YARNRunner.getJobStatus(JobID) line: 483
        Job$1.run() line: 322
        Job$1.run() line: 319
        AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
        Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 396
        UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1177
        Job.updateStatus() line: 319
        Job.isComplete() line: 598
        Job.monitorAndPrintJob() line: 1280
        JobClient$NetworkedJob.monitorAndPrintJob() line: 432
        JobClient.monitorAndPrintJob(JobConf, RunningJob) line: 902
        

Any thoughts?


Cheers,
Subroto Sanyal
