There is a retry for the 'complete' operation - those are erroring out
as well. (DFSClient.java: methodNameToPolicyMap.put("complete",
methodPolicy);)

Quite likely it's because the namenode is also a data/task node.
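For reference, the wiring around that line in DFSClient uses Hadoop's
org.apache.hadoop.io.retry package: only method names present in the map
get a retry policy, and anything absent fails on the first RPC error. A
rough sketch of the pattern from memory - the class name, method name,
and the retry count/sleep values here are illustrative, not the actual
DFSClient code:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.dfs.ClientProtocol;
    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;
    import org.apache.hadoop.io.retry.RetryProxy;

    public class RetryWiringSketch {
      // Wrap the raw namenode RPC proxy so that only methods listed in
      // the map are retried; anything absent fails on the first error.
      static ClientProtocol wrapWithRetries(ClientProtocol rawNamenode) {
        // Illustrative policy: up to 5 attempts, 1 second apart.
        RetryPolicy methodPolicy =
            RetryPolicies.retryUpToMaximumCountWithFixedSleep(
                5, 1, TimeUnit.SECONDS);

        Map<String, RetryPolicy> methodNameToPolicyMap =
            new HashMap<String, RetryPolicy>();
        methodNameToPolicyMap.put("complete", methodPolicy);
        // No entries for "addBlock" or "mkdirs": with no policy in the
        // map, those calls are never retried (see dhruba's note below).

        return (ClientProtocol) RetryProxy.create(
            ClientProtocol.class, rawNamenode, methodNameToPolicyMap);
      }
    }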
-----Original Message-----
From: Dhruba Borthakur [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 13, 2007 1:38 PM
To: [email protected]
Subject: RE: ipc.client.timeout

Hi Jaydeep,

The idea is to retry only those operations that are idempotent.
addBlocks and mkdirs are non-idempotent, and that's why there are no
retries for these calls.

Can you tell me if a CPU bottleneck on your Namenode is causing you to
encounter all these timeouts?

Thanks,
dhruba

-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]]
Sent: Thursday, September 13, 2007 12:14 PM
To: [email protected]
Subject: RE: ipc.client.timeout

I would love to use a lower timeout. It seems that retries are either
buggy or missing in some cases - and that causes lots of failures. The
cases I can see right now (0.13.1):

- namenode.complete: looks like it retries - but may not be idempotent?

  org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
  complete write to file
  /user/facebook/profiles/binary/users_joined/_task_0018_r_000003_0/.part-00003.crc
  by DFSClient_task_0018_r_000003_0
      at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:353)

- namenode.addBlock: no retry policy (looking at DFSClient.java)

- namenode.mkdirs: no retry policy ('')

We see plenty of all of these with a lowered timeout. With a high
timeout, we have seen very slow recovery from some failures (jobs would
hang on submission).

I don't understand the fs protocol well enough - any idea if these are
fixable?

Thx,
Joydeep

-----Original Message-----
From: Devaraj Das [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 05, 2007 1:00 AM
To: [email protected]
Subject: RE: ipc.client.timeout

This is to take care of cases where a particular server is too loaded
to respond to client RPCs quickly enough. Setting the timeout to a large
value ensures that RPCs won't time out as often, which potentially leads
to fewer failures and retries (e.g., a map/reduce task kills itself when
it fails three times in a row to invoke an RPC on the tasktracker).

> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 05, 2007 12:26 PM
> To: [email protected]
> Subject: ipc.client.timeout
>
> The default is set to 60s. Many of my dfs -put commands would seem to
> hang - and lowering the timeout (to 1s) seems to have made things a
> whole lot better.
>
> General curiosity - isn't 60s just huge for an RPC timeout? (A web
> search indicates that nutch may be setting it to 10s - and even that
> seems fairly large.) Would love to get a backgrounder on why the
> default is set to so large a value.
>
> Thanks,
>
> Joydeep
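Following up on the original question: ipc.client.timeout is an ordinary
client-side configuration property, measured in milliseconds (the 60s
default corresponds to 60000). A minimal sketch of overriding it
programmatically - the 10000 ms value and the class name are just for
illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowTimeoutCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the 60000 ms (60s) default discussed above.
        conf.setInt("ipc.client.timeout", 10000);  // 10s, illustrative
        FileSystem fs = FileSystem.get(conf);
        // RPC-backed calls now fail faster when the namenode is slow,
        // handing control to whatever retry policy covers the method.
        System.out.println(fs.exists(new Path("/")));
      }
    }

The same override can be placed in hadoop-site.xml so that command-line
tools such as dfs -put pick it up as well.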
