Can idempotency be retrofitted with a client-generated random key? That way, the server can remember recent transaction keys (say, the last minute of keys) and ignore redundant requests.
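As a concrete sketch of what I mean (hypothetical plain Java, not actual Hadoop code - all names are made up): the server remembers when each key was first seen, evicts keys older than the window, and executes a request only if its key is new.

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical server-side dedup cache for client-generated transaction
// keys. Keys seen within the last minute are remembered; a request whose
// key has already been processed is treated as a redundant retry.
public class RecentKeyCache {
    private static final long WINDOW_MS = 60_000; // remember ~1 minute of keys

    // insertion-ordered map: transaction key -> time first seen (ms)
    private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

    // Returns true if this key is new (execute the request),
    // false if it is a redundant retry (ignore it).
    public synchronized boolean firstTime(String txnKey) {
        long now = System.currentTimeMillis();
        evictOlderThan(now - WINDOW_MS);
        return seen.putIfAbsent(txnKey, now) == null;
    }

    // Entries are in insertion order, so the oldest keys come first;
    // stop at the first entry that is still inside the window.
    private void evictOlderThan(long cutoff) {
        Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator();
        while (it.hasNext() && it.next().getValue() < cutoff) {
            it.remove();
        }
    }
}

The one-minute window only has to cover the client's retry horizon, so the cache stays small even on a busy namenode.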
On 9/13/07 1:37 PM, "Dhruba Borthakur" <[EMAIL PROTECTED]> wrote:

> Hi Joydeep,
>
> The idea is to retry only those operations that are idempotent. addBlocks
> and mkdirs are non-idempotent, and that's why there are no retries for
> these calls.
>
> Can you tell me if a CPU bottleneck on your Namenode is causing you to
> encounter all these timeouts?
>
> Thanks,
> dhruba
>
>
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 13, 2007 12:14 PM
> To: [email protected]
> Subject: RE: ipc.client.timeout
>
> I would love to use a lower timeout. It seems that retries are either
> buggy or missing in some cases - which causes lots of failures. The cases
> I can see right now (0.13.1):
>
> - namenode.complete: looks like it retries - but may not be idempotent?
>
>   org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
>   complete write to file
>   /user/facebook/profiles/binary/users_joined/_task_0018_r_000003_0/.part-00003.crc
>   by DFSClient_task_0018_r_000003_0
>     at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:353)
>
> - namenode.addBlock: no retry policy (looking at DFSClient.java)
> - namenode.mkdirs: no retry policy ('')
>
> We see plenty of all of these with a lowered timeout. With a high
> timeout - we have seen very slow recovery from some failures (jobs would
> hang on submission).
>
> Don't understand the fs protocol well enough - any idea if these are
> fixable?
>
> Thx,
>
> Joydeep
>
>
> -----Original Message-----
> From: Devaraj Das [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 05, 2007 1:00 AM
> To: [email protected]
> Subject: RE: ipc.client.timeout
>
> This is to take care of cases where a particular server is too loaded to
> respond to client RPCs quickly enough. Setting the timeout to a large
> value ensures that RPCs won't time out that often, and thereby
> potentially leads to fewer failures and retries (e.g., a map/reduce task
> kills itself when it fails to invoke an RPC on the tasktracker three
> times in a row).
>
>> -----Original Message-----
>> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, September 05, 2007 12:26 PM
>> To: [email protected]
>> Subject: ipc.client.timeout
>>
>> The default is set to 60s. Many of my dfs -put commands would
>> seem to hang - and lowering the timeout (to 1s) seems to
>> have made things a whole lot better.
>>
>> General curiosity - isn't 60s just huge for an RPC timeout? (A
>> web search indicates that nutch may be setting it to 10s -
>> and even that seems fairly large.) Would love to get a
>> backgrounder on why the default is set to so large a value.
>>
>> Thanks,
>>
>> Joydeep
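For what it's worth, the "retry only idempotent operations" policy described above can be expressed generically with a dynamic proxy. The sketch below is hypothetical plain Java, not the actual DFSClient retry code: methods whose names are in the idempotent set get retried, everything else (addBlock, mkdirs, ...) gets exactly one attempt, since replaying it could corrupt namespace state.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Set;

// Hypothetical retry wrapper: retries a failed call only if the method
// is declared idempotent; non-idempotent calls fail fast.
public class IdempotentRetryProxy implements InvocationHandler {
    private final Object target;
    private final Set<String> idempotentMethods;
    private final int maxRetries;

    private IdempotentRetryProxy(Object target, Set<String> idempotentMethods,
                                 int maxRetries) {
        this.target = target;
        this.idempotentMethods = idempotentMethods;
        this.maxRetries = maxRetries;
    }

    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> iface, T target,
                             Set<String> idempotentMethods, int maxRetries) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[] { iface },
                new IdempotentRetryProxy(target, idempotentMethods, maxRetries));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Idempotent calls get maxRetries extra attempts; others get one.
        int attempts = idempotentMethods.contains(method.getName()) ? maxRetries + 1 : 1;
        Throwable last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                last = e.getCause(); // e.g. a timeout; loop retries if allowed
            }
        }
        throw last;
    }
}

Wrapping the client-side protocol stub would then look something like wrap(ClientProtocol.class, rawStub, idempotentSet, 3), with the idempotent set chosen per protocol - again, just an illustration, not how 0.13.1 actually does it.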
