[ https://issues.apache.org/jira/browse/HADOOP-2757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712650#action_12712650 ]
dhruba borthakur commented on HADOOP-2757:
------------------------------------------

> 1. rpc timeout: the patch seems to implement a read timeout not rpc timeout.

As an administrator of a cluster, I find it easier to set a time limit for an rpc connection to bail out if it is not receiving response data continuously. I could change it to a true rpcTimeout, but RPCs like "dfsadmin -report" could genuinely take a long time, because the amount of data to be transferred can be huge depending on the size of the cluster. I am comfortable configuring a cluster so that if an rpc client has been waiting for more data from the rpc server for more than 30 seconds, the client can safely assume that the server is non-responsive. This works even for RPCs that transfer large amounts of data. Do you agree? (See the first sketch appended below.)

> 2. if we have rpc timeout, why we still need soft mount timeout in
> leaseChecker?

I think we need these two things to be separate. Please see the answer to 3a below.

> 3. I think the check "if (now > last + softMountTimeout)" could easily be
> true in normal cases if renewFrequency is set to be the soft mount timeout.

The code sets renewFrequency to softMountTimeout/3, so "renewFrequency is set to be the soft mount timeout" cannot happen. But I will modify this portion of the code to handle that case better. (See the second sketch appended below.)

> 3a. I feel that the meaning of soft mount timeout is maybe not clear.

The NFS manual says something like: "The softmount timeout sets the time the NFS client will wait for a request to complete." To make things clearer, this patch keeps two configuration values:

ipc.client.inactivity.timeout: the period of inactivity tolerated while a client is waiting for a response.
dfs.softmount.timeout: the maximum time a DFSClient will wait for a request to complete successfully.

ipc.client.inactivity.timeout applies to a single rpc call; dfs.softmount.timeout applies to FileSystem operations like DFSClient.close(). (See the third sketch appended below.)

> 4. In the file close case, would it be better just to limit the number of
> retries?

In fact, I first deployed a version of the code in our cluster that capped the number of retries at 5. But when I was explaining this behaviour to an app-writer who is building an app on top of hdfs, it was difficult to explain what that limit really means. I found it easier to explain that "this call will not take more than 30 seconds". Also, specifying a time is future-proof, in the sense that an hdfs developer can change the frequency of close-retries without affecting the semantics exposed to the user. (See the fourth sketch appended below.) If you feel strongly against this one, I can change it; please do let me know.

Thanks for reviewing this one.

> Should DFS outputstream's close wait forever?
> ---------------------------------------------
>
>                 Key: HADOOP-2757
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2757
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Raghu Angadi
>            Assignee: dhruba borthakur
>         Attachments: softMount1.patch, softMount1.patch, softMount2.patch, softMount3.patch
>
> Currently {{DFSOutputStream.close()}} waits forever if the Namenode keeps throwing {{NotYetReplicated}} exceptions, for whatever reason. It is pretty annoying for a user. Should the loop inside close have a timeout? If so, how much? It could probably be something like 10 minutes.
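First sketch (point 1): a minimal, self-contained illustration of the read-timeout semantics described above, not the patch's actual code; the host, port, and buffer handling are placeholders. SO_TIMEOUT bounds each individual read rather than the whole call, so a large response that keeps streaming bytes can run arbitrarily long, while 30 seconds of silence fails fast.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class InactivityTimeoutSketch {
  public static void main(String[] args) throws IOException {
    Socket socket = new Socket();
    // Placeholder address: not a real namenode endpoint.
    socket.connect(new InetSocketAddress("namenode.example.com", 8020), 30000);
    // SO_TIMEOUT bounds each individual read(), not the call as a whole:
    // a response that keeps streaming bytes may run arbitrarily long,
    // but 30 seconds of silence raises SocketTimeoutException.
    socket.setSoTimeout(30000);
    InputStream in = socket.getInputStream();
    byte[] buf = new byte[4096];
    try {
      int n;
      while ((n = in.read(buf)) != -1) {
        // consume n bytes of the (possibly very large) response
      }
    } catch (SocketTimeoutException e) {
      // No data for 30 seconds: assume the server is non-responsive.
      throw e;
    } finally {
      socket.close();
    }
  }
}
{code}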
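Second sketch (point 3): a rough illustration of the renewal arithmetic, with hypothetical names; the real leaseChecker differs. Because renewFrequency is derived as softMountTimeout/3 rather than configured independently, a single renewal period can never equal the soft-mount window, and only a sustained run of failed renewals trips the deadline.

{code:java}
// Hypothetical lease-renewal loop; names are illustrative, not the patch's.
public class LeaseRenewerSketch implements Runnable {
  private final long softMountTimeout; // ms, e.g. from dfs.softmount.timeout
  private final long renewFrequency;   // derived, not user-configured
  private volatile long lastRenewed;

  public LeaseRenewerSketch(long softMountTimeout) {
    this.softMountTimeout = softMountTimeout;
    // Renew three times per timeout window, so a single slow or missed
    // renewal cannot by itself satisfy "now > last + softMountTimeout".
    this.renewFrequency = softMountTimeout / 3;
    this.lastRenewed = System.currentTimeMillis();
  }

  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        renewLease(); // RPC to the namenode (stubbed out here)
        lastRenewed = System.currentTimeMillis();
      } catch (Exception e) {
        long now = System.currentTimeMillis();
        if (now > lastRenewed + softMountTimeout) {
          // Renewals have failed for a full soft-mount period: give up.
          break;
        }
      }
      try {
        Thread.sleep(renewFrequency);
      } catch (InterruptedException e) {
        break;
      }
    }
  }

  private void renewLease() { /* namenode RPC elided in this sketch */ }
}
{code}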
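Third sketch (point 3a): how a client might read the two knobs; the default values here are illustrative only, not what the patch ships.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SoftMountConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Per-RPC inactivity window: the longest silence tolerated while
    // waiting for response data on a single rpc call.
    long inactivityTimeout =
        conf.getLong("ipc.client.inactivity.timeout", 30 * 1000L);
    // Whole-operation deadline for FileSystem calls such as
    // DFSClient.close().
    long softMountTimeout =
        conf.getLong("dfs.softmount.timeout", 10 * 60 * 1000L);
    System.out.println("inactivity=" + inactivityTimeout
        + "ms softmount=" + softMountTimeout + "ms");
  }
}
{code}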
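Fourth sketch (point 4): a deadline-bounded close loop with hypothetical names, not the patch's actual code. The retry interval is an internal detail a developer can tune freely; the user-visible guarantee stays "close() returns or fails within the soft-mount timeout", which is exactly what a fixed retry count cannot promise.

{code:java}
import java.io.IOException;

public class TimedCloseSketch {
  // Internal detail; changing it does not change the user-visible bound.
  private long retryIntervalMs = 400;

  void waitForClose(long softMountTimeoutMs) throws IOException {
    long deadline = System.currentTimeMillis() + softMountTimeoutMs;
    while (!completeFileOnNamenode()) { // e.g. still NotYetReplicated
      if (System.currentTimeMillis() >= deadline) {
        throw new IOException(
            "close() did not complete within " + softMountTimeoutMs + " ms");
      }
      try {
        Thread.sleep(retryIntervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("interrupted while waiting for close");
      }
    }
  }

  private boolean completeFileOnNamenode() {
    return true; // stub: real code asks the namenode to complete the file
  }
}
{code}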