[ https://issues.apache.org/jira/browse/HDFS-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026534#comment-16026534 ]
Chen Liang commented on HDFS-11886:
-----------------------------------

Thanks [~anu] for looking at this! No decision has been made on this JIRA yet, so any thoughts are more than welcome.

To make sure we are on the same page: did you mean that the client sends a "commit" message to KSM after the key has been written to the datanode, and only then does KSM write the key to ksm.db?

If I understand this correctly, one consequence is that every successful putKey is guaranteed to make two calls to KSM: one to allocate the block and one to commit the key. If putKey fails, there is no commit, so only the first call is made. With the revert-failed-key approach, a successful putKey always makes a single call to KSM (to allocate the block), while a failed putKey makes two (the second one to revert the key). Assuming putKey is more likely to succeed than to fail, this seems to me a +1 for revert-failure.

However, another question is how we can be sure a key is actually finalized. For the commit-success approach this seems easy: unless the success flag is set, the key is considered not ready (similar to under construction). For the revert-failure approach, there is a temporary window in which a key has actually failed but, before it is reverted, may already have been read by someone. So this seems a +1 for the commit-success approach.

In short, this probably comes down to: do we favor fewer RPC calls, or do we favor a reliable getKey at any time?

> Ozone : improve error handling for putkey operation
> ---------------------------------------------------
>
>                 Key: HDFS-11886
>                 URL: https://issues.apache.org/jira/browse/HDFS-11886
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Chen Liang
>
> Ozone's putKey operation involves a couple of steps:
> 1. KSM calls allocateBlock on SCM and writes this info to KSM's local metastore.
> 2. The allocated block gets returned to the client; the client checks whether the container
> needs to be created on the datanode and, if yes, creates the container.
> 3. The client writes the data to the container.
> It is possible that 1 succeeded but 2 or 3 failed. In this case there will
> be an entry in KSM's local metastore, but the key is actually nowhere to be
> found. We need to revert 1 if 2 or 3 failed in this case.
> To resolve this, we need at least two things to be implemented first:
> 1. We need deleteKey() to be added to KSM.
> 2. We also need container reports to be implemented, such that SCM can
> track whether the container is actually added.
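
The trade-off weighed in the comment above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not Ozone code: `ToyKsm`, `KeyState`, `allocateBlockPending`, `allocateBlockVisible`, `commitKey`, `revertKey`, `isReadable`, and `rpcCount` are all hypothetical names, and the datanode write is reduced to a callback that reports success or failure.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Toy comparison of commit-success vs. revert-failure; every name here is hypothetical. */
public class PutKeySketch {

    enum KeyState { UNDER_CONSTRUCTION, COMMITTED }

    /** In-memory stand-in for KSM's local metastore. */
    static class ToyKsm {
        private final Map<String, KeyState> keyTable = new HashMap<>();
        int rpcCount = 0; // counts client-to-KSM calls only

        // commit-success: the entry is recorded but not readable until commitKey()
        void allocateBlockPending(String key) {
            rpcCount++;
            keyTable.put(key, KeyState.UNDER_CONSTRUCTION);
        }

        void commitKey(String key) {
            rpcCount++;
            keyTable.replace(key, KeyState.COMMITTED);
        }

        // revert-failure: the entry is readable as soon as the block is allocated
        void allocateBlockVisible(String key) {
            rpcCount++;
            keyTable.put(key, KeyState.COMMITTED);
        }

        void revertKey(String key) {
            rpcCount++;
            keyTable.remove(key);
        }

        boolean isReadable(String key) {
            return keyTable.get(key) == KeyState.COMMITTED;
        }
    }

    /** Two KSM calls on success, one on failure; the key is never readable early. */
    static boolean putKeyCommitOnSuccess(ToyKsm ksm, String key, BooleanSupplier datanodeWrite) {
        ksm.allocateBlockPending(key);
        boolean ok = datanodeWrite.getAsBoolean();
        if (ok) {
            ksm.commitKey(key);
        }
        // On failure the UNDER_CONSTRUCTION entry is left behind; cleanup is out of scope here.
        return ok;
    }

    /** One KSM call on success, two on failure; the key is readable before the revert. */
    static boolean putKeyRevertOnFailure(ToyKsm ksm, String key, BooleanSupplier datanodeWrite) {
        ksm.allocateBlockVisible(key);
        boolean ok = datanodeWrite.getAsBoolean();
        if (!ok) {
            ksm.revertKey(key);
        }
        return ok;
    }

    public static void main(String[] args) {
        ToyKsm commitKsm = new ToyKsm();
        putKeyCommitOnSuccess(commitKsm, "k1", () -> true);
        System.out.println("commit-success, successful put: KSM calls = " + commitKsm.rpcCount);

        ToyKsm revertKsm = new ToyKsm();
        // The callback runs inside the failure window and observes the half-written key.
        putKeyRevertOnFailure(revertKsm, "k1", () -> {
            System.out.println("revert-failure, readable during failed write: "
                + revertKsm.isReadable("k1"));
            return false;
        });
        System.out.println("revert-failure, failed put: KSM calls = " + revertKsm.rpcCount);
    }
}
```

In this model the callback passed as `datanodeWrite` runs between the allocate call and the commit or revert call, so it can observe exactly the window the comment describes: under revert-failure a failed key is briefly readable, while under commit-success it never is, at the cost of one extra KSM call per successful put.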