[
https://issues.apache.org/jira/browse/HDFS-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026534#comment-16026534
]
Chen Liang commented on HDFS-11886:
-----------------------------------
Thanks [~anu] for looking at this! No decision has been made on this JIRA yet,
so any thoughts are more than welcome.
To make sure we are on the same page: did you mean that the client could send
a "commit" message to KSM after the key has been written to the datanode, and
only then would KSM write the key to ksm.db?
If I understand this correctly, one consequence of this approach is that every
successful putKey always makes two calls to KSM: one to allocate the block and
one to commit the key. If putKey fails, there is no commit, so only the first
call is made. With the revert-failed-key approach, a successful putKey makes
only one call to KSM (to allocate the block), while a failed putKey makes two
(the second to revert the key). Assuming putKey is more likely to succeed than
to fail, this seems to me a +1 for revert-failed-key.
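To make the call-count comparison concrete, here is a rough sketch of the
expected number of KSM calls per putKey under each approach. The function name
and the success probability parameter are purely illustrative assumptions; the
only inputs taken from the discussion are the per-outcome call counts (2/1 for
commit-success, 1/2 for revert-failed-key).

```python
def expected_ksm_calls(p_success: float) -> dict:
    """Expected KSM calls per putKey, given the probability that putKey succeeds.

    Illustrative model only: it counts calls to KSM and ignores retries,
    latency, and any calls to SCM or datanodes.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    p_fail = 1.0 - p_success
    return {
        # commit-success: allocate + commit on success, allocate only on failure
        "commit_success": 2 * p_success + 1 * p_fail,
        # revert-failed-key: allocate only on success, allocate + revert on failure
        "revert_failed_key": 1 * p_success + 2 * p_fail,
    }
```

Under this model the two approaches break even at a 50% success rate, and
revert-failed-key makes fewer calls whenever putKey succeeds more often than
it fails.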
However, the other question is how we can be sure a key is actually finalized.
For the commit-success approach this seems easy: until the success flag is set,
the key is considered not ready (similar to under construction). For the
revert-failed-key approach, there is a temporary window in which a key has
actually failed but, before it is reverted, may already have been read by
someone. So this seems a +1 for the commit-success approach.
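The visibility window can be illustrated with a toy in-memory model. Everything
below (the ToyKsm class, its method names, the require_commit flag) is a
hypothetical stand-in and not the real KSM API; it only models whether getKey
can observe an entry whose data was never written.

```python
class ToyKsm:
    """Hypothetical in-memory stand-in for KSM's metastore (not the real API)."""

    def __init__(self, require_commit: bool):
        # require_commit=True models commit-success;
        # require_commit=False models revert-failed-key.
        self.require_commit = require_commit
        self._entries = {}

    def allocate_block(self, key: str) -> str:
        block = f"block-for-{key}"
        # Without a commit step, the entry is readable as soon as it is allocated.
        self._entries[key] = {"block": block, "committed": not self.require_commit}
        return block

    def commit_key(self, key: str) -> None:
        self._entries[key]["committed"] = True

    def delete_key(self, key: str) -> None:
        self._entries.pop(key, None)

    def get_key(self, key: str):
        entry = self._entries.get(key)
        if entry is None or not entry["committed"]:
            return None  # missing or not yet committed: treated as not ready
        return entry["block"]


def failed_put_visibility(require_commit: bool):
    """Simulate a putKey whose datanode write fails after allocation.

    Returns what a concurrent reader sees before and after the client's
    follow-up call (if any).
    """
    ksm = ToyKsm(require_commit)
    ksm.allocate_block("k1")
    # ... the write to the datanode fails here ...
    seen_before = ksm.get_key("k1")
    if not require_commit:
        ksm.delete_key("k1")  # revert-failed-key: second call to KSM
    seen_after = ksm.get_key("k1")
    return seen_before, seen_after
```

In this sketch, failed_put_visibility(False) exposes the dangling entry to a
reader until the revert lands, while failed_put_visibility(True) never does,
at the cost of leaving an uncommitted entry behind that would need cleaning up
some other way.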
In short, this probably comes down to a trade-off: do we favor fewer RPC calls,
or do we favor a reliable getKey at all times?
> Ozone : improve error handling for putkey operation
> ---------------------------------------------------
>
> Key: HDFS-11886
> URL: https://issues.apache.org/jira/browse/HDFS-11886
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Reporter: Chen Liang
>
> Ozone's putKey operation involves several steps:
> 1. KSM calls allocateBlock on SCM and writes this info to KSM's local metastore.
> 2. The allocated block is returned to the client; the client checks whether
> the container needs to be created on the datanode and, if so, creates it.
> 3. The client writes the data to the container.
> It is possible that step 1 succeeds but step 2 or 3 fails. In this case there
> will be an entry in KSM's local metastore, but the key is actually nowhere to
> be found. We need to revert step 1 if step 2 or 3 fails.
> To resolve this, we need at least two things to be implemented first:
> 1. We need deleteKey() to be added to KSM.
> 2. We also need container reports to be implemented so that SCM can
> track whether the container has actually been added.