[ 
https://issues.apache.org/jira/browse/HDFS-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031649#comment-16031649
 ] 

Anu Engineer commented on HDFS-11886:
-------------------------------------

bq. If you are going to store state in DB, then we need to deal with rollback 
problems and more error handling code 

You bring up excellent points. Let me seize this opportunity to look beyond 
what we are discussing. There are two versions of the rollback problem that you 
are referring to. The first is the case where KSM fails: a createKey happens 
and, before commitKey is called, KSM fails. In that case, what you are 
proposing does make our life easier in the short term.

However, there is another version of this same problem -- createKey without a 
commitKey -- where a client creates a key and fails to finish uploading the 
data. At that point KSM is left holding on to this metadata. Whether we 
maintain that information in memory or in a database makes no difference, since 
the back-end action is the same: run a thread that cleans up createKeys without 
a commit that have been open for a long time, say 7 days.
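
Just to make that concrete, here is a minimal sketch of such a cleanup thread. 
The names (OpenKeyEntry, openKeys, the 7-day threshold) are illustrative 
placeholders, not the actual KSM classes:

{code:java}
// Illustrative sketch only; OpenKeyEntry and the table below are hypothetical
// stand-ins, not the real KSM metadata classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OpenKeyCleanupSketch {
  static class OpenKeyEntry {
    final String keyName;
    final long createTimeMs;
    OpenKeyEntry(String keyName, long createTimeMs) {
      this.keyName = keyName;
      this.createTimeMs = createTimeMs;
    }
  }

  // Stand-in for the "under creation" key table; whether it is in memory or
  // DB-backed, the sweep logic is the same.
  private final Map<String, OpenKeyEntry> openKeys = new ConcurrentHashMap<>();
  private final long expireThresholdMs = TimeUnit.DAYS.toMillis(7);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start() {
    // Periodically sweep createKeys that were never committed.
    scheduler.scheduleWithFixedDelay(this::sweep, 1, 1, TimeUnit.HOURS);
  }

  private void sweep() {
    long now = System.currentTimeMillis();
    openKeys.values().removeIf(e -> now - e.createTimeMs > expireThresholdMs);
    // A real implementation would also release the allocated blocks back to SCM.
  }
}
{code}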

In other words, while the apparent complexity is reduced if you focus only on 
KSM failure, we just cannot get away from needing a cleanup thread. I am very 
open to any suggestions you might have on how we can tackle this without a 
background thread. If we are doing a background thread anyway, then its 
complexity is independent of persistence.

Also, let us look at how an HA-enabled KSM would work. If we don't write this 
information down, then in case of a KSM failure all KSM metadata is lost, which 
means all in-progress uploads will fail. On the other hand, if we persist this 
information to a Raft ring, then the next leader will be able to find this data 
and respond to the commitKey request. So from an HA point of view, we have to 
persist this data.
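
A rough sketch of that failover argument, with an entirely hypothetical 
ReplicatedStore interface standing in for whatever Raft-backed storage KSM 
would actually use (none of these names are real KSM or Ratis APIs):

{code:java}
// Hypothetical interfaces; the only point is that the open-key record survives
// a leader failover because it was written through the replicated log.
interface ReplicatedStore {
  void put(String key, byte[] value);   // applied on every KSM replica via Raft
  byte[] get(String key);
}

class HaKeyManagerSketch {
  private final ReplicatedStore store;
  HaKeyManagerSketch(ReplicatedStore store) { this.store = store; }

  void createKey(String keyName, byte[] openKeyInfo) {
    // Persist the "under creation" record through the Raft ring.
    store.put("/open/" + keyName, openKeyInfo);
  }

  void commitKey(String keyName) {
    // A newly elected leader can still find the open-key record and finish the
    // commit, because the record was replicated rather than held only in the
    // failed leader's memory.
    byte[] openKeyInfo = store.get("/open/" + keyName);
    if (openKeyInfo == null) {
      throw new IllegalStateException("No in-progress createKey for " + keyName);
    }
    store.put("/keys/" + keyName, openKeyInfo);
  }
}
{code}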

[~vagarychen] in an off-line conversation pointed out to me that the loop -- 
where you allocate each block as needed -- can easily be unrolled. Since he 
knows the size of the key that is being put, he can preallocate all those 
blocks and make a single update to the KSM metadata if needed. So while the 
algorithm looks like we are going to do an I/O for each block being allocated, 
in reality we might be able to optimize that away.
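
To illustrate the unrolling, here is a hedged sketch; allocateBlock, 
putKeyBlocks and the 256 MB constant are placeholders, not the actual SCM/KSM 
client signatures:

{code:java}
// Illustrative only: shows batching the per-block metadata updates into one,
// not the actual KSM/SCM client APIs.
import java.util.ArrayList;
import java.util.List;

class BlockPreallocationSketch {
  static final long BLOCK_SIZE = 256L * 1024 * 1024; // 256 MB, per the discussion

  interface ScmClient {
    String allocateBlock(long size);                          // returns a block id
  }

  interface KsmMetadataStore {
    void putKeyBlocks(String keyName, List<String> blockIds); // one metadata I/O
  }

  // Because the client declares the key size up front, every block can be
  // allocated before the write starts and recorded with a single metadata
  // update, instead of one update per allocated block.
  void createKey(String keyName, long keySize, ScmClient scm, KsmMetadataStore ksm) {
    long numBlocks = (keySize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    List<String> blockIds = new ArrayList<>();
    for (long i = 0; i < numBlocks; i++) {
      long remaining = keySize - i * BLOCK_SIZE;
      blockIds.add(scm.allocateBlock(Math.min(BLOCK_SIZE, remaining)));
    }
    ksm.putKeyBlocks(keyName, blockIds);
  }
}
{code}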

So here are the salient points that compel me to log the KSM metadata to disk:

* HA needs it.
* A cleanup thread is needed anyway, even if we keep the metadata in memory.
* It is possible to reduce the I/O cost of the KSM metadata update -- and even 
otherwise it is only one I/O per block of, say, 256 MB.

So I am still inclined to log this information rather than keep it in memory. 
I hope you are able to see the driving forces behind why we need to persist 
this information. Yes, I wholeheartedly agree that we incur higher I/O costs, 
and one of the assumptions of Ozone is that the KSM/SCM services have access 
to an SSD.

bq. And we need to think how to recover state from that when KSM restarted, is 
it necessary or not to recover uncommitted keys in this DB? It will be hard 
because we don't know which stage this key has been processed to.

The proposal is to have a 2-phase operation: a key is put with a state -- say 
"under creation" -- and when commit is called it is marked as "created". So 
when we read back, we know exactly what the state of each key is. If we are 
using a separate DB, then all keys in that DB are under creation. Either way, 
it is an easy problem to solve. So when KSM is restarted, or when another 
leader is elected due to HA, the state on the KSMs will be the same. Persisting 
the information actually makes it easier for us to deal with inconsistencies 
and recovery.
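
To make the read-back concrete, here is a minimal sketch of the two-phase 
marking, assuming a hypothetical KeyState enum and an in-memory map standing 
in for the KSM DB:

{code:java}
// Sketch of the two-phase put; KeyState and the map are illustrative stand-ins
// for whatever the real KSM DB schema ends up being.
import java.util.HashMap;
import java.util.Map;

class TwoPhaseKeySketch {
  enum KeyState { UNDER_CREATION, CREATED }

  private final Map<String, KeyState> keyTable = new HashMap<>();

  void createKey(String keyName) {
    keyTable.put(keyName, KeyState.UNDER_CREATION);  // phase 1: recorded as open
  }

  void commitKey(String keyName) {
    if (keyTable.get(keyName) != KeyState.UNDER_CREATION) {
      throw new IllegalStateException(keyName + " has no pending createKey");
    }
    keyTable.put(keyName, KeyState.CREATED);          // phase 2: visible to readers
  }

  // After a restart or a failover, the state of every key is unambiguous:
  // anything still UNDER_CREATION is an uncommitted put and can be handed to
  // the cleanup thread described above.
  boolean needsCleanup(String keyName) {
    return keyTable.get(keyName) == KeyState.UNDER_CREATION;
  }
}
{code}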

> Ozone : improve error handling for putkey operation
> ---------------------------------------------------
>
>                 Key: HDFS-11886
>                 URL: https://issues.apache.org/jira/browse/HDFS-11886
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Chen Liang
>         Attachments: design-notes-putkey.pdf
>
>
> Ozone's putKey operations involve a couple steps:
> 1. KSM calls allocateBlock to SCM, writes this info to KSM's local metastore
> 2. the allocated block gets returned to the client; the client checks whether 
> the container needs to be created on the datanode and, if yes, creates the 
> container
> 3. the client writes the data to the container.
> It is possible that 1 succeeded but 2 or 3 failed; in this case there will 
> be an entry in KSM's local metastore, but the key is actually nowhere to be 
> found. We need to revert 1 if 2 or 3 failed in this case. 
> To resolve this, we need at least two things to be implemented first.
> 1. We need deleteKey() to be added to KSM first. 
> 2. We also need container reports to be implemented first such that SCM can 
> track whether the container is actually added.


