[ https://issues.apache.org/jira/browse/HDFS-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031649#comment-16031649 ]
Anu Engineer commented on HDFS-11886:
-------------------------------------

bq. If you are going to store state in DB, then we need to deal with rollback problems and more error handling code

You bring up excellent points. Let me seize this opportunity to look beyond what we are discussing. There are two versions of the rollback problem you are referring to. The first is the case where KSM itself fails: a createKey happens and, before commitKey is called, KSM goes down. For that case, what you propose does make our life easier in the short term.

However, there is another version of the same problem -- a createKey without a commitKey -- where a client creates a key and then fails to finish uploading the data. At that point KSM is left holding the metadata, and whether we keep that information in memory or in a database makes no difference, since the back-end action is the same either way: run a thread that cleans up createKeys that have stayed uncommitted for a long time, say 7 days. In other words, while the apparent complexity goes down if you focus only on KSM failure, we cannot get away from needing a cleanup thread. I am very open to any suggestions you might have on how we could tackle this without a background thread. If we are running a background thread anyway, then its complexity is independent of persistence.

Also, let us look at how an HA-enabled KSM would work. If we don't write this information down, then on a KSM failure all KSM metadata is lost, which means all in-progress uploads will fail. On the other hand, if we persist this information to a Raft ring, then the next leader will be able to find this data and respond to the commitKey request. So from an HA point of view, we have to persist this data.

[~vagarychen] pointed out to me in an off-line conversation that the loop -- where you allocate each block as needed -- can easily be unrolled.
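To make the cleanup-thread point concrete, here is a minimal sketch of the kind of expiry pass described above. All class and method names are illustrative, not the actual KSM API; the point is that this pass looks the same whether the open-key table is backed by memory or by a database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the background cleanup discussed above: expire
// createKey entries that were never committed within a window (7 days in
// the discussion).
public class OpenKeyCleaner {
  static final Duration EXPIRY = Duration.ofDays(7);

  // Maps key name -> creation time of an uncommitted createKey.
  private final Map<String, Instant> openKeys = new ConcurrentHashMap<>();

  public void keyCreated(String keyName) {
    openKeys.put(keyName, Instant.now());
  }

  public void keyCommitted(String keyName) {
    openKeys.remove(keyName);
  }

  // Runs periodically; removes entries older than EXPIRY and returns how
  // many were expired. ConcurrentHashMap iterators tolerate concurrent
  // removal, so this is safe to run alongside create/commit calls.
  public int expireStaleKeys(Instant now) {
    int removed = 0;
    for (Map.Entry<String, Instant> e : openKeys.entrySet()) {
      if (Duration.between(e.getValue(), now).compareTo(EXPIRY) > 0) {
        openKeys.remove(e.getKey());
        removed++;
      }
    }
    return removed;
  }
}
```

In a persisted design the same scan would simply iterate the on-disk table instead of the in-memory map, which is why the thread's complexity is independent of where the state lives.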
Since he knows the size of the key being put, he can preallocate all those blocks and make a single update to KSM metadata if needed. So while the algorithm looks like it does an I/O for each block being allocated, in reality we might be able to optimize that away.

Here are the salient points that compel me to log the KSM metadata to disk:
* HA needs it.
* A cleanup thread is needed anyway, even if we keep the metadata in memory.
* The I/O cost of the KSM metadata update can be reduced -- and even without that optimization, it is only one I/O per block of, say, 256 MB.

So I am still inclined to log this information rather than keep it in memory. I hope you can see the driving forces behind persisting it. Yes, I wholeheartedly agree that we take on higher I/O costs, and one of the assumptions of Ozone is that the KSM/SCM services have access to an SSD.

bq. And we need to think how to recover state from that when KSM restarted, is it necessary or not to recover uncommitted keys in this DB? It will be hard because we don't know which stage this key has been processed to.

The proposal is a two-phase operation: a key is put with a state -- say "under creation" -- and when commit is called it is marked as "created". So when we read the metadata back, we know the state of each key unambiguously. If we instead use a separate DB for in-progress keys, then every key in that DB is under creation. Either way, it is an easy problem to solve. When KSM restarts, or another leader is elected under HA, the state across KSMs will be the same. So persisting the information actually makes it easier for us to deal with inconsistencies and recovery.
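The two-phase put described above can be sketched as follows. The state names, table shape, and method signatures here are hypothetical placeholders standing in for the persisted KSM metadata, not the real schema.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative two-phase put: createKey records the key as UNDER_CREATION,
// commitKey flips it to CREATED. After a restart or HA failover, a recovery
// scan can tell the two states apart unambiguously.
public class KeyStateStore {
  enum KeyState { UNDER_CREATION, CREATED }

  // Stands in for the persisted KSM metadata table.
  private final Map<String, KeyState> table = new HashMap<>();

  // Phase 1: logged before the client uploads any data.
  public void createKey(String name) {
    table.put(name, KeyState.UNDER_CREATION);
  }

  // Phase 2: the client finished uploading; reject commits with no
  // matching createKey.
  public void commitKey(String name) {
    if (table.get(name) != KeyState.UNDER_CREATION) {
      throw new IllegalStateException("commit without matching createKey: " + name);
    }
    table.put(name, KeyState.CREATED);
  }

  // After restart/failover: keys still UNDER_CREATION are exactly the
  // candidates for the cleanup thread.
  public long pendingCount() {
    return table.values().stream()
        .filter(s -> s == KeyState.UNDER_CREATION).count();
  }
}
```

Under HA, the same table would live behind the Raft ring, so the newly elected leader sees the identical states and can serve the commitKey request.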
> Ozone : improve error handling for putkey operation
> ---------------------------------------------------
>
>                 Key: HDFS-11886
>                 URL: https://issues.apache.org/jira/browse/HDFS-11886
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ozone
>            Reporter: Chen Liang
>         Attachments: design-notes-putkey.pdf
>
>
> Ozone's putKey operation involves a couple of steps:
> 1. KSM calls allocateBlock to SCM and writes this info to KSM's local metastore.
> 2. The allocatedBlock gets returned to the client; the client checks whether the container needs to be created on the datanode and, if so, creates it.
> 3. The client writes the data to the container.
> It is possible that 1 succeeded but 2 or 3 failed; in this case there will be an entry in KSM's local metastore, but the key is actually nowhere to be found. We need to revert 1 if 2 or 3 failed in this case.
> To resolve this, we need at least two things to be implemented first:
> 1. We need deleteKey() to be added to KSM first.
> 2. We also need container reports to be implemented first, so that SCM can track whether the container was actually added.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
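The revert the issue asks for amounts to a compensating action: if steps 2 or 3 fail after KSM has already recorded the allocation, undo step 1 via deleteKey. The interfaces below are illustrative placeholders, not the real Ozone client API.

```java
// Hedged sketch of putKey error handling: allocate metadata first, and if
// the container/data write fails, revert the metadata entry so no orphan
// remains in KSM's metastore.
public class PutKeyWithRevert {
  interface Ksm {
    void allocateBlock(String key);  // step 1: records entry in KSM metastore
    void deleteKey(String key);      // compensating action for step 1
  }

  interface Datanode {
    void writeData(String key, byte[] data);  // steps 2-3: container + data
  }

  public static void putKey(Ksm ksm, Datanode dn, String key, byte[] data) {
    ksm.allocateBlock(key);
    try {
      dn.writeData(key, data);
    } catch (RuntimeException e) {
      ksm.deleteKey(key);  // revert step 1 so metastore matches reality
      throw e;
    }
  }
}
```

Container reports (the second prerequisite in the issue) would cover the case where the client itself dies before it can issue the revert, which is where the cleanup thread discussed in the comment comes back in.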