[ https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480651#comment-13480651 ]

Todd Lipcon commented on HDFS-2802:
-----------------------------------

bq. As regards consistency (comment 7), a system where the snapshot is taken at 
the namespace level without involving the data layer cannot provide a strong 
consistency guarantee. I also think it may not be relevant where the writers are 
different from the client that is taking the snapshot. Not sure what guarantee 
such a client can expect/depend on given that the writers are separate.

There's a continuum between no consistency and strong consistency. In 
particular, I think the consistency we should seek to provide is "causal 
consistency from the perspective of any single writer". Aaron outlined an 
example earlier which is useful for discussing the consistency model:

1) Client issues a write to a file (eg an HBase HLog)
2) Client modifies namespace (eg creates a file)
3) Client issues another write and hflushes
4) Client modifies namespace again

Because a single client issued all four operations, we'd like to ensure that any 
snapshot, when restored, contains a prefix of these operations - eg just #1, or 
#1-2, #1-3, or #1-4.
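
For concreteness, here is roughly what that sequence looks like against the 
public FileSystem API (the paths and payloads are made up; the point is only 
the interleaving of data and namespace operations by one writer):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterSequence {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // 1) write to an open file (think of an HBase HLog) and flush it
    FSDataOutputStream hlog = fs.create(new Path("/hbase/hlog.0001"));
    hlog.write("edit-1".getBytes());
    hlog.hflush();

    // 2) namespace modification (eg create a file)
    fs.create(new Path("/hbase/marker-1")).close();

    // 3) another write, hflushed
    hlog.write("edit-2".getBytes());
    hlog.hflush();

    // 4) another namespace modification
    fs.mkdirs(new Path("/hbase/region-A"));

    hlog.close();
  }
}
{code}

A snapshot with the per-writer causal guarantee may cut this history anywhere, 
but only at a prefix: it must never contain, say, the file created in #2 
without the hflushed bytes from #1.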

This isn't trivial to implement, but it's also reasonably doable. After some 
discussion with Aaron and Colin we came up with a design that does provide 
this. Here's a sketch:

- When the client performs any namespace modification, the NameNode's response 
includes the transaction ID used to record the modification in the transaction 
log (eg the txid of the OP_MKDIR for a mkdir). The client remembers the max 
transaction ID it has seen in a variable inside DFSClient.
- When the client issues hflush(), it puts that transaction ID into the data 
packet. This establishes a causality relationship between the namespace layer 
and the data layer.
- The datanode maintains a simple (and compact) data structure for open files: 
for each byte offset in an open file, it keeps the transaction ID associated 
with the write packet that wrote it. This data structure will be small: it only 
needs to be maintained for open files and, since most writers don't often 
interleave namespace and data access, it changes rarely. We only need to store 
entries at the byte offsets where the txid changes (a rough sketch of this 
client- and datanode-side state follows this list).
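
To make the bookkeeping concrete, here is a minimal sketch of the two pieces of 
state involved. All class, field, and method names below are invented for 
illustration; nothing like this exists in the current code base:

{code:java}
import java.util.TreeMap;

// Client side: DFSClient would remember the highest txid returned by any
// namespace RPC and stamp it onto each packet it hflushes.
class TxidTrackingClientState {
  private long lastSeenTxid = 0;

  void onNamespaceResponse(long txidFromNameNode) {
    lastSeenTxid = Math.max(lastSeenTxid, txidFromNameNode);
  }

  long txidForNextPacket() {
    return lastSeenTxid;              // carried in the hflush'd data packet
  }
}

// Datanode side: per open replica, record only the offsets at which the
// writer's observed txid changed. Offsets and txids are both non-decreasing,
// so the map stays tiny unless namespace ops and hflushes interleave heavily.
class OpenReplicaTxidIndex {
  private final TreeMap<Long, Long> txidAtOffset = new TreeMap<Long, Long>(); // offset -> txid
  private long currentLength = 0;

  synchronized void recordPacket(long startOffset, long endOffset, long writerTxid) {
    if (txidAtOffset.isEmpty() || txidAtOffset.lastEntry().getValue() != writerTxid) {
      txidAtOffset.put(startOffset, writerTxid);  // entry only where the txid changes
    }
    currentLength = Math.max(currentLength, endOffset);
  }
}
{code}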

The process of creating a consistent snapshot then proceeds as follows:
- When the snapshot is created, it is assigned a transaction ID (eg the 
transaction ID of the OP_MAKE_SNAPSHOT or whatever it ends up being called).
- The snapshot initially starts in an "in progress" state.
- The NN enqueues a command to all datanodes: 
ReportSnapshotLengths(snapshot_txid).
- The DNs, upon receiving this command, look at their local data structures to 
determine the length of each open block at the given transaction ID and report 
those lengths back, even if the blocks have since grown longer (see the lookup 
sketch after this list).
- Once the NN has received a reported length for every in-progress block, it 
uses those lengths for the completed snapshot and marks it finalized.
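
Continuing the hypothetical OpenReplicaTxidIndex sketched above, answering 
ReportSnapshotLengths is a single scan over that small map; a sketch:

{code:java}
// Datanode side: the replica's length "as of" the snapshot is everything the
// writer flushed before it had observed any namespace txid > snapshotTxid.
synchronized long lengthAtTxid(long snapshotTxid) {
  for (java.util.Map.Entry<Long, Long> e : txidAtOffset.entrySet()) {
    if (e.getValue() > snapshotTxid) {
      return e.getKey();        // first offset written past the snapshot point
    }
  }
  return currentLength;         // every byte was written at or before the snapshot txid
}
{code}

Because the writer's observed txid is non-decreasing in the byte offset, 
truncating the replica at the first offset whose txid exceeds the snapshot's 
txid is exactly what yields the per-writer prefix property described above.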

Back-of-the-envelope math suggests that tracking the mapping of txid to length 
is feasible, and we think this provides the consistency guarantee described 
above, which is much stronger than what has been proposed in the design 
document. The implementation isn't trivial, but it also doesn't seem out of reach.
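
(As a purely illustrative data point: at, say, 16 bytes per (offset, txid) 
entry, even a writer that performs a namespace operation before every hflush, 
once per second, accumulates only 3600 entries, roughly 56 KB, per open file 
over an hour; typical writers would accumulate far fewer.)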

Hope this can spur some discussion on how we can offer stronger semantics.
                
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
>                 Key: HDFS-2802
>                 URL: https://issues.apache.org/jira/browse/HDFS-2802
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: data-node, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>         Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point in time images of parts of the filesystem or the entire 
> filesystem. Snapshots can be a read-only or a read-write point in time copy 
> of the filesystem. There are several use cases for snapshots in HDFS. I will 
> post a detailed write-up soon with more information.
