[ https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480651#comment-13480651 ]
Todd Lipcon commented on HDFS-2802:
-----------------------------------
bq. As regards to consistency (comment 7), a system where snapshot is taken at
the namespace without involving data layer cannot provide strong consistency
guarantee. I also think it may not be relevant where writers are different from
the client that is taking the snapshot. Not sure what guarantee such a client
can expect/depend on given writers are separate.
There's a continuum between no consistency and strong consistency. In
particular, I think the consistency we should seek to provide is "causal
consistency from the perspective of any single writer". Aaron outlined an
example earlier which is useful for discussing the consistency model:
1) Client issues a write to a file (eg an HBase HLog)
2) Client modifies namespace (eg creates a file)
3) Client issues another write and hflushes
4) Client modifies namespace again
Because a single client issued all 4 operations, we'd like to ensure that any
snapshot, when restored, reflects a prefix of these operations - eg just #1, or
1-2, 1-3, or 1-4 - and never a set with a gap, such as #1 and #3 without #2.
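To make the guarantee concrete, here is a rough sketch of the property as a
check over one writer's operations. The function name and the operation labels
are made up for illustration, and I am assuming that an empty prefix (a
snapshot taken before #1) also counts as consistent.

```python
def is_causal_prefix(issued, restored):
    """Return True if `restored` is a prefix of `issued`, in issue order.

    `issued` is the ordered list of operations from a single writer;
    `restored` is the ordered list of those operations visible in a snapshot.
    """
    restored = list(restored)
    return restored == list(issued[:len(restored)])


# The four operations from the example above (labels are hypothetical).
ops = ["write-1", "create-file", "write-2+hflush", "modify-namespace"]

print(is_causal_prefix(ops, ops[:3]))            # operations 1-3: consistent
print(is_causal_prefix(ops, [ops[0], ops[2]]))   # 1 and 3 without 2: not
```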
This isn't trivial to implement, but it's also reasonably doable. After some
discussion with Aaron and Colin we came up with a design that does provide
this. Here's a sketch:
- When the client does any namespace modification, the response from the
NameNode returns the transaction ID used to record the modification in the
transaction log. (eg the txn of the OP_MKDIR for a mkdir). The client remembers
the max transaction ID it has seen in a variable inside DFSClient
- When the client issues hflush(), it includes the max transaction ID it has
seen in the data packet. This establishes a causality relationship between the
namespace layer and the data layer.
- The datanode maintains a simple (and compact) data structure for open files:
for each byte offset in an open file, keep the transaction ID associated with
the write packet that wrote it. This data structure will be small: it only
needs to be maintained for open files and, since most writers don't often
interleave namespace and data access, it changes rarely. We only need to store
entries in the data structure at the byte offsets where the txid changes.
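The bookkeeping above could look roughly like the following sketch. It is a
Python model rather than actual HDFS code: the class and method names are
hypothetical, and it assumes every data packet carries the client's current max
txid and that txids never decrease within one file (a single writer). The map
stores one entry per offset where the txid changes, so a writer that rarely
interleaves namespace and data operations produces very few entries.

```python
import bisect


class ClientTxidTracker:
    """Hypothetical stand-in for the max-txid variable inside DFSClient."""

    def __init__(self):
        self.max_seen_txid = 0

    def on_namenode_response(self, txid):
        # Remember the highest transaction ID returned by the NameNode.
        self.max_seen_txid = max(self.max_seen_txid, txid)


class OpenFileTxidMap:
    """Sparse per-open-file map kept on the datanode (illustrative only)."""

    def __init__(self):
        self._txids = []   # strictly increasing txids seen in packets
        self._starts = []  # byte offset at which each txid first appeared
        self._length = 0   # current length of the open file/block

    def record_packet(self, txid, num_bytes):
        # Store a new entry only when the txid changes (assumed to increase).
        if not self._txids or txid > self._txids[-1]:
            self._txids.append(txid)
            self._starts.append(self._length)
        self._length += num_bytes

    def length_at(self, snapshot_txid):
        """Bytes written by packets tagged with a txid <= snapshot_txid."""
        i = bisect.bisect_right(self._txids, snapshot_txid)
        if i == len(self._txids):
            return self._length   # nothing newer than the snapshot
        return self._starts[i]    # cut where the first newer txid starts
```

For example, after 100 bytes tagged with txid 0, 50 bytes tagged with txid 7
and 25 bytes tagged with txid 12, length_at(5) gives 100 and length_at(7)
gives 150, even though the file is now 175 bytes long.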
The process of creating a consistent snapshot then proceeds as follows:
- when the snapshot is created, it is assigned a transaction ID (eg the
transaction id of the OP_MAKE_SNAPSHOT or whatever it is going to be called)
- the Snapshot initially starts in an "in progress" state
- The NN enqueues a command to all datanodes:
ReportSnapshotLengths(snapshot_txid)
- The DNs, upon receiving this command, look at their local data structures to
determine the length of open blocks at the given transaction ID, and report
them back. (even if the length has since grown longer)
- Once the NN has received a reported length for each of the in-progress
blocks, it uses those lengths for the completed snapshot and marks it finalized.
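The snapshot steps above can be modelled with a small sketch like this one. It
is not HDFS code: Snapshot, FakeDataNode and the block IDs are invented for
illustration, the datanode side is reduced to a per-block packet history, and
the sketch simply takes the first length reported for each block (how to
reconcile differing reports from several replicas is left open here).

```python
class Snapshot:
    """NameNode-side state for one snapshot (hypothetical model)."""

    def __init__(self, txid, open_blocks):
        self.txid = txid                 # txid assigned at snapshot creation
        self.pending = set(open_blocks)  # in-progress blocks awaiting a report
        self.lengths = {}                # block id -> length as of self.txid
        self.state = "in progress" if self.pending else "finalized"

    def on_report(self, reported):
        # Record the first reported length for each block still pending.
        for block_id, length in reported.items():
            if block_id in self.pending:
                self.lengths[block_id] = length
                self.pending.discard(block_id)
        if not self.pending:
            self.state = "finalized"


class FakeDataNode:
    """Datanode stand-in answering ReportSnapshotLengths(snapshot_txid)."""

    def __init__(self, history):
        # history: block id -> list of (txid, cumulative length after packet),
        # in write order with non-decreasing txids.
        self.history = history

    def report_snapshot_lengths(self, snapshot_txid):
        report = {}
        for block_id, packets in self.history.items():
            length = 0
            for txid, cumulative in packets:
                if txid > snapshot_txid:
                    break  # written after the snapshot's causal cut
                length = cumulative
            report[block_id] = length
        return report


dn1 = FakeDataNode({"blk_1": [(0, 100), (7, 150), (12, 175)]})
dn2 = FakeDataNode({"blk_2": [(3, 64)]})
snap = Snapshot(txid=10, open_blocks=["blk_1", "blk_2"])
for dn in (dn1, dn2):
    snap.on_report(dn.report_snapshot_lengths(snap.txid))
print(snap.state, snap.lengths)
```

Here blk_1 is reported at 150 bytes rather than its current 175, because the
last 25 bytes were written after the client had seen txid 12, which is newer
than the snapshot.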
Back of the envelope math seems to indicate that tracking the mapping of txid
to length is feasible, and we think this provides the above consistency
guarantee, which is much stronger than what has been proposed in the design
document. The implementation isn't trivial but also doesn't seem out of reach.
Hope this can spur some discussion on how we can offer stronger semantics.
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
> Key: HDFS-2802
> URL: https://issues.apache.org/jira/browse/HDFS-2802
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, name-node
> Reporter: Hari Mankude
> Assignee: Hari Mankude
> Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point in time images of parts of the filesystem or the entire
> filesystem. Snapshots can be a read-only or a read-write point in time copy
> of the filesystem. There are several use cases for snapshots in HDFS. I will
> post a detailed write-up soon with more information.