[
https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480803#comment-13480803
]
Todd Lipcon commented on HDFS-2802:
-----------------------------------
bq. The design we are proposing is to let the DNs send the length. The last
known length is what goes into the snapshot, instead of recording zero length
for a block under construction or having to initiate communication with the
datanodes / implicitly getting it from a DN. From what I have heard from some
HBase folks, 5-10 seconds of lag should be workable for them. That is why I
want to talk to a few HBase folks in the design review.
I hope I qualify as an HBase folk?
5-10 seconds of lag on the *data* is probably fine. But inconsistency between
data and namespace modifications is a lot tougher. Consider for example an
application which uses a write-ahead log on HDFS to make a group of namespace
modifications consistent. See HBASE-2231 for an example of a place where we
currently have a dataloss bug for which the proposed fix is exactly this:
1. Write new files (compaction result)
2. Write to WAL that compaction is finished
3. Delete old files (compaction sources)
On recovery, if we see the "compaction finished" entry in the WAL, then we
"roll forward" the transaction and delete the sources. But if the snapshot
doesn't preserve the ordering of the above operations, we risk seeing
"compaction finished" in the WAL while the namespace doesn't yet have the new
files, which would result in an accidental deletion of a bunch of data.
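To make the hazard concrete, here is a minimal sketch of the recovery decision
described above. All names (CompactionRecovery, safeToRollForward, the marker
strings) are hypothetical, not the real HBase code; the point is only that
roll-forward must check both the WAL *and* the namespace, because a snapshot
that reorders the three steps can present the WAL entry without the new files.

```java
import java.util.Set;

public class CompactionRecovery {
    // Hypothetical recovery check: rolling forward (deleting the compaction
    // sources) is only safe if the WAL says the compaction finished AND the
    // namespace actually contains the compaction result.
    static boolean safeToRollForward(Set<String> wal, Set<String> namespace) {
        return wal.contains("COMPACTION_FINISHED")
            && namespace.contains("compacted-file");
    }

    public static void main(String[] args) {
        Set<String> wal = Set.of("COMPACTION_FINISHED");

        // Consistent snapshot: WAL entry and result file both captured.
        assert safeToRollForward(wal, Set.of("compacted-file"));

        // Inconsistent snapshot: WAL entry captured, result file missing.
        // Blindly rolling forward here would delete the sources while the
        // result does not exist -> data loss.
        assert !safeToRollForward(wal, Set.of("source-file-1", "source-file-2"));
        System.out.println("ok");
    }
}
```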
So I think we need a way to provide barriers between namespace and data layer
modifications. The proposal I made above should achieve this.
Another option is something that we've called "super flush". This would be a
flag on hflush() or hsync() indicating that the new length of the file needs to
be persisted to the NameNode, not just the datanodes. It would be used by
applications like HBase to determine consistency points for file lengths.
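The "super flush" idea can be sketched roughly as follows. SyncFlag,
SketchOutputStream, and the field names are all illustrative assumptions, not
real HDFS APIs; the sketch only shows the distinction between a length that is
durable on the datanodes and one that is also recorded at the NameNode.

```java
import java.util.EnumSet;

public class SuperFlushSketch {
    // Hypothetical flag: also persist the file length to the NameNode.
    enum SyncFlag { UPDATE_LENGTH }

    static class SketchOutputStream {
        long dnPersistedLength = 0;  // length durable on the datanodes
        long nnPersistedLength = 0;  // length recorded at the NameNode
        long written = 0;

        void write(byte[] b) { written += b.length; }

        // Ordinary hsync makes the data durable on the DNs, but the NN may
        // still hold a stale (even zero) length for the UC block. With the
        // flag set, the new length becomes a NameNode-visible consistency
        // point that a snapshot could rely on.
        void hsync(EnumSet<SyncFlag> flags) {
            dnPersistedLength = written;
            if (flags.contains(SyncFlag.UPDATE_LENGTH)) {
                nnPersistedLength = written;  // the "super" part
            }
        }
    }

    public static void main(String[] args) {
        SketchOutputStream out = new SketchOutputStream();
        out.write(new byte[100]);
        out.hsync(EnumSet.noneOf(SyncFlag.class));
        assert out.nnPersistedLength == 0;    // NN still behind the DNs
        out.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        assert out.nnPersistedLength == 100;  // consistency point established
        System.out.println("ok");
    }
}
```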
bq. In fact, communication with DNs when snapshots are being taken will make
the process of taking snapshots very slow while giving very little additional
benefit.
We should distinguish between two types of slowness for snapshots:
1) Slowness while holding a lock. This is unacceptable IMO - we must hold the
lock for a bounded amount of time and never make an RPC while holding the lock.
2) Slowness before a snapshot is available for restore. This is acceptable. For
example, if the user operation "create snapshot" holds the lock for 10ms, but
the snapshot is initially in a "COLLECTING_LENGTHS" state while it waits for
block lengths, that seems acceptable. So long as the lengths are filled in by
the next heartbeat (or two heartbeats from now) it should be complete (and thus
ready for recovery) within the minute. Note that we don't need to wait for a
heartbeat from every datanode. Instead, we just need to wait until, for each
under-construction block in the snapshotted area, _one_ of its replicas has
reported. When snapshotting a subtree without any open files, it would still be
instant.
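The COLLECTING_LENGTHS idea above can be sketched like this (class, state, and
method names are hypothetical, not NameNode code): the snapshot tracks the set
of under-construction blocks it covers and becomes COMPLETE as soon as one
replica of each has reported a length, with no waiting at all when the subtree
has no open files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SnapshotLengthCollector {
    enum State { COLLECTING_LENGTHS, COMPLETE }

    private final Map<Long, Long> lengths = new HashMap<>(); // blockId -> length
    private final Set<Long> pending;  // UC blocks still awaiting any replica report
    private State state;

    SnapshotLengthCollector(Set<Long> ucBlocks) {
        pending = new HashSet<>(ucBlocks);
        // A subtree with no open files needs no reports: instantly complete.
        state = pending.isEmpty() ? State.COMPLETE : State.COLLECTING_LENGTHS;
    }

    // Called when ANY replica of blockId reports via heartbeat/block report;
    // one replica per block suffices, later replicas are ignored.
    void onReplicaReport(long blockId, long length) {
        if (pending.remove(blockId)) {
            lengths.put(blockId, length);
            if (pending.isEmpty()) state = State.COMPLETE;
        }
    }

    State state() { return state; }

    public static void main(String[] args) {
        SnapshotLengthCollector s = new SnapshotLengthCollector(Set.of(1L, 2L));
        assert s.state() == State.COLLECTING_LENGTHS;
        s.onReplicaReport(1L, 4096);
        s.onReplicaReport(1L, 4096);         // duplicate replica: ignored
        assert s.state() == State.COLLECTING_LENGTHS;
        s.onReplicaReport(2L, 8192);
        assert s.state() == State.COMPLETE;  // ready for restore/recovery
        System.out.println("ok");
    }
}
```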
bq. Additionally, including the sizes of non-finalized blocks in snapshots has
implication that if the client dies and the non-finalized section is discarded,
then snapshot might have pointers to non-existent blocks.
I don't understand what you mean here...can you be more specific about the
scenario?
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
> Key: HDFS-2802
> URL: https://issues.apache.org/jira/browse/HDFS-2802
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, name-node
> Reporter: Hari Mankude
> Assignee: Hari Mankude
> Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point in time images of parts of the filesystem or the entire
> filesystem. Snapshots can be a read-only or a read-write point in time copy
> of the filesystem. There are several use cases for snapshots in HDFS. I will
> post a detailed write-up soon with with more information.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira