[
https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480803#comment-13480803
]
Todd Lipcon commented on HDFS-2802:
-----------------------------------
bq. The design we are proposing is to let the DNs send the length. The last
known length is what goes into the snapshot, instead of recording zero length
for a block under construction or having to initiate communication with the
datanodes / implicitly getting it from a DN. From what I have heard from some
HBase folks, 5-10 seconds of lag should be workable for them. That is why I
want to talk to a few HBase folks in the design review.
I hope I qualify as an HBase folk?
5-10 seconds of lag on the *data* is probably fine. But inconsistency between
data and namespace modifications is a lot tougher. Consider for example an
application which uses a write-ahead log on HDFS to make a group of namespace
modifications consistent. See HBASE-2231 for an example of a place where we
currently have a dataloss bug for which the proposed fix is exactly this:
1. Write new files (compaction result)
2. Write to WAL that compaction is finished
3. Delete old files (compaction sources)
On recovery, if we see the "compaction finished" entry in the WAL, then we
"roll forward" the transaction and delete the sources. But if the snapshot
doesn't preserve the ordering of the above operations, we risk seeing
"compaction finished" in the WAL while the namespace doesn't yet have the new
files, which would result in an accidental deletion of a bunch of data.
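To make the hazard concrete, here is a minimal sketch of the recovery decision
described above. All names (CompactionRecovery, safeToRollForward, the marker
strings) are hypothetical, not the real HBase code; the point is only that
roll-forward must check both the WAL *and* the namespace, because a snapshot
that reorders the three steps can present the WAL entry without the new files.

```java
import java.util.Set;

public class CompactionRecovery {
    // Hypothetical recovery check: rolling forward (deleting the compaction
    // sources) is only safe if the WAL says the compaction finished AND the
    // namespace actually contains the compaction result.
    static boolean safeToRollForward(Set<String> wal, Set<String> namespace) {
        return wal.contains("COMPACTION_FINISHED")
            && namespace.contains("compacted-file");
    }

    public static void main(String[] args) {
        Set<String> wal = Set.of("COMPACTION_FINISHED");

        // Consistent snapshot: WAL entry and result file both captured.
        assert safeToRollForward(wal, Set.of("compacted-file"));

        // Inconsistent snapshot: WAL entry captured, result file missing.
        // Blindly rolling forward here would delete the sources while the
        // result does not exist -> data loss.
        assert !safeToRollForward(wal, Set.of("source-file-1", "source-file-2"));
        System.out.println("ok");
    }
}
```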
So I think we need a way to provide barriers between namespace and data layer
modifications. The proposal I made above should achieve this.
Another option is something that we've called "super flush". This would be a
flag on hflush() or hsync() indicating that the new length of the file needs to
be persisted to the NameNode, not just the datanodes. It would be used by
applications like HBase to determine consistency points for file lengths.
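The "super flush" idea can be sketched roughly as follows. SyncFlag,
SketchOutputStream, and the field names are all illustrative assumptions, not
real HDFS APIs; the sketch only shows the distinction between a length that is
durable on the datanodes and one that is also recorded at the NameNode.

```java
import java.util.EnumSet;

public class SuperFlushSketch {
    // Hypothetical flag: also persist the file length to the NameNode.
    enum SyncFlag { UPDATE_LENGTH }

    static class SketchOutputStream {
        long dnPersistedLength = 0;  // length durable on the datanodes
        long nnPersistedLength = 0;  // length recorded at the NameNode
        long written = 0;

        void write(byte[] b) { written += b.length; }

        // Ordinary hsync makes the data durable on the DNs, but the NN may
        // still hold a stale (even zero) length for the UC block. With the
        // flag set, the new length becomes a NameNode-visible consistency
        // point that a snapshot could rely on.
        void hsync(EnumSet<SyncFlag> flags) {
            dnPersistedLength = written;
            if (flags.contains(SyncFlag.UPDATE_LENGTH)) {
                nnPersistedLength = written;  // the "super" part
            }
        }
    }

    public static void main(String[] args) {
        SketchOutputStream out = new SketchOutputStream();
        out.write(new byte[100]);
        out.hsync(EnumSet.noneOf(SyncFlag.class));
        assert out.nnPersistedLength == 0;    // NN still behind the DNs
        out.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        assert out.nnPersistedLength == 100;  // consistency point established
        System.out.println("ok");
    }
}
```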
bq. In fact, communication with DNs when snapshots are being taken will make
the process of taking snapshots very slow while giving very little additional
benefit.
We should distinguish between two types of slowness for snapshots:
1) Slowness while holding a lock. This is unacceptable IMO - we must hold the
lock for a bounded amount of time and never make an RPC while holding the lock.
2) Slowness before a snapshot is available for restore. This is acceptable. For
example, if the user operation "create snapshot" holds the lock for 10ms, but
the snapshot is initially in a "COLLECTING_LENGTHS" state while it waits for
block lengths, that seems acceptable. So long as the lengths are filled in by
the next heartbeat (or two heartbeats from now) it should be complete (and thus
ready for recovery) within the minute. Note that we don't need to wait for a
heartbeat from every datanode. Instead, we just need to wait until, for each
under-construction block in the snapshotted area, _one_ of its replicas has
reported. When snapshotting a subtree without any open files, it would still be
instant.
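The COLLECTING_LENGTHS idea above can be sketched like this (class, state, and
method names are hypothetical, not NameNode code): the snapshot tracks the set
of under-construction blocks it covers and becomes COMPLETE as soon as one
replica of each has reported a length, with no waiting at all when the subtree
has no open files.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SnapshotLengthCollector {
    enum State { COLLECTING_LENGTHS, COMPLETE }

    private final Map<Long, Long> lengths = new HashMap<>(); // blockId -> length
    private final Set<Long> pending;  // UC blocks still awaiting any replica report
    private State state;

    SnapshotLengthCollector(Set<Long> ucBlocks) {
        pending = new HashSet<>(ucBlocks);
        // A subtree with no open files needs no reports: instantly complete.
        state = pending.isEmpty() ? State.COMPLETE : State.COLLECTING_LENGTHS;
    }

    // Called when ANY replica of blockId reports via heartbeat/block report;
    // one replica per block suffices, later replicas are ignored.
    void onReplicaReport(long blockId, long length) {
        if (pending.remove(blockId)) {
            lengths.put(blockId, length);
            if (pending.isEmpty()) state = State.COMPLETE;
        }
    }

    State state() { return state; }

    public static void main(String[] args) {
        SnapshotLengthCollector s = new SnapshotLengthCollector(Set.of(1L, 2L));
        assert s.state() == State.COLLECTING_LENGTHS;
        s.onReplicaReport(1L, 4096);
        s.onReplicaReport(1L, 4096);         // duplicate replica: ignored
        assert s.state() == State.COLLECTING_LENGTHS;
        s.onReplicaReport(2L, 8192);
        assert s.state() == State.COMPLETE;  // ready for restore/recovery
        System.out.println("ok");
    }
}
```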
bq. Additionally, including the sizes of non-finalized blocks in snapshots has
implication that if the client dies and the non-finalized section is discarded,
then snapshot might have pointers to non-existent blocks.
I don't understand what you mean here...can you be more specific about the
scenario?
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
> Key: HDFS-2802
> URL: https://issues.apache.org/jira/browse/HDFS-2802
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, name-node
> Reporter: Hari Mankude
> Assignee: Hari Mankude
> Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point in time images of parts of the filesystem or the entire
> filesystem. Snapshots can be a read-only or a read-write point in time copy
> of the filesystem. There are several use cases for snapshots in HDFS. I will
> post a detailed write-up soon with with more information.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira