[ 
https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480757#comment-13480757
 ] 

Todd Lipcon commented on HDFS-2802:
-----------------------------------

Hi Suresh. Yes, I read the design there. In fact I think the design is based on my 
comment on HDFS-3960 from a few weeks ago. But after thinking about it further, I 
believe that design is too weak. Here's why:

If the first DN in the pipeline has to RPC on every hflush, that would be way 
too many RPCs. HBase for example flushes several hundred times per second per 
server, so a 1000 node HBase cluster under heavy load would quickly take down a 
NameNode. So instead of the DN immediately RPCing on every hflush, it has to 
wait until the next heartbeat and report lengths with the heartbeat.
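As a rough sanity check on the numbers above (this is just back-of-envelope arithmetic; the 300 figure is an assumed stand-in for "several hundred" flushes per second, not a measured value):

```python
# Estimate the NameNode RPC load if the first DN in the pipeline issued an
# RPC on every hflush. Illustrative figures only, taken from the comment:
# "several hundred" flushes/sec per HBase server, 1000 servers.

HFLUSHES_PER_SERVER_PER_SEC = 300   # assumed value for "several hundred"
CLUSTER_SIZE = 1000                 # hypothetical heavily loaded HBase cluster

rpcs_per_sec = HFLUSHES_PER_SERVER_PER_SEC * CLUSTER_SIZE
print(rpcs_per_sec)  # 300000 length-report RPCs/sec against a single NameNode
```

Even with conservative assumptions, that is orders of magnitude beyond what a single NameNode can absorb, which is why the design has to fall back to batching length reports into heartbeats.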

Given this, it may be 5-10 seconds between the hflush and the report of the 
length to the namenode. This means that the snapshot will get a length which is 
either 5-10 seconds too old or 5-10 seconds too new (depending on whether it 
uses the last reported length or waits until the next heartbeat to finalize the 
snapshot).

A 5-10 second inconsistency window is plenty to break the situation described 
above: it's quite likely to get data layer modifications 1 and 3 without getting 
namespace modifications 2 and 4, or vice versa.

On the other hand, the design I proposed above _does_ handle this, because the 
DN isn't reporting whatever length it happens to have at heartbeat time. 
Instead it reports a length which is causally consistent with the namespace 
from the perspective of the writer of that file.

bq. Todd, another option is to look at the inodesUnderConstruction in the NN 
and query the DNs for the exact filesize at the time of taking snapshot

We can't query the DNs while holding the NN lock. It could take several seconds 
or longer to contact all the DNs in a loaded 1000+ node cluster, and 
potentially 10s of seconds if one of the nodes is actually down. So you'd have 
to drop the lock, at which point we're back to the above issue with consistency 
against concurrent NS modifications.

bq.  A better approach is to have the application reach a quiesce point and 
then take a snap. This is normally done for oracle (hot backup mode) and 
sqlserver so that an application consistent snapshot can be taken.

The difference is that quiescing a single-node or small-cluster database like 
SQL Server or RAC is relatively easy. On the other hand, quiescing a 1000 node 
HBase cluster would take a while, and I don't think users will really tolerate 
a global stop-the-world to make a snapshot. This is especially true for use 
cases like DR/backup where you expect to take snapshots as often as once every 
few minutes.
                
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
>                 Key: HDFS-2802
>                 URL: https://issues.apache.org/jira/browse/HDFS-2802
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: data-node, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>         Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point-in-time images of parts of the filesystem or the entire 
> filesystem. Snapshots can be a read-only or a read-write point-in-time copy 
> of the filesystem. There are several use cases for snapshots in HDFS. I will 
> post a detailed write-up soon with more information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
