[
https://issues.apache.org/jira/browse/HDFS-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480757#comment-13480757
]
Todd Lipcon commented on HDFS-2802:
-----------------------------------
Hi Suresh. Yes, I read the design there. In fact, I think that design is based
on my comment on HDFS-3960 from a few weeks ago. But after further thought, I
think that design is too weak. Here's why:
If the first DN in the pipeline had to make an RPC on every hflush, that would
be far too many RPCs. HBase, for example, flushes several hundred times per
second per server, so a 1000-node HBase cluster under heavy load would quickly
take down the NameNode. So instead of the DN RPCing immediately on every
hflush, it has to wait until the next heartbeat and report lengths with the
heartbeat.
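To make the batching concrete, here is a minimal sketch of that pattern
(hypothetical code, not actual HDFS internals; the class and method names are
invented for illustration): each hflush only updates a local counter, and the
periodic heartbeat carries the latest length in a single RPC.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: the DN records each hflushed length locally and
// only reports the latest value on the periodic heartbeat, so N hflushes
// per heartbeat interval cost one RPC instead of N.
public class FlushedLengthReporter {
    private final AtomicLong lastFlushedLength = new AtomicLong(0);
    private long rpcCount = 0;

    // Called on every hflush: a cheap local update, no RPC.
    public void onHflush(long newLength) {
        lastFlushedLength.accumulateAndGet(newLength, Math::max);
    }

    // Called once per heartbeat interval: one RPC carries the latest length.
    public long onHeartbeat() {
        rpcCount++;
        return lastFlushedLength.get();
    }

    public long rpcCount() { return rpcCount; }

    public static void main(String[] args) {
        FlushedLengthReporter r = new FlushedLengthReporter();
        for (int i = 1; i <= 300; i++) {   // ~300 hflushes between heartbeats
            r.onHflush(i * 64L);           // each flush appends 64 bytes
        }
        long reported = r.onHeartbeat();   // a single RPC reports the length
        System.out.println(reported + " bytes reported in " + r.rpcCount() + " RPC(s)");
    }
}
```

At several hundred hflushes per second per server, this turns hundreds of
would-be RPCs per heartbeat interval into one — which is exactly why the
report lags the flush.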
Given this, it may be 5-10 seconds between the hflush and the datanode's
report of the length. This means that the snapshot will get a length which is
either 5-10 seconds too old or 5-10 seconds too new (depending on whether it
uses the last reported length or waits until the next heartbeat to finalize
the snapshot).
A 5-10 second inconsistency window is plenty to break the situation described
above: it's quite likely to get data-layer modifications 1 and 3 without
getting namespace modifications 2 and 4, or vice versa.
On the other hand, the design I proposed above _does_ handle this, because the
DN isn't reporting whatever length it happens to have at heartbeat time.
Instead, it reports a length which is causally consistent with the namespace
from the perspective of the writer of that file.
bq. Todd, another option is to look at the inodesUnderConstruction in the NN
and query the DNs for the exact filesize at the time of taking snapshot
We can't query the DNs while holding the NN lock. It could take several
seconds or longer to contact all the DNs in a loaded 1000+ node cluster, and
potentially tens of seconds if one of the nodes is actually down. So you'd
have to drop the lock, at which point we're back to the above issue of
consistency against concurrent NS modifications.
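A rough back-of-envelope (the numbers below are assumptions for illustration,
not measurements) shows why holding the NN lock across DN queries is a
non-starter:

```java
// Hypothetical estimate of how long the global NN lock would be held
// while querying every DN serially for exact file lengths. Per-node
// latency and the connect timeout are assumed values, not measured ones.
public class LockHoldEstimate {
    public static void main(String[] args) {
        int liveNodes = 999;              // healthy DNs in a 1000-node cluster
        long perNodeMs = 5;               // assumed round-trip per healthy DN
        long deadNodeTimeoutMs = 20_000;  // assumed connect timeout for 1 down DN
        long totalMs = liveNodes * perNodeMs + deadNodeTimeoutMs;
        System.out.println("NN lock held for ~" + totalMs / 1000 + "s");
    }
}
```

Parallelizing the RPCs shrinks the healthy-node term, but the one down node
still pins the lock for its full timeout, and every other namespace operation
blocks behind it for that entire time.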
bq. A better approach is to have the application reach a quiesce point and
then take a snap. This is normally done for oracle (hot backup mode) and
sqlserver so that an application consistent snapshot can be taken.
The difference is that quiescing a single-node or small-cluster database like
SQL Server or RAC is relatively easy. On the other hand, quiescing a 1000 node
HBase cluster would take a while, and I don't think users will really tolerate
a global stop-the-world to make a snapshot. This is especially true for use
cases like DR/backup where you expect to take snapshots as often as once every
few minutes.
> Support for RW/RO snapshots in HDFS
> -----------------------------------
>
> Key: HDFS-2802
> URL: https://issues.apache.org/jira/browse/HDFS-2802
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node, name-node
> Reporter: Hari Mankude
> Assignee: Hari Mankude
> Attachments: snap.patch, snapshot-one-pager.pdf, Snapshots20121018.pdf
>
>
> Snapshots are point in time images of parts of the filesystem or the entire
> filesystem. Snapshots can be a read-only or a read-write point in time copy
> of the filesystem. There are several use cases for snapshots in HDFS. I will
> post a detailed write-up soon with more information.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira