HDFS-8326. Documentation about when checkpoints are run is out of date. (Misty Stanley-Jones via xyao)


Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/d0e75e60
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/d0e75e60
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/d0e75e60

Branch: refs/heads/HDFS-7240
Commit: d0e75e60fb16ffd6c95648a06ff3958722f71e4d
Parents: f7a74d2
Author: Xiaoyu Yao <[email protected]>
Authored: Fri May 8 14:46:25 2015 -0700
Committer: Xiaoyu Yao <[email protected]>
Committed: Fri May 8 14:46:37 2015 -0700

----------------------------------------------------------------------
 hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt                     | 3 +++
 hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md | 5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/d0e75e60/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
index fedef79..b766e26 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
+++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
@@ -720,6 +720,9 @@ Release 2.8.0 - UNRELEASED
     HDFS-8311. DataStreamer.transfer() should timeout the socket InputStream.
     (Esteban Gutierrez via Yongjun Zhang)
 
+    HDFS-8326. Documentation about when checkpoints are run is out of date.
+    (Misty Stanley-Jones via xyao)
+
 Release 2.7.1 - UNRELEASED
 
   INCOMPATIBLE CHANGES

http://git-wip-us.apache.org/repos/asf/hadoop/blob/d0e75e60/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md
----------------------------------------------------------------------
diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md
index 5a8e366..a30877a 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/HdfsDesign.md
@@ -135,9 +135,10 @@ The Persistence of File System Metadata
 
 The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.
 
-The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing in the near future.
+The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage. Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the Editlog. During the checkpoint the changes from Editlog are applied to the FsImage. A checkpoint can be triggered at a given time interval (`dfs.namenode.checkpoint.period`) expressed in seconds, or after a given number of filesystem transactions have accumulated (`dfs.namenode.checkpoint.txns`). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
+
+The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the _Blockreport_.
 
-The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge about HDFS files. It stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory because the local file system might not be able to efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.
 
 The Communication Protocols
 ---------------------------

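For readers who want to act on the documented behavior: the two properties cited in the patch are set in hdfs-site.xml. Below is a minimal sketch with illustrative values (assumptions, not recommendations); per the new text, whichever threshold is reached first triggers the checkpoint.

    <!-- hdfs-site.xml: checkpoint triggers (illustrative values) -->
    <property>
      <!-- Trigger a checkpoint at most every 3600 seconds (one hour). -->
      <name>dfs.namenode.checkpoint.period</name>
      <value>3600</value>
    </property>
    <property>
      <!-- ...or as soon as 1000000 uncheckpointed transactions accumulate. -->
      <name>dfs.namenode.checkpoint.txns</name>
      <value>1000000</value>
    </property>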