HDFS internal mechanism questions

Sean Bigdatafun Thu, 07 Oct 2010 22:09:23 -0700

Is there a pointer to documentation that describes the HDFS write path in
detail? In particular, I'd like technical details that would clear up the
following questions:


   * Is there a 64KB per-chunk checksum within each 64MB block (as
described in Section 5.2 of the GFS paper), or does HDFS keep a single
checksum for the whole 64MB block?

   * HDFS's staging strategy ("In fact, initially the HDFS client caches
the file data into a temporary local file. Application writes are
transparently redirected to this temporary local file") seems quite
different from the original GFS paper (see Section 2.3 of the GFS paper:
"neither client nor the chunkserver caches file data"). Can someone help me
understand the difference?

   * Both the HDFS documentation and the GFS paper mention that the
Namenode polls Datanodes periodically (BlockReport) to get their most
up-to-date information. Can someone tell me exactly what information a
"BlockReport" contains, or the class name that I can look up in the
Javadoc? Also, is the block-id used as the file name in the Datanode's
local filesystem? Here is my current understanding:
   --- 1)  I think losing the Namenode metadata can cause total data loss
in an HDFS cluster because a "BlockReport" does not contain the mapping
between HDFS filenames and block-ids (otherwise, the polled data might be
sufficient to reconstruct the overall HDFS metadata view). I'd like to
understand this in more detail.
   --- 2)  The Namenode's metadata contains "{filename, n-th block} -->
block-id" and serves as the final authority (from the checkpoint and edit
log). But the metadata does not contain "block-id --> {machineA, machineB,
machineC}"; instead, the Namenode waits for the BlockReport info from the
Datanodes.
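
To make guesses 1) and 2) concrete, here is a minimal Python sketch of the
model I have in mind. It is only a toy illustration of my assumption, not
actual HDFS code: the class ToyNamenode and all of its method names are
hypothetical, and the real BlockReport format may well differ.

```python
from collections import defaultdict


class ToyNamenode:
    """Toy model of the guess above; hypothetical, not the HDFS implementation."""

    def __init__(self):
        # Persistent part in this model (checkpoint + edit log):
        # filename -> ordered list of block-ids.
        self.file_to_blocks = {}
        # Volatile part in this model: block-id -> set of datanodes,
        # rebuilt purely from block reports.
        self.block_to_nodes = defaultdict(set)

    def add_file(self, filename, block_ids):
        self.file_to_blocks[filename] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # Assumption: a report lists only the block-ids a datanode holds,
        # with no filenames. It replaces that datanode's previous report.
        for nodes in self.block_to_nodes.values():
            nodes.discard(datanode)
        for block_id in block_ids:
            self.block_to_nodes[block_id].add(datanode)

    def locate(self, filename):
        # Join the persistent and the volatile mappings.
        return [
            (block_id, sorted(self.block_to_nodes.get(block_id, ())))
            for block_id in self.file_to_blocks[filename]
        ]


# Normal operation: both mappings are available.
nn = ToyNamenode()
nn.add_file("/logs/a.txt", [101, 102])
nn.receive_block_report("dnA", [101, 102])
nn.receive_block_report("dnB", [101])
located = nn.locate("/logs/a.txt")

# Namenode metadata lost: a fresh Namenode gets the same reports again.
fresh = ToyNamenode()
fresh.receive_block_report("dnA", [101, 102])
fresh.receive_block_report("dnB", [101])
known_blocks = sorted(fresh.block_to_nodes)
known_files = fresh.file_to_blocks
```

In this model the fresh Namenode relearns which machines hold blocks 101 and
102, but known_files stays empty, so it cannot tell which file the blocks
belong to or in what order. That is the situation I suspect in guess 1).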

Sean
