You can copy data from any node, so copying from multiple nodes in parallel will generally give you better throughput (just be sure the nodes aren't writing overlapping files). The master node is updated once the block has been copied its replication number of times. So if the default replication is 3, then all 3 replicas must be active before the master is updated and the data "appears" in the DFS.
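As an illustration, a minimal sketch of pushing a local file into the DFS using the Hadoop FileSystem API; it can be run from any node that can reach the namenode and datanodes, not just the master. The class name, hostname, and paths below are hypothetical, and the replication factor comes from whatever dfs.replication is set to in your configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToDfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // point at the namenode if not already set in hadoop-site.xml
            // conf.set("fs.default.name", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);
            // copy a local log file into the DFS; the write is complete
            // (and visible) once the blocks have been fully replicated
            fs.copyFromLocalFile(new Path("/local/logs/part-0001.log"),
                                 new Path("/user/venkates/logs/part-0001.log"));
            fs.close();
        }
    }

Running several of these copies at once from different nodes, each on a disjoint set of files, is how you would spread the load rather than funneling everything through the master.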

How long the updates take is a function of your server load, network speed, and file size. Generally it is fast.

So the process is: the data is loaded into the DFS, replicas are created, and the master node is updated. In terms of consistency, if the data node crashes before the data is loaded, the data won't appear in the DFS. If the name node crashes before it is updated but all replicas are active, the data will appear once the name node has been restored and updated through block reports. If a single node holding a replica crashes after the namenode has been updated, the data will be re-replicated from one of the other 2 replicas to another node, if one is available.
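If you want to check that a file's blocks have actually reached their replication target (for example after a datanode failure), a rough sketch using the FileSystem API might look like the following; the path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/venkates/logs/part-0001.log");
            FileStatus status = fs.getFileStatus(file);
            short target = status.getReplication();  // requested replication factor
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // hosts currently reported as holding this block
                int live = block.getHosts().length;
                System.out.println("offset " + block.getOffset()
                        + ": " + live + "/" + target + " replicas");
            }
            fs.close();
        }
    }

You can get a similar (and more complete) report for a whole directory tree with the fsck tool, e.g. bin/hadoop fsck /user/venkates/logs, which lists under-replicated blocks.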

Dennis Kubes

Venkates .P.B. wrote:
Am I missing something very fundamental ? Can someone comment on these
queries ?

Thanks,
Venkates P B

On 8/1/07, Venkates .P.B. <[EMAIL PROTECTED]> wrote:

Few queries regarding the way data is loaded into HDFS.

-Is it a common practice to load the data into HDFS only through the
master node ? We are able to copy only around 35 logs (64K each) per minute
in a 2 slave configuration.

-We are concerned about time it would take to update filenames and block
maps in the master node when data is loaded from few/all the slave nodes.
Can anyone let me know how long generally it takes for this update to
happen.

And one more question, what if the node crashes soon after the data is
copied onto it? How is data consistency maintained here?

Thanks in advance,
Venkates P B

