This is done on purpose to improve write performance. In practice, we run map/reduce jobs on the cluster, so every node in the cluster gets an equal chance of writing. A single-node data upload like the one described in your email is normally carried out from an off-cluster node, in which case imbalanced data distribution should not be a problem.
Hairong

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 22, 2007 4:18 PM
To: [email protected]
Subject: question on HDFS block distribution

hi guys,

when a file is being copied to HDFS, it seems that HDFS always writes the first copy of each block to the data node running on the machine that invoked the copy, and the data nodes for the replicas are selected evenly from the remaining data nodes. so, for example, on a 5-node cluster with the replication factor set to 2, if i copy an N-byte file from node 1, then node 1 will use up N bytes and nodes 2, 3, 4, 5 will use up N/4 bytes each. is this a known issue, or is there any way to configure HDFS so that the blocks are distributed evenly (so that each node uses up 2*N/5 bytes in this case)?

thanks,
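The arithmetic in the question can be sketched with a small model. This is not Hadoop code; it is a hypothetical illustration of per-node storage under a "first replica goes to the writer's local datanode" policy versus a fully even spread, using the 5-node, replication-factor-2 example from the email. Function names and the node numbering are assumptions made up for this sketch.

```python
# Hypothetical sketch (not Hadoop code): compare per-node bytes stored
# when the first replica of every block lands on the writing node,
# versus a fully even distribution of all replicas.

def local_first_usage(n_bytes, nodes, replication, writer):
    """Bytes stored per node when the first replica of every block is
    written to `writer` and the remaining replicas are spread evenly
    over the other nodes. Returns a dict: node -> bytes."""
    usage = {node: 0.0 for node in range(nodes)}
    usage[writer] += n_bytes            # first replica is always local
    others = nodes - 1
    for node in usage:
        if node != writer:              # remaining replicas split evenly
            usage[node] += n_bytes * (replication - 1) / others
    return usage

def even_usage(n_bytes, nodes, replication):
    """Bytes stored per node if all replicas were distributed evenly."""
    return {node: n_bytes * replication / nodes for node in range(nodes)}

# The example from the email: an N-byte file, 5 nodes, replication 2,
# written from node 0.
N = 1000.0
print(local_first_usage(N, nodes=5, replication=2, writer=0))
# the writer stores N bytes; each of the other 4 nodes stores N/4
print(even_usage(N, nodes=5, replication=2))
# an even spread would give every node 2*N/5 bytes
```

Running the uploader on an off-cluster node, as the reply suggests, removes the `writer` term entirely: no datanode is local to the client, so all replicas are placed across the cluster.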
