Hello everyone. I have run into a strange situation with an HDFS operation.
I have a cluster with 1 master and 10 slaves. When I put a file A into HDFS with dfs.replication=10, I can see that every block of file A is replicated on every node. So it seems reasonable to expect that the HDFS file reader can act as a local block reader when I read file A.

However, when I execute hdfs dfs -copyToLocal A /to/my/localDir, the read time is the same as with dfs.replication=1. So I monitored network usage, in particular the data read and written. In both cases (dfs.replication=1 and dfs.replication=10), the network was fully utilized. This suggests that reading the file does not take block location into account.

Is this the expected behavior of HDFS? If so, what does data locality actually mean in HDFS? (We all know about data locality for map tasks.) I would like to know why the two "copyToLocal" cases perform the same.

Thanks!
Yoonmin
// Yoonmin Nam