Hi all,

I am working on a research project in which we are looking at algorithms to "optimally" distribute data blocks across HDFS data nodes. The definition of what is optimal is omitted for brevity.
I want to move specific blocks of a file that is *already* in HDFS. I can do this using the data transfer protocol (taking cues from the "Balancer" module), but the operation is very slow. In my cluster setup, moving 1 block of data (approximately 60 MB) from data-node-1 to data-node-2 takes nearly 60 seconds, whereas a "dfs -put" operation that copies the same file from data-node-1's local file system to data-node-2 takes just 1.4 seconds.

Any suggestions on how to speed up the movement of specific blocks? Bringing down the running time is very important for us because this operation may happen while a job is executing.

I am using Hadoop 1.0.4.

Thanks in advance!

Best,
Karthiek
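
P.S. To put the two timings above on the same scale, here is a minimal sketch that converts them to effective throughput. It uses only the figures quoted above and assumes the block is exactly 60 MB (the real block is only approximately that size). The class and method names are illustrative and are not part of Hadoop.

```java
// Illustrative helper, not part of Hadoop: converts a transfer size and
// elapsed time into effective throughput.
final class BlockMoveThroughput {

    // Effective throughput in MB/s for sizeMB megabytes moved in the given time.
    static double throughputMBps(double sizeMB, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return sizeMB / seconds;
    }

    public static void main(String[] args) {
        double blockMB = 60.0;      // assumption: block is "approximately 60 MB"
        double moveSeconds = 60.0;  // block move via the data transfer protocol
        double putSeconds = 1.4;    // "dfs -put" of the same file

        double move = throughputMBps(blockMB, moveSeconds);
        double put = throughputMBps(blockMB, putSeconds);

        System.out.printf("block move: %.1f MB/s%n", move);
        System.out.printf("dfs -put:   %.1f MB/s%n", put);
        System.out.printf("slowdown:   %.1fx%n", put / move);
    }
}
```

Under that assumption the block move runs at roughly 1 MB/s, against roughly 43 MB/s for the put, so the gap is about 43x.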