Hello, I am running Hadoop on my 4 nodes system.
Initially, I pick the replication factor of 2, and nearly 100% of map tasks run in local up to 3 nodes, but the ratio drops to 80% if I use all 4 nodes. As my nodes have quite high I/O bandwidth (24 disks per node), but only limited network bandwidth (1 GigE), this really hampers the scalability. Just for test purpose, I increase the replication factor to 4, and check that input data actually has replication factor of 4 with 'hadoop fs -stat %r%n' but find that the ratio is still around 80% for 4 nodes. How this can happen, and which configuration parameter should I change to fix this? Thank you very much, -seunghwa
