I have a test Hadoop cluster set up using Cloudera. It consists of the NameNode and three DataNodes. When I submit jobs, the tasks pile up on one node instead of round-robining across the different nodes.
I understand that Hadoop tries to run tasks where the data is located, but with only three DataNodes and a replication factor of 3, wouldn't that mean the same data is on every single machine? Why would it not spread the tasks out over all of the machines instead of clumping them onto one and leaving the others idle? Thanks.
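For what it's worth, I've been assuming the blocks really are replicated to all three DataNodes. Here's a rough sketch of how one could double-check that with the HDFS `FileSystem` API (the class name and the path argument are just placeholders for one of the input files; `hdfs fsck <path> -files -blocks -locations` should show the same information from the command line):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // e.g. when run via `hadoop jar`
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder: path to one of the job's input files
        Path path = new Path(args[0]);
        FileStatus status = fs.getFileStatus(path);

        // For each block of the file, print which DataNodes hold a replica.
        // With replication factor 3 on a 3-node cluster, every block
        // should list all three hosts.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d -> hosts: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}
```

If every block lists all three hosts, then data locality shouldn't be the reason the scheduler keeps picking the same node.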
