I have a Hadoop cluster spanning 2 racks: one rack contains 12 nodes, the other contains 5 nodes.
When I run a really large job, the disks on the 5 nodes fill up much sooner than the disks on the 12 nodes. I believe this is because of HDFS's rack-aware placement: each block written on the 12-node rack gets an off-rack replica, and with only two racks, every one of those replicas lands on the 5-node rack. As a result, my job won't finish successfully, due to full disks on the 5 nodes, even though overall cluster usage is only ~75%. Is there a way to tell Hadoop not to enforce the "place a replica on a different rack" rule?
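For context, the two racks are defined through a topology script (the one pointed at by `net.topology.script.file.name` in core-site.xml). Mine is roughly like the sketch below; the node names and rack paths here are simplified placeholders, not my real hostnames:

```shell
#!/bin/bash
# Simplified sketch of a Hadoop topology script. Hadoop invokes it with one or
# more hostnames/IPs as arguments and reads back one rack path per input host.
# Node names and rack labels are illustrative placeholders.
resolve_rack() {
  case "$1" in
    node0[1-9]|node1[0-2]) echo "/rack-a" ;;      # the 12-node rack
    node1[3-7])            echo "/rack-b" ;;      # the 5-node rack
    *)                     echo "/default-rack" ;; # fallback for unknown hosts
  esac
}

for host in "$@"; do
  resolve_rack "$host"
done
```

So as far as I can tell the topology itself is reported correctly; the problem is purely the placement policy interacting with the lopsided rack sizes.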