Thanks a lot, Steve! ReplicationTargetChooser seems to address load balancing when initially placing data, but it doesn't seem to do active load balancing for incoming requests to a datanode, or does it?
Also, would you know if there are statistics on how effective over-replication is for throughput gain? Basically, although one might add more replicas, are they actually used effectively to serve incoming requests?

On 18 October 2011 12:37, Steve Loughran <ste...@apache.org> wrote:
> On 16/10/11 02:53, Bharath Ravi wrote:
>> Hi all,
>>
>> I have a question about how HDFS load balances requests for files/blocks:
>>
>> HDFS currently distributes data blocks randomly, for balance.
>> However, if certain files/blocks are more popular than others, some nodes
>> might get an "unfair" number of requests.
>> Adding more replicas for these popular files might not help, unless HDFS
>> explicitly distributes requests fairly among the replicas.
>
> Have a look at the ReplicationTargetChooser class; it does take datanode
> load into account, though its concern is distribution for data
> availability, not performance.
>
> The standard technique for popular files (including MR job JAR files) is to
> over-replicate. One problem: how to determine what is popular without adding
> more load on the namenode.

--
Bharath Ravi
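P.S. To make the throughput question concrete, here is a toy simulation (my own sketch, not HDFS code; the node/replica/request counts are made up) of what over-replication buys you *if* clients pick a replica uniformly at random: with one replica the hot node absorbs 100% of the reads, while with more replicas the busiest node's share drops toward 1/r.

```python
import random

def busiest_node_share(num_nodes=20, num_replicas=3,
                       num_requests=10_000, seed=0):
    """Toy model: one hot block is replicated on `num_replicas` of
    `num_nodes` datanodes; every read goes to a uniformly random
    replica. Returns the busiest node's fraction of all requests."""
    rng = random.Random(seed)
    replicas = rng.sample(range(num_nodes), num_replicas)
    load = {node: 0 for node in replicas}
    for _ in range(num_requests):
        load[rng.choice(replicas)] += 1
    return max(load.values()) / num_requests

if __name__ == "__main__":
    for r in (1, 3, 10):
        print(f"replicas={r:2d}  busiest node share="
              f"{busiest_node_share(num_replicas=r):.3f}")
```

The point of the toy model is the caveat in my question: the extra replicas only help if the read path actually spreads requests across them; if clients all pin to the same (e.g. nearest) replica, the busiest-node share stays near 1.0 regardless of the replication factor. In practice the replication factor itself is raised with `hadoop fs -setrep`.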