----- Original Message -----
From: Bharath Ravi <bharathra...@gmail.com>
Date: Wednesday, October 19, 2011 8:16 am
Subject: Re: Load balancing requests in HDFS
To: common-dev@hadoop.apache.org
> Thanks a lot Steve!
>
> ReplicationTargetChooser seems to address load balancing for initially
> placing/laying out data, but it doesn't seem to do active load balancing
> for incoming requests to a datanode: or does it?

For every request, ReplicationTargetChooser will check for good targets to
write to (free space, traffic, thread count on the DN, etc.). DNs update
their statistics via heartbeats, so the NN can check these before actually
choosing the targets to write the data. Hope this clarifies your doubt.

> Also, would you know if there are statistics on how effective
> over-replication is for throughput gain?
> Basically, although one might add more replicas, are they actually
> used effectively to serve incoming requests?

Here, does over-replication mean raising the replication factor?

> On 18 October 2011 12:37, Steve Loughran <ste...@apache.org> wrote:
>
> > On 16/10/11 02:53, Bharath Ravi wrote:
> >
> >> Hi all,
> >>
> >> I have a question about how HDFS load balances requests for
> >> files/blocks:
> >>
> >> HDFS currently distributes data blocks randomly, for balance.
> >> However, if certain files/blocks are more popular than others,
> >> some nodes might get an "unfair" number of requests.
> >> Adding more replicas for these popular files might not help,
> >> unless HDFS explicitly distributes requests fairly among the replicas.
> >
> > Have a look at the ReplicationTargetChooser class; it does take
> > datanode load into account, though its concern is distribution for
> > data availability, not performance.
> >
> > The standard technique for popular files, including MR job JAR files,
> > is to over-replicate. One problem: how to determine what is popular
> > without adding more load on the namenode
>
> --
> Bharath Ravi

Regards,
Uma
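P.S. To illustrate the idea of load-aware target choosing: the sketch below is NOT Hadoop's actual ReplicationTargetChooser, just a toy, dependency-free Java example of the general technique the NN applies, i.e. filter out datanodes whose heartbeat-reported load (e.g. active transfer count) is well above average, then prefer the least-loaded remaining node. The class name, the map of per-DN transfer counts, and the "twice the average" cutoff are all my own assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of load-aware write-target selection (not Hadoop's real
// ReplicationTargetChooser). Each datanode reports a load statistic via
// heartbeats; here that is modeled as a map of name -> active transfers.
public class TargetChooserSketch {

    // Returns the name of the chosen datanode, or null if none qualify.
    static String chooseTarget(Map<String, Integer> activeTransfers) {
        // Compute the average load across all candidate datanodes.
        double avg = activeTransfers.values().stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);
        // Reject nodes loaded more than twice the average (an assumed
        // cutoff, mimicking an "is this a good target?" style check),
        // then pick the least-loaded of the rest.
        return activeTransfers.entrySet().stream()
                .filter(e -> e.getValue() <= 2 * avg)
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        Map<String, Integer> load = new HashMap<>();
        load.put("dn1", 12);
        load.put("dn2", 3);
        load.put("dn3", 40); // hot node, should be skipped
        System.out.println(chooseTarget(load)); // prints dn2
    }
}
```

As for raising the replication factor of a hot file, that can be done per file with the `FileSystem.setReplication(Path, short)` API or the `hdfs dfs -setrep` shell command; whether the extra replicas actually absorb read traffic depends on how clients' block-location choices spread across them, which is the question Bharath is asking.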