[ 
https://issues.apache.org/jira/browse/HBASE-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693770#action_12693770
 ] 

Samuel Guo commented on HBASE-57:
---------------------------------

Thanks for your comments, Jim.

> Solid performance data evaluating the cost of:
> 1) network access to a block in a different rack
> 2) network access to a block in the same rack but on a different server
> 3) network access to a block on the same server
> 4) direct disk access to a block on the same server
> would be highly useful. If there is little difference between 1, 2, 3 (access 
> to a block through a datanode) then
> locality may not be useful. On the other hand, if there is a significant 
> difference between 1, 2, 3 then we should
> try to exploit locality if we can.

> There is a lot of performance evaluation that needs to be done before we 
> actually take the step of using
> locality-based region assignment. If doing that performance evaluation sounds 
> interesting to you, I think
> that would be a great GSOC project.

Yes, I agree with you. We need to do a detailed analysis of most of the 
behavior of HDFS and HBase before we try locality-based assignment. That 
analysis work will be the main part of my GSOC project.

> Suppose there was one 'hot' datanode that hosted blocks from many regions. 
> Using locality might end up in
> overloading the region server on that node, resulting in poorer performance.

Yes, locality should be applied carefully so as not to overload the region 
server or the datanode. An ideal region assignment places each region close to 
its data to reduce network traffic, while balancing load across region servers 
and datanodes and avoiding disk contention on the same datanode. As you 
suggested, we need to understand the following clearly before doing it:
1) What is the difference in cost when we access data from different locations 
(local, local bypass, remote, remote rack)?
2) Over a region's lifetime, how are its data blocks distributed? How many 
bytes does the region read from the local node, how many from remote nodes, 
and how many from a remote rack?
3) After a balance operation in HDFS, how does 2) change?
4) After some region servers fail, how does 2) change?

I am not yet sure how to do the analysis, but I think I can take these 
questions one by one to make things clear. 
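
As a starting point for 2), here is a rough sketch of how the per-region byte 
distribution could be tallied. It is only an illustration under assumptions: 
the function names, the (host, rack) tuples and the three categories are 
hypothetical and are not HDFS or HBase APIs. It counts where the closest 
replica of each block lives, which is an upper bound on locality and not a 
measurement of where reads were actually served from, and it does not 
distinguish local access through the datanode from local bypass.

```python
from collections import Counter

# Hypothetical location categories (not taken from HDFS or HBase).
LOCAL, SAME_RACK, REMOTE_RACK = "local", "same_rack", "remote_rack"


def classify(reader_host, reader_rack, replicas):
    """Return the closest category among one block's replicas.

    replicas: list of (host, rack) tuples, one per replica of the block.
    """
    if any(host == reader_host for host, _ in replicas):
        return LOCAL
    if any(rack == reader_rack for _, rack in replicas):
        return SAME_RACK
    return REMOTE_RACK


def locality_breakdown(reader_host, reader_rack, blocks):
    """Sum block bytes per category for the blocks of one region.

    blocks: list of (num_bytes, replicas) pairs.
    """
    totals = Counter({LOCAL: 0, SAME_RACK: 0, REMOTE_RACK: 0})
    for num_bytes, replicas in blocks:
        totals[classify(reader_host, reader_rack, replicas)] += num_bytes
    return dict(totals)
```

Running the same tally before and after an HDFS balance operation, or before 
and after a region server failure, would give the deltas asked about in 3) 
and 4).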

> [hbase] Master should allocate regions to regionservers based upon data 
> locality and rack awareness
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-57
>                 URL: https://issues.apache.org/jira/browse/HBASE-57
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.2.0
>            Reporter: stack
>             Fix For: 0.20.0
>
>
> Currently, regions are assigned regionservers based off a basic loading 
> attribute.  A factor to include in the assignment calculation is the location 
> of the region in hdfs; i.e. servers hosting region replicas.  If the cluster 
> is such that regionservers are being run on the same nodes as those running 
> hdfs, then ideally the regionserver for a particular region should be running 
> on the same server as hosts a region replica.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
