[ 
https://issues.apache.org/jira/browse/HBASE-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207636#comment-14207636
 ] 

Weichen Ye commented on HBASE-12394:
------------------------------------

[~stack] Thank you for your review and comments!

1, Supporting multiple mappers for one region is a very good idea for 
improvement. I have a very similar idea, which uses a size-based approach to 
make splits for a table. For example, suppose we have a table with 100 
regions, and one of them is huge. When we run a job with this table as input, 
99 mappers will complete quickly, and we then have to wait a long time for the 
last mapper. In the current version of HBase, we always deal with this data 
skew by manually splitting the large region before we submit the MR job. My 
idea is to add a config like "max_split_size" to TableInputFormat, so that 
large regions can be automatically cut into multiple splits in the MR job. 
I'll try to make another patch for this idea. What do you think?
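
To make the size-based idea concrete, here is a minimal sketch of how the 
number of splits per region could be derived from a "max_split_size" value. 
It is plain Java with no HBase dependency. The class and method names are 
hypothetical and are not part of any patch. Choosing the actual split keys 
inside a large region is a separate problem and is not shown.

```java
// Hypothetical sketch: how many input splits each region would get under a
// "max_split_size" limit. Not taken from any HBase patch.
class SizeBasedSplits {

    /** Number of splits for one region: ceil(size / max), and at least 1. */
    static int splitsForRegion(long regionSizeBytes, long maxSplitSizeBytes) {
        if (maxSplitSizeBytes <= 0) {
            throw new IllegalArgumentException("maxSplitSizeBytes must be positive");
        }
        if (regionSizeBytes <= maxSplitSizeBytes) {
            // Small (or empty) regions keep the one-mapper-per-region behaviour.
            return 1;
        }
        // Ceiling division written so that it cannot overflow.
        return (int) ((regionSizeBytes - 1) / maxSplitSizeBytes + 1);
    }

    /** Total number of splits (and therefore mappers) for a whole table. */
    static int totalSplits(long[] regionSizesBytes, long maxSplitSizeBytes) {
        int total = 0;
        for (long size : regionSizesBytes) {
            total += splitsForRegion(size, maxSplitSizeBytes);
        }
        return total;
    }
}
```

With made-up numbers, 99 regions of 1 GB and one region of 50 GB under a 
10 GB limit would give 99 + 5 = 104 splits, so the huge region no longer 
ends up in a single long-running mapper.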

2, About the test / release note, sorry for not having them in the last patch. 
I'm working on them now. 

3, About HBASE-2302, this feature excludes some specific regions from the MR 
job for some specific reason. I'm not sure whether this feature is rarely used 
in production environments. It is really hard to support both this feature 
and the new feature in this issue, because the multiple regions in one mapper 
must be contiguous: the mapper only deals with one Scan object. 
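
The contiguity constraint can be illustrated with a small sketch. It groups 
regions, already sorted by start key, into runs of N neighbours and gives 
each run the single key range that one Scan would cover. Everything here is 
an illustrative assumption: KeyRange is a stand-in for a region with String 
keys, not an HBase class, and giving the last group fewer regions when the 
count does not divide evenly is my choice, not something the patch specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "one mapper, multiple regions" grouping.
class RegionGrouping {

    /** Half-open key range [start, end); "" means unbounded, as in HBase. */
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Groups regions (sorted by start key) into runs of regionsPerMapper
     * neighbours. Each run collapses to one range, from the first region's
     * start key to the last region's end key, so one Scan can cover it.
     */
    static List<KeyRange> group(List<KeyRange> sortedRegions, int regionsPerMapper) {
        if (regionsPerMapper < 1) {
            throw new IllegalArgumentException("regionsPerMapper must be >= 1");
        }
        List<KeyRange> splits = new ArrayList<>();
        for (int i = 0; i < sortedRegions.size(); i += regionsPerMapper) {
            int last = Math.min(i + regionsPerMapper, sortedRegions.size()) - 1;
            splits.add(new KeyRange(sortedRegions.get(i).start,
                                    sortedRegions.get(last).end));
        }
        return splits;
    }
}
```

Because each group collapses to a single start/end pair, a region in the 
middle of a group cannot be skipped without breaking the range in two. That 
is why excluding regions (HBASE-2302) does not combine easily with this mode.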

4, About the code in the "if ... else ..." style: it is related to the 
HBASE-2302 issue above. I hope that when we use "one mapper one region" mode, 
HBASE-2302 (exclude specific regions from the MR job) will be supported; when 
we use "one mapper multiple regions" mode, the feature in HBASE-2302 will not 
be supported. 

But duplicated code is never good. I'll try to abstract out some code that 
the if branch and the else branch can share.


 

> Support multiple regions as input to each mapper in map/reduce jobs
> -------------------------------------------------------------------
>
>                 Key: HBASE-12394
>                 URL: https://issues.apache.org/jira/browse/HBASE-12394
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 2.0.0, 0.98.6.1
>            Reporter: Weichen Ye
>         Attachments: HBASE-12394-v2.patch, HBASE-12394-v3.patch, 
> HBASE-12394-v4.patch, HBASE-12394.patch
>
>
> Welcome to the ReviewBoard: https://reviews.apache.org/r/27519/
> The latest patch is "Diff Revision 2 (Latest)"
> On a Hadoop cluster, a job with a large HBase table as input always consumes 
> a large amount of computing resources. For example, we need to create a job 
> with 1000 mappers to scan a table with 1000 regions. This patch adds support 
> for one mapper using multiple regions as input.
> In order to support multiple regions for one mapper, we need a new property 
> in the configuration: "hbase.mapreduce.scan.regionspermapper".
> hbase.mapreduce.scan.regionspermapper controls how many regions are used as 
> input for one mapper. For example, if we have an HBase table with 300 regions 
> and we set hbase.mapreduce.scan.regionspermapper = 3, then a job that scans 
> the table will use only 300/3=100 mappers.
> In this way, we can control the number of mappers using the following formula:
> Number of mappers = (total number of regions) / 
> hbase.mapreduce.scan.regionspermapper
> This is an example of the configuration.
> <property>
>      <name>hbase.mapreduce.scan.regionspermapper</name>
>      <value>3</value>
> </property>
> This is an example in Java code:
> TableMapReduceUtil.initTableMapperJob(tablename, scan, Map.class, Text.class, 
> Text.class, job);
>  
>       



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
