[
https://issues.apache.org/jira/browse/HBASE-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207636#comment-14207636
]
Weichen Ye commented on HBASE-12394:
------------------------------------
[~stack] Thank you for your review and comments!
1. Supporting multiple mappers for one region is a very good idea for an
improvement. I have a very similar idea, which uses a size-based approach to
create splits for a table. For example, suppose we have a table with 100
regions, one of which is huge. When we run a job with this table as input, 99
mappers complete quickly, and we then have to wait a long time for the last
mapper. In the current version of HBase, we always manually split the large
region to deal with this data skew before we submit the MR job. My idea is to
add a config like "max_split_size" to TableInputFormat, so that large regions
are automatically cut into multiple splits in the MR job. I'll try to make
another patch for this idea. What do you think?
2. About the test and release note: sorry for not including them in the last
patch. I'm working on them now.
3. About HBASE-2302: that feature excludes specific regions from the MR job
for specific reasons. I'm not sure whether this feature is rarely used in
production environments. It is really hard to support both that feature and
the new feature in this issue, because the regions handled by one mapper must
be contiguous: the mapper deals with only one Scan object.
4. About the "if ... else ..." code: it is related to the HBASE-2302 issue
above. My hope is that in "one mapper, one region" mode, HBASE-2302 (excluding
specific regions from the MR job) is supported, and in "one mapper, multiple
regions" mode, it is not. But duplicated code is never good. I'll try to
factor out some code that the if branch and the else branch can share.
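
To make the size-based idea in point 1 concrete, here is a minimal sketch of how a "max_split_size" limit could translate into a number of splits per region. It is not taken from any patch: the class name, the method names, and the choice of ceiling division are assumptions for illustration only, and it does not address how the row-key range of a region would actually be cut.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HBase code: maps region sizes to split counts
// under an assumed "max_split_size" limit.
final class SizeBasedSplitSketch {

    // Number of input splits for one region (assumption: ceiling division,
    // and every region yields at least one split).
    static int splitsForRegion(long regionSizeBytes, long maxSplitSizeBytes) {
        if (maxSplitSizeBytes <= 0) {
            throw new IllegalArgumentException("maxSplitSizeBytes must be positive");
        }
        if (regionSizeBytes <= maxSplitSizeBytes) {
            return 1;
        }
        return (int) ((regionSizeBytes + maxSplitSizeBytes - 1) / maxSplitSizeBytes);
    }

    // Split count for every region of a table, in region order.
    static List<Integer> planSplits(long[] regionSizesBytes, long maxSplitSizeBytes) {
        List<Integer> plan = new ArrayList<>();
        for (long size : regionSizesBytes) {
            plan.add(splitsForRegion(size, maxSplitSizeBytes));
        }
        return plan;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        // Two ordinary regions and one huge region, with a 2 GB limit.
        long[] sizes = {1 * gb, 1 * gb, 10 * gb};
        System.out.println(planSplits(sizes, 2 * gb));
    }
}
```

With a limit like this, the huge region in the 100-region example would be served by several mappers instead of one, which is the skew the manual split works around today.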
> Support multiple regions as input to each mapper in map/reduce jobs
> -------------------------------------------------------------------
>
> Key: HBASE-12394
> URL: https://issues.apache.org/jira/browse/HBASE-12394
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 2.0.0, 0.98.6.1
> Reporter: Weichen Ye
> Attachments: HBASE-12394-v2.patch, HBASE-12394-v3.patch,
> HBASE-12394-v4.patch, HBASE-12394.patch
>
>
> ReviewBoard: https://reviews.apache.org/r/27519/
> The latest patch is "Diff Revision 2 (Latest)".
> On a Hadoop cluster, a job with a large HBase table as input always consumes
> a large amount of computing resources. For example, we need to create a job
> with 1000 mappers to scan a table with 1000 regions. This patch adds support
> for one mapper using multiple regions as input.
> To support multiple regions for one mapper, we need a new configuration
> property: "hbase.mapreduce.scan.regionspermapper".
> hbase.mapreduce.scan.regionspermapper controls how many regions are used as
> input for one mapper. For example, if we have an HBase table with 300 regions
> and set hbase.mapreduce.scan.regionspermapper = 3, a job that scans the table
> will use only 300/3 = 100 mappers.
> In this way, we can control the number of mappers using the following formula:
> Number of Mappers = (Total number of regions) /
> hbase.mapreduce.scan.regionspermapper
> This is an example of the configuration.
> <property>
> <name>hbase.mapreduce.scan.regionspermapper</name>
> <value>3</value>
> </property>
> This is an example of the Java code:
> TableMapReduceUtil.initTableMapperJob(tablename, scan, Map.class, Text.class,
> Text.class, job);
>
>
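
The mapper-count formula in the description can be checked with a small sketch. The description only covers the evenly divisible case (300 regions, 3 per mapper); rounding up when the region count is not a multiple of the setting is an assumption here, as are the class and method names.

```java
// Hypothetical sketch, not HBase code: evaluates
// Number of Mappers = (Total number of regions) / regionspermapper.
final class MapperCountSketch {

    // Assumption: a final, smaller group of leftover regions still needs its
    // own mapper, so the division rounds up.
    static int mappersFor(int totalRegions, int regionsPerMapper) {
        if (totalRegions < 0 || regionsPerMapper <= 0) {
            throw new IllegalArgumentException("invalid region counts");
        }
        return (totalRegions + regionsPerMapper - 1) / regionsPerMapper;
    }

    public static void main(String[] args) {
        System.out.println(mappersFor(300, 3));   // the example from the description
        System.out.println(mappersFor(1000, 3));  // not evenly divisible
    }
}
```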
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)