[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964924#comment-13964924
 ] 

Yu Li commented on HBASE-10932:
-------------------------------

Hi [~jdcryans]
{quote}
What makes RowCounter so special that it's the only MR job that would 
beneficiate from this functionality?
I was pointing at the old TableInputFormatBase to show that it used to do this, 
and that the new one doesn't do it 
{quote}
OK, I get your point now. And yes, we could remove the special InputFormat for 
RowCounter and _*fix*_ the new TableInputFormatBase instead. I only created the 
special InputFormat for RowCounter because, judging from the comments on the 
new TableInputFormatBase's getSplits method, I thought it was designed on 
purpose so that each mapper scans exactly one region...

{quote}
I'm guessing because MR doesn't pass mapred.map.tasks as a hint anymore
{quote}
As I understand it, MR still passes mapred.map.tasks as a hint; the only 
difference is that the value is now carried in the JobContext, so getSplits no 
longer needs a separate int parameter.
As for the command-line parameter used to set the mapred.map.tasks hint, I took 
the distcp command as the reference, which has a dedicated "-m" option:
{noformat}
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
...
-m <arg>               Max number of concurrent maps to use for copy
{noformat}
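
To illustrate the idea, here is a minimal sketch of how a "--maps=N" argument could be turned into the mapred.map.tasks hint. The class and method names are hypothetical, and a plain Map stands in for the Hadoop Configuration so the sketch stays self-contained; it is not the code in the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of HBase or Hadoop.
final class MapsOption {
    // Option name proposed in this issue.
    static final String MAPS_FLAG = "--maps=";
    // Configuration key discussed above.
    static final String MAP_TASKS_KEY = "mapred.map.tasks";

    private MapsOption() {}

    // Returns a copy of conf with the mapper-count hint set when "--maps=N"
    // is present in args. The input map (a stand-in for a Hadoop
    // Configuration, which is an assumption of this sketch) is not modified.
    static Map<String, String> apply(String[] args, Map<String, String> conf) {
        Map<String, String> result = new HashMap<>(conf);
        for (String arg : args) {
            if (arg.startsWith(MAPS_FLAG)) {
                int maps = Integer.parseInt(arg.substring(MAPS_FLAG.length()));
                if (maps <= 0) {
                    throw new IllegalArgumentException("--maps must be positive: " + maps);
                }
                result.put(MAP_TASKS_KEY, Integer.toString(maps));
            }
        }
        return result;
    }
}
```

Arguments that do not start with "--maps=" are ignored here, so the table name and any other options pass through untouched.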

{quote}
Well there's nothing preventing the JobTracker from filling up 4 machines and 
leave one quiet
{quote}
Oh, there's a misunderstanding here. By "real burden for the HBase cluster", I 
didn't mean the CPU burden caused by the MR job but the IO burden caused by 
scan requests. With 25 mappers there would be 25 scan requests, while with 20 
mappers there would be only 20. This is especially useful in a multi-tenant 
environment, when we need to check data integrity for one user after a data 
import but don't want the scan load to slow down the response time of other 
users' requests. Makes sense? :-)
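
To make the 25-versus-20 example concrete, here is a rough sketch of how per-region ranges could be packed into a capped number of mapper inputs. RegionGrouper and its group method are hypothetical names for illustration, and contiguous, near-even grouping is just one possible policy; this is not what the attached patches or TableInputFormatBase actually do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of HBase.
// Packs an ordered list of regions into at most maxMappers groups, so that
// each mapper scans one or more adjacent regions.
final class RegionGrouper {

    private RegionGrouper() {}

    static <T> List<List<T>> group(List<T> regions, int maxMappers) {
        if (maxMappers <= 0) {
            throw new IllegalArgumentException("maxMappers must be positive");
        }
        // Never create more groups than there are regions.
        int groups = Math.min(maxMappers, regions.size());
        List<List<T>> result = new ArrayList<>(groups);
        if (groups == 0) {
            return result;
        }
        int base = regions.size() / groups;   // minimum regions per group
        int extra = regions.size() % groups;  // first 'extra' groups get one more
        int start = 0;
        for (int i = 0; i < groups; i++) {
            int size = base + (i < extra ? 1 : 0);
            // Copy the contiguous slice so the groups are independent lists.
            result.add(new ArrayList<>(regions.subList(start, start + size)));
            start += size;
        }
        return result;
    }
}
```

With 25 regions and a cap of 20, this yields 20 groups: the first 5 hold two adjacent regions each and the remaining 15 hold one each, so at most 20 mappers issue scans.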

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
>
>
> The typical use case of RowCounter is some kind of data integrity 
> check, e.g. after exporting data from an RDBMS to HBase, or from one 
> HBase cluster to another, to make sure the row (record) counts match. Such a 
> check usually has no strict response-time requirement.
> Meanwhile, with the current implementation, RowCounter launches one mapper 
> per region, and each mapper sends one scan request. If the table is fairly 
> big, say tens of regions, and the MR cluster has enough CPU cores to run all 
> the mappers at once, the parallel scan requests sent by the mappers would be 
> a real burden for the HBase cluster.
> So in this JIRA, we propose that rowcounter support an additional 
> option "--maps" to specify the number of mappers, and that each mapper be 
> able to scan more than one region of the target table.



