[jira] [Comment Edited] (HBASE-10932) Improve RowCounter to allow mapper number set/control

Jean-Daniel Cryans (JIRA) Wed, 09 Apr 2014 09:10:33 -0700

    [ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964318#comment-13964318 ]


Jean-Daniel Cryans edited comment on HBASE-10932 at 4/9/14 4:08 PM:
--------------------------------------------------------------------

bq. Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred 
package?

No, I'm suggesting fixing the new TableInputFormatBase so that it can produce
splits that cross region boundaries. Why have a different InputFormat for
RowCounter? What makes RowCounter so special that it's the only MR job that
would benefit from this functionality?

I was pointing at the old TableInputFormatBase to show that it used to do this, 
and that the new one doesn't do it (I'm guessing because MR doesn't pass 
mapred.map.tasks as a hint anymore).
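
To make "splits that cross regions" concrete, here is a minimal sketch of the idea in Python. It is not the actual code of either TableInputFormatBase; the function name group_regions, the num_maps parameter, and the region keys are all made up for illustration. It assumes the regions are already sorted by start key and that the requested mapper count is only a hint, capped at the number of regions.

```python
from typing import List, Tuple

# (start_key, end_key); an empty key means "open-ended", as in HBase.
Region = Tuple[bytes, bytes]


def group_regions(regions: List[Region], num_maps: int) -> List[Region]:
    """Merge adjacent regions into at most num_maps contiguous key ranges."""
    if not regions:
        return []
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    # Never produce more splits than there are regions.
    n_splits = min(num_maps, len(regions))
    base, extra = divmod(len(regions), n_splits)
    splits = []
    start = 0
    for i in range(n_splits):
        # The first `extra` splits take one more region, so split sizes
        # differ by at most one region.
        end = start + base + (1 if i < extra else 0)
        splits.append((regions[start][0], regions[end - 1][1]))
        start = end
    return splits


# Hypothetical table with 5 regions; the keys are illustrative only.
regions = [(b"", b"b"), (b"b", b"d"), (b"d", b"f"), (b"f", b"h"), (b"h", b"")]
for start_key, end_key in group_regions(regions, 2):
    print(start_key, end_key)
```

With 5 regions and a hint of 2, the first split covers three regions and the second covers two. A real InputFormat would also have to pick a location for each merged split, which this sketch ignores.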

bq. IMHO, assuming hbase users have background of MR might not be a good idea.

I understand the concern, but I don't see the value here: it's not like
you're trying to use a more HBase-y concept to describe mappers, since the
configuration parameter is still called "maps". Even if you call it something
else, how do you then explain what it does without relying on MR concepts, and
how do you decide how many mappers you need without prior knowledge of
MR and your own cluster setup?

I think this new configuration parameter is more suitable for advanced usage,
since to set it correctly you need to know how your cluster is laid out and
believe you can do better than the default behavior.

Going back to your original problem:

bq. Assuming the table is kind of big like having tens of regions, and the cpu 
core number of the whole MR cluster is also enough, the parallel scan requests 
sent by mapper would be a real burden for the HBase cluster.

In the MRv1 world you specify the number of mapper slots per machine, so using
this "maps" configuration may or may not lessen the burden on the cluster. For
example, take 5 mapper slots per machine, 5 machines and 25 regions (so
everything fits nicely). By default, you'll get 25 mappers running at the same
time, 5 per machine. Let's say you use this new "maps" configuration and set it
to 20. Well, there's nothing preventing the JobTracker from filling up 4
machines and leaving one quiet (maybe because it's already running something
else, etc.).
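
The arithmetic in that example can be checked with a small Python sketch. It does not model how the JobTracker actually schedules tasks; fill_first and spread_evenly are hypothetical helpers that just compute two possible placements of the mappers onto machines, using the numbers above (5 machines, 5 slots each).

```python
from typing import List


def fill_first(num_mappers: int, machines: int, slots: int) -> List[int]:
    """Pack mappers onto one machine at a time until its slots are full."""
    counts = []
    remaining = min(num_mappers, machines * slots)
    for _ in range(machines):
        used = min(slots, remaining)
        counts.append(used)
        remaining -= used
    return counts


def spread_evenly(num_mappers: int, machines: int, slots: int) -> List[int]:
    """Distribute mappers as evenly as possible across machines."""
    total = min(num_mappers, machines * slots)
    base, extra = divmod(total, machines)
    return [base + (1 if i < extra else 0) for i in range(machines)]


# Default behavior: 25 regions, so 25 mappers and every slot is busy.
print(fill_first(25, 5, 5))
# "maps" set to 20, packed placement: four machines full, one idle.
print(fill_first(20, 5, 5))
# "maps" set to 20, even placement: one free slot per machine.
print(spread_evenly(20, 5, 5))
```

In the packed placement the four busy machines each still run 5 mappers, exactly as in the default case, so the per-machine load only drops if the scheduler happens to spread the tasks out.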

YARN does a much better job at this since it takes into account CPUs and 
memory, so it might just solve your problem without requiring additional tuning.


was (Author: jdcryans):
bq. Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred 
package?

No, I'm suggesting fixing the new TableInputFormatBase so that it can produce
splits that cross region boundaries. Why have a different InputFormat for
RowCounter? What makes RowCounter so special that it's the only MR job that
would benefit from this functionality?

I was pointing at the old TableInputFormatBase to show that it used to do this, 
and that the new one doesn't do it (I'm guessing because MR doesn't pass 
mapred.map.tasks as a hint anymore).

bq. IMHO, assuming hbase users have background of MR might not be a good idea.

I understand the concern, but I don't see the value here: it's not like
you're trying to use a more HBase-y concept to describe mappers, since the
configuration parameter is still called "maps". Even if you call it something
else, how do you then explain what it does without relying on MR concepts, and
how do you decide how many mappers you need without prior knowledge of
MR and your own cluster setup?

I think this new configuration parameter is more suitable for advanced usage,
since to set it correctly you need to know how your cluster is laid out and
believe you can do better than the default behavior.

Going back to your original problem:

bq. Assuming the table is kind of big like having tens of regions, and the cpu 
core number of the whole MR cluster is also enough, the parallel scan requests 
sent by mapper would be a real burden for the HBase cluster.

In the MRv1 world you specify the number of mapper slots per machine, so using
this --maps configuration may or may not lessen the burden on the cluster. For
example, take 5 mapper slots per machine, 5 machines and 25 regions (so
everything fits nicely). By default, you'll get 25 mappers running at the same
time, 5 per machine. Let's say you use this new "--maps" configuration and set
it to 20. Well, there's nothing preventing the JobTracker from filling up 4
machines and leaving one quiet (maybe because it's already running something
else, etc.).

YARN does a much better job at this since it takes into account CPUs and 
memory, so it might just solve your problem without requiring additional tuning.

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
>
>
> The typical use case of RowCounter is to do some kind of data integrity 
> checking, like after exporting some data from RDBMS to HBase, or from one 
> HBase cluster to another, making sure the row(record) number matches. Such 
> check commonly won't require much on response time.
> Meanwhile, based on current impl, RowCounter will launch one mapper per 
> region, and each mapper will send one scan request. Assuming the table is 
> kind of big like having tens of regions, and the cpu core number of the whole 
> MR cluster is also enough, the parallel scan requests sent by mapper would be 
> a real burden for the HBase cluster.
> So in this JIRA, we're proposing to make rowcounter support an additional 
> option "--maps" to specify mapper number, and make each mapper able to scan 
> more than one region of the target table.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
