[
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964318#comment-13964318
]
Jean-Daniel Cryans edited comment on HBASE-10932 at 4/9/14 4:08 PM:
--------------------------------------------------------------------
bq. Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred
package?
No, I'm suggesting fixing the new TableInputFormatBase so that a single split
can span multiple regions. Why have a different InputFormat for RowCounter?
What makes RowCounter so special that it's the only MR job that would benefit
from this functionality?
I was pointing at the old TableInputFormatBase to show that it used to do this,
and that the new one doesn't do it (I'm guessing because MR doesn't pass
mapred.map.tasks as a hint anymore).
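To make the idea of splits that cross regions concrete, here is a minimal
sketch in Python. It is not the TableInputFormatBase code (old or new); the
function name group_regions and the (start_key, end_key) tuple representation
are assumptions made only for illustration. It groups consecutive regions into
at most the requested number of contiguous splits.

```python
def group_regions(regions, num_splits):
    """Group sorted, contiguous regions into at most num_splits splits.

    regions: list of (start_key, end_key) tuples in table order
             (hypothetical representation, not an HBase type).
    Returns a list of (start_key, end_key) tuples, one per split.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    if not regions:
        return []
    # Never create more splits than there are regions.
    num_splits = min(num_splits, len(regions))
    base, extra = divmod(len(regions), num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        # The first `extra` splits take one additional region each.
        size = base + (1 if i < extra else 0)
        group = regions[start:start + size]
        # A split runs from the first region's start key
        # to the last region's end key.
        splits.append((group[0][0], group[-1][1]))
        start += size
    return splits
```

With 25 regions and a hint of 20, this sketch yields 20 splits, 5 of which
cover two regions each, so the number of mappers follows the hint rather than
the region count.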
bq. IMHO, assuming hbase users have background of MR might not be a good idea.
I understand the concern, but I don't see the value here: it's not like you're
using a more HBase-y concept to describe mappers, since the configuration
parameter is still called "maps". Even if you call it something else, how do
you explain what it does without relying on MR concepts, and how do you decide
how many mappers you need without prior knowledge of MR and your own cluster
setup?
I think this new configuration parameter is better suited to advanced usage,
since setting it correctly requires knowing how your cluster is laid out and
believing you can do better than the default behavior.
Going back to your original problem:
bq. Assuming the table is kind of big like having tens of regions, and the cpu
core number of the whole MR cluster is also enough, the parallel scan requests
sent by mapper would be a real burden for the HBase cluster.
In the MRv1 world you specify the number of mapper slots per machine, so using
this "maps" configuration may or may not lessen the burden on the cluster. For
example, take 5 mapper slots per machine, 5 machines and 25 regions (so
everything fits nicely). By default, you'll get 25 mappers running at the same
time, 5 per machine. Let's say you use this new "maps" configuration and set it
to 20. Well, there's nothing preventing the JobTracker from filling up 4
machines and leaving one quiet (maybe because it's already running something,
etc).
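The arithmetic in that example can be sketched as follows. This is only an
illustration in Python, not a model of the real JobTracker scheduler; the
function names greedy_placement and even_placement are hypothetical. It
contrasts the unlucky case, where machines are filled one after another, with
the even spread one might hope for.

```python
def greedy_placement(machines, slots_per_machine, mappers):
    """Unlucky case: fill each machine's slots before using the next one."""
    placement = []
    remaining = mappers
    for _ in range(machines):
        used = min(slots_per_machine, remaining)
        placement.append(used)
        remaining -= used
    return placement


def even_placement(machines, slots_per_machine, mappers):
    """Hoped-for case: spread mappers as evenly as the slots allow."""
    runnable = min(mappers, machines * slots_per_machine)
    base, extra = divmod(runnable, machines)
    return [base + (1 if i < extra else 0) for i in range(machines)]


# 5 machines, 5 slots each, "maps" set to 20.
print(greedy_placement(5, 5, 20))  # [5, 5, 5, 5, 0]
print(even_placement(5, 5, 20))    # [4, 4, 4, 4, 4]
```

In the greedy case the four busy machines each run exactly as many mappers as
they would with the default of 25, so lowering the mapper count did not reduce
the load on any of them.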
YARN does a much better job at this since it takes into account CPUs and
memory, so it might just solve your problem without requiring additional tuning.
> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>
> Key: HBASE-10932
> URL: https://issues.apache.org/jira/browse/HBASE-10932
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Yu Li
> Assignee: Yu Li
> Priority: Minor
> Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch
>
>
> The typical use case of RowCounter is some kind of data integrity check,
> for example after exporting data from an RDBMS to HBase, or from one HBase
> cluster to another, to make sure the row (record) counts match. Such checks
> usually don't have strict response-time requirements.
> Meanwhile, with the current implementation, RowCounter launches one mapper
> per region, and each mapper sends one scan request. Assuming the table is
> fairly big, with tens of regions, and the MR cluster has enough CPU cores,
> the parallel scan requests sent by the mappers would be a real burden for
> the HBase cluster.
> So in this JIRA, we're proposing to make rowcounter support an additional
> option "--maps" to specify the number of mappers, and to make each mapper
> able to scan more than one region of the target table.