[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963852#comment-13963852
 ] 

Yu Li commented on HBASE-10932:
-------------------------------

Hi [~jdcryans] and [~ndimiduk],

Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred
package? It seems to me that this class is deprecated and uses the old mapred
APIs. What's more, when the rowcounter command is issued through the
hbase/hadoop script, it launches the RowCounter in the
o.a.hadoop.hbase.mapreduce package by default.

As for the getSplits method in the new TableInputFormatBase, its method
comment says it is designed to make the number of splits match the number of
regions, so I don't think this is a bug but rather something to improve for
the *in-use* RowCounter:
{code}
  /**
   * Calculates the splits that will serve as input for the map tasks. The
   * number of splits matches the number of regions in a table.
   *
   * @param context  The current job context.
   * @return The list of input splits.
   * @throws IOException When creating the list of splits fails.
   * @see org.apache.hadoop.mapreduce.InputFormat#getSplits(
   *   org.apache.hadoop.mapreduce.JobContext)
   */
{code}
This is exactly why we introduced a new RowCounterInputFormat class that
overrides the getSplits method rather than modifying the existing one.
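
To illustrate the idea of letting one mapper cover several regions, here is a
minimal sketch in plain Java. It is not the code from the attached patch and it
does not use any HBase classes: SplitGrouper and group are hypothetical names,
and the per-region items stand in for whatever the per-region splits would be.
It assumes the items are already ordered by region start key, so each group is
a contiguous run of regions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of HBase: packs per-region items into fewer mapper inputs. */
class SplitGrouper {

    /**
     * Partitions the per-region items (assumed to be in start-key order) into
     * at most maxGroups contiguous groups whose sizes differ by at most one.
     */
    static <T> List<List<T>> group(List<T> perRegion, int maxGroups) {
        if (maxGroups < 1) {
            throw new IllegalArgumentException("maxGroups must be >= 1");
        }
        // Never create more groups than there are regions.
        int groups = Math.min(maxGroups, perRegion.size());
        List<List<T>> result = new ArrayList<>();
        if (groups == 0) {
            return result; // empty table: nothing to scan
        }
        int base = perRegion.size() / groups;   // minimum regions per group
        int extra = perRegion.size() % groups;  // the first 'extra' groups get one more
        int from = 0;
        for (int g = 0; g < groups; g++) {
            int size = base + (g < extra ? 1 : 0);
            result.add(new ArrayList<>(perRegion.subList(from, from + size)));
            from += size;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> regions = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            regions.add("region-" + i);
        }
        // Ten regions requested as three mapper inputs.
        for (List<String> mapperInput : group(regions, 3)) {
            System.out.println(mapperInput);
        }
    }
}
```

In a real input format, each group would presumably become a single split
whose mapper scans the regions in that group one after another; how the patch
actually does this is not shown here.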

As to the new parameter: yes, the user could pass -Dmapred.map.tasks, but I
think it is better to add an explicit parameter so the user can see how it
works from the usage message. IMHO, assuming that HBase users have an MR
background might not be a good idea.

> Improve RowCounter to allow mapper number set/control
> -----------------------------------------------------
>
>                 Key: HBASE-10932
>                 URL: https://issues.apache.org/jira/browse/HBASE-10932
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Yu Li
>            Assignee: Yu Li
>            Priority: Minor
>         Attachments: HBASE-10932_v1.patch
>
>
> The typical use case of RowCounter is some kind of data integrity
> check, for example after exporting data from an RDBMS to HBase, or from one
> HBase cluster to another, to make sure the row (record) counts match. Such
> checks usually don't have strict response-time requirements.
> Meanwhile, with the current implementation, RowCounter launches one mapper
> per region, and each mapper sends one scan request. If the table is fairly
> big, say with tens of regions, and the MR cluster has enough CPU cores to run
> all the mappers at once, the parallel scan requests sent by the mappers can be
> a real burden for the HBase cluster.
> So in this JIRA, we propose adding an option "--maps" to rowcounter to
> specify the number of mappers, and making each mapper able to scan
> more than one region of the target table.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
