[jira] [Updated] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region

xinxin fan (JIRA) Thu, 16 Nov 2017 18:23:54 -0800

     [ 
https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


xinxin fan updated HBASE-18090:
-------------------------------
    Release Note: 
This change makes it possible to run multiple mappers per region in a 
table snapshot. The following code is the original table snapshot mapper 
initialization: 

TableMapReduceUtil.initTableSnapshotMapperJob(
          snapshotName,      // The name of the snapshot (of a table) to read from
          scan,              // Scan instance to control CF and attribute selection
          mapper,            // mapper
          outputKeyClass,    // mapper output key
          outputValueClass,  // mapper output value
          job,               // The current job to adjust
          true,              // upload HBase jars and jars for any of the configured
                             // job classes via the distributed cache (tmpjars)
          restoreDir         // a temporary directory to copy the snapshot files into
);

With the call above, the job runs only one map task per region in the table 
snapshot. With this feature, the client can specify the desired number of 
mappers per region when initializing the table snapshot mapper job:

TableMapReduceUtil.initTableSnapshotMapperJob(
          snapshotName,      // The name of the snapshot (of a table) to read from
          scan,              // Scan instance to control CF and attribute selection
          mapper,            // mapper
          outputKeyClass,    // mapper output key
          outputValueClass,  // mapper output value
          job,               // The current job to adjust
          true,              // upload HBase jars and jars for any of the configured
                             // job classes via the distributed cache (tmpjars)
          restoreDir,        // a temporary directory to copy the snapshot files into
          splitAlgorithm,    // algorithm used to split each region; currently only
                             // RegionSplitter.UniformSplit() and
                             // RegionSplitter.HexStringSplit() are supported
          n                  // how many input splits to generate per region
);
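
To illustrate what "n input splits per region" means for a uniform split, 
here is a minimal standalone sketch. It does not call HBase and is not the 
RegionSplitter implementation; the class and method names are hypothetical, 
and it assumes fixed-width binary row keys that are treated as unsigned 
integers. It divides one region's key range into n contiguous sub-ranges, 
each of which could be handed to its own mapper.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of HBase.
class UniformRangeSplitSketch {

    // Returns n + 1 boundary keys that cut [start, end) into n sub-ranges
    // of (nearly) equal width. Keys are read as unsigned big-endian integers.
    public static List<byte[]> splitRange(byte[] start, byte[] end, int n) {
        if (start.length != end.length) {
            throw new IllegalArgumentException("keys must have the same width");
        }
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        BigInteger lo = new BigInteger(1, start);
        BigInteger hi = new BigInteger(1, end);
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger width = hi.subtract(lo);
        List<byte[]> boundaries = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            // lo + width * i / n, using integer division
            BigInteger point = lo.add(
                width.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            boundaries.add(toFixedWidth(point, start.length));
        }
        return boundaries;
    }

    // Converts a non-negative value to exactly `width` bytes, left-padded
    // with zeros and without BigInteger's optional leading sign byte.
    public static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] start = {0x00, 0x00};
        byte[] end = {(byte) 0x80, 0x00};
        // Prints the boundaries 0000, 2000, 4000, 6000, 8000 (one per line),
        // i.e. four sub-ranges for one region.
        for (byte[] boundary : splitRange(start, end, 4)) {
            System.out.println(toHex(boundary));
        }
    }
}
```

Under these assumptions, a snapshot with R regions and n splits per region 
would yield roughly R * n map tasks, which is why a reasonably even key 
distribution within each region matters.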


> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --------------------------------------------------------------------------
>
>                 Key: HBASE-18090
>                 URL: https://issues.apache.org/jira/browse/HBASE-18090
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Mikhail Antonov
>            Assignee: xinxin fan
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-18090-V3-master.patch, 
> HBASE-18090-V4-master.patch, HBASE-18090-V5-master.patch, 
> HBASE-18090-branch-1-v2.patch, HBASE-18090-branch-1-v2.patch, 
> HBASE-18090-branch-1.3-v1.patch, HBASE-18090-branch-1.3-v2.patch, 
> HBASE-18090.branch-1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot. 
> This places an unnecessary restriction: the region layout of the original 
> table needs to take into account the processing resources available to the 
> MR job. Allowing multiple mappers per region (assuming a reasonably even key 
> distribution) would be useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
