[ 
https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinxin fan updated HBASE-18090:
-------------------------------
    Release Note: 
This change makes it possible to run multiple mappers per region in a 
table snapshot. The following code is the original table snapshot mapper 
initialization: 

TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName,      // name of the snapshot (of a table) to read from
    scan,              // Scan instance to control CF and attribute selection
    mapper,            // mapper
    outputKeyClass,    // mapper output key
    outputValueClass,  // mapper output value
    job,               // the current job to adjust
    true,              // upload HBase jars and jars for any of the configured
                       // job classes via the distributed cache (tmpjars)
    restoreDir         // temporary directory to copy the snapshot files into
);

With that call, the job runs only one map task per region in the table 
snapshot. With this feature, a client can specify the desired number of 
mappers per region when initializing the table snapshot mapper job:

TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName,      // name of the snapshot (of a table) to read from
    scan,              // Scan instance to control CF and attribute selection
    mapper,            // mapper
    outputKeyClass,    // mapper output key
    outputValueClass,  // mapper output value
    job,               // the current job to adjust
    true,              // upload HBase jars and jars for any of the configured
                       // job classes via the distributed cache (tmpjars)
    restoreDir,        // temporary directory to copy the snapshot files into
    splitAlgorithm,    // split algorithm; currently supported are
                       // RegionSplitter.UniformSplit() and
                       // RegionSplitter.HexStringSplit()
    n                  // number of input splits to generate per region
);
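
To illustrate what "n input splits per region" means, here is a minimal 
sketch in plain Java of dividing one region's key range into n roughly equal 
sub-ranges, each of which could feed one mapper. It is not the HBase 
implementation: the class UniformRangeSketch and its methods are hypothetical 
names, and the sketch assumes non-empty start and end keys that are treated 
as unsigned big-endian numbers, in the spirit of a uniform split.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of HBase.
class UniformRangeSketch {

    // Returns n + 1 boundary keys dividing [start, end) into n sub-ranges.
    // Assumes start < end and that both keys are non-empty.
    public static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int width = Math.max(start.length, end.length);
        // Right-pad with zero bytes so both keys have the same width,
        // then read them as unsigned big-endian integers.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            // Linear interpolation: lo + span * i / n
            BigInteger point = lo.add(
                span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            keys.add(toFixedWidth(point, width));
        }
        return keys;
    }

    // Converts a non-negative value back to exactly `width` bytes.
    public static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // One region covering [0x00, 0x10), split for 4 mappers.
        List<byte[]> keys = boundaries(new byte[] {0x00}, new byte[] {0x10}, 4);
        for (int i = 0; i < keys.size() - 1; i++) {
            System.out.printf("split %d: [%02x, %02x)%n",
                i, keys.get(i)[0], keys.get(i + 1)[0]);
        }
    }
}
```

As the issue description notes, splitting a region this way helps only if 
keys are distributed reasonably evenly across the region's range; skewed keys 
would leave some mappers with far more rows than others.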


> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --------------------------------------------------------------------------
>
>                 Key: HBASE-18090
>                 URL: https://issues.apache.org/jira/browse/HBASE-18090
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Mikhail Antonov
>            Assignee: xinxin fan
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-18090-V3-master.patch, 
> HBASE-18090-V4-master.patch, HBASE-18090-V5-master.patch, 
> HBASE-18090-branch-1-v2.patch, HBASE-18090-branch-1-v2.patch, 
> HBASE-18090-branch-1.3-v1.patch, HBASE-18090-branch-1.3-v2.patch, 
> HBASE-18090.branch-1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot. 
> This places an unnecessary restriction: the region layout of the original 
> table needs to take the processing resources available to the MR job into 
> consideration. Allowing multiple mappers to run per region (assuming a 
> reasonably even key distribution) would be useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
