[
https://issues.apache.org/jira/browse/HBASE-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wymer updated HBASE-5140:
------------------------------
Tags: mapreduce splits tableinputformat
Fix Version/s: 0.90.4
Labels: mapreduce split (was: )
Affects Version/s: 0.90.4
Release Note: Used the 0.90 branch for the patch, but the code looks
compatible with trunk as well (with one deprecated method)
Status: Patch Available (was: Open)
This change introduces two new properties: hbase.mapreduce.splitsPerRegion and
hbase.mapreduce.splitKeyBytePrecision.
Setting hbase.mapreduce.splitsPerRegion to a value greater than 1 results in
each region being divided into that number of splits. If the property is unset
or set to 1, TableInputFormat behaves as usual (one split per region).
hbase.mapreduce.splitKeyBytePrecision sets the byte length (64 by default) of
the max-value byte array that stands in for a region's end key when that key is
of zero length (e.g. the region that contains the last row). This is required
in order to estimate split points for that region. If keys are evenly
distributed, this should result in fairly equal splits.
The Bytes.split utility is used to split the range between the start and end
keys n times.
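
To make the idea concrete, here is a rough, standalone sketch of dividing a
key range into n parts, including the max-value stand-in for an empty end key.
It is not the code from the patch and it does not call Bytes.split; the class
and method names (KeyRangeSplitter, boundaries, toFixedWidth) are made up for
illustration, and it uses BigInteger arithmetic on zero-padded keys as an
assumed way to interpolate between two byte[] values.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration only; not part of HBase or of the attached patch.
public class KeyRangeSplitter {

    // Returns n + 1 boundary keys: the start key, n - 1 interior points and
    // the end key. Consecutive pairs describe the n sub-ranges of the region.
    static byte[][] boundaries(byte[] start, byte[] end, int n, int precision) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        // An empty end key marks the last region; substitute a max-value key
        // of 'precision' bytes so there is an upper bound to interpolate to.
        if (end.length == 0) {
            end = new byte[precision];
            Arrays.fill(end, (byte) 0xFF);
        }
        // Right-pad both keys with zero bytes to a common width so that the
        // numeric order of the padded values matches byte-wise key order.
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start key must sort before end key");
        }
        BigInteger span = hi.subtract(lo);
        byte[][] out = new byte[n + 1][];
        for (int i = 0; i <= n; i++) {
            // lo + span * i / n, using integer division (rounds down).
            BigInteger point = lo.add(
                span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            out[i] = toFixedWidth(point, width);
        }
        return out;
    }

    // Converts a non-negative BigInteger to exactly 'width' bytes, dropping
    // the sign byte BigInteger may prepend and left-padding short values.
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        // Last region (empty end key), 4 splits, 1-byte precision for brevity.
        byte[][] keys = boundaries(new byte[] {0x00}, new byte[0], 4, 1);
        for (byte[] key : keys) {
            System.out.printf("%02x%n", key[0] & 0xFF);
        }
    }
}
```

With evenly distributed keys, sub-ranges of equal width should hold similar
row counts; with skewed keys they will not, which is why this can only be a
guess for the last region.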
> TableInputFormat subclass to allow N number of splits per region during MR
> jobs
> -------------------------------------------------------------------------------
>
> Key: HBASE-5140
> URL: https://issues.apache.org/jira/browse/HBASE-5140
> Project: HBase
> Issue Type: New Feature
> Components: mapreduce
> Affects Versions: 0.90.4
> Reporter: Josh Wymer
> Priority: Trivial
> Labels: mapreduce, split
> Fix For: 0.90.4
>
> Attachments:
> Added_functionality_to_split_n_times_per_region_on_mapreduce_jobs.patch
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> With regard to [HBASE-5138|https://issues.apache.org/jira/browse/HBASE-5138], I
> am working on a subclass of the TableInputFormat class that overrides
> getSplits in order to generate N splits per region and/or N splits per job.
> The idea is to convert the startKey and endKey for each
> region from byte[] to BigDecimal, take the difference, divide by N, convert
> back to byte[] and generate splits on the resulting values. Assuming your
> keys are evenly distributed, this should generate splits with nearly the same
> number of rows per split. Any suggestions on this issue are welcome.
--
This message is automatically generated by JIRA.