Enis Soztutar created HBASE-16894:
-------------------------------------

             Summary: Create more than 1 split per region, generalize 
HBASE-12590
                 Key: HBASE-16894
                 URL: https://issues.apache.org/jira/browse/HBASE-16894
             Project: HBase
          Issue Type: Improvement
            Reporter: Enis Soztutar


A common request from users is to be able to better control how many map tasks 
are created per region. Right now, it is always 1 region = 1 input split = 1 
map task. Same goes for Spark since it uses the TIF. With region sizes as large 
as 50 GBs, it is desirable to be able to create more than 1 split per region.

HBASE-12590 adds a config property for MR jobs to be able to handle skew in 
region sizes. The algorithm is roughly: 
{code}
If (region size >= average size*ratio) : cut the region into two MR input splits
If (average size <= region size < average size*ratio) : one region as one MR 
input split
If (sum of several continuous regions size < average size * ratio): combine 
these regions into one MR input split.
{code}

Although we can set data skew ratio to be 0.5 or something to abuse HBASE-12590 
into creating more than 1 split task per region, it is not ideal. But there is 
no way to create more with the patch as it is. For example we cannot create 
more than 2 tasks per region. 

If we want to fix this properly, we should extend the approach in HBASE-12590, 
and make it so that the client can specify the desired num of mappers, or 
desired split size, and the TIF generates the splits based on the current 
region sizes very similar to the algorithm in HBASE-12590, but a more generic 
way. This also would eliminate the hand tuning of data skew ratio.

We also can think about the guidepost approach that Phoenix has in the stats 
table which is used for exactly this purpose. Right now, the region can be 
split into powers of two assuming uniform distribution within the region. 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to