[jira] [Created] (HBASE-8768) Improve bulk load performance by moving key value construction from map phase to reduce phase.

rajeshbabu (JIRA) Tue, 18 Jun 2013 22:50:02 -0700

rajeshbabu created HBASE-8768:
---------------------------------

             Summary: Improve bulk load performance by moving key value 
construction from map phase to reduce phase.
                 Key: HBASE-8768
                 URL: https://issues.apache.org/jira/browse/HBASE-8768
             Project: HBase
          Issue Type: Improvement
          Components: mapreduce, Performance
            Reporter: rajeshbabu
            Assignee: rajeshbabu



The ImportTSV bulk load approach uses the MapReduce framework. The existing 
mapper and reducer classes used by ImportTSV are TsvImporterMapper.java and 
PutSortReducer.java. The ImportTSV tool parses the tab-separated (by default) 
values from the input files, and the mapper generates a Put object for each 
row from the KeyValues created from the parsed text. PutSortReducer then 
receives the data partitioned by region and sorts the Put objects for each 
region. 

Overheads we can see in the above approach:
==========================================
1) A KeyValue is constructed for each parsed value in a line, which adds extra 
data such as row key, column family and qualifier. This increases the data to 
be shuffled to the reduce phase by around 5x.
The size of the data to be shuffled can be calculated as below:
{code}
 Data to be shuffled = nl*nt*(rl+cfl+cql+vall+tsl+30)
{code}

If we move KeyValue construction to the reduce phase, the size of the data to 
be shuffled becomes the following, which is much smaller than the above:
{code}
 Data to be shuffled = nl*nt*(rl+vall)
{code}

nl - Number of lines in the raw file
nt - Number of tabs or columns including row key.
rl - row length which will be different for each line.
cfl - column family length which will be different for each family
cql - qualifier length
tsl - timestamp length.
vall - length of each parsed value.
30 - bytes for kv size, number of families etc.
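
To make the two formulas concrete, here is a minimal Java sketch that evaluates both. The class and method names are hypothetical, the lengths are treated as averages over the input, and the sample values in main are illustrative assumptions rather than measurements.

```java
public class ShuffleEstimate {
    // Fixed per-KeyValue overhead quoted in the description (kv size etc.).
    static final long KV_OVERHEAD_BYTES = 30;

    /** KeyValues built in the mapper: nl*nt*(rl+cfl+cql+vall+tsl+30). */
    static long mapSideBytes(long nl, long nt, long rl, long cfl, long cql,
                             long vall, long tsl) {
        return nl * nt * (rl + cfl + cql + vall + tsl + KV_OVERHEAD_BYTES);
    }

    /** KeyValues built in the reducer: nl*nt*(rl+vall). */
    static long reduceSideBytes(long nl, long nt, long rl, long vall) {
        return nl * nt * (rl + vall);
    }

    public static void main(String[] args) {
        // Assumed average lengths in bytes; replace with values from real input.
        long nl = 1_000_000, nt = 10, rl = 10, cfl = 2, cql = 5, vall = 5, tsl = 8;

        long before = mapSideBytes(nl, nt, rl, cfl, cql, vall, tsl);
        long after = reduceSideBytes(nl, nt, rl, vall);

        System.out.println("map-side KeyValues:    " + before + " bytes");
        System.out.println("reduce-side KeyValues: " + after + " bytes");
        System.out.println("ratio: " + (double) before / after);
    }
}
```

The ratio depends heavily on how short the values are relative to the row key, family and qualifier, so it is worth plugging in lengths from the actual data set.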

2) On the mapper side we create Put objects by adding all the KeyValues 
constructed for each line, and in the reducer we again collect the KeyValues 
from each Put and sort them.
Instead, we can create and sort the KeyValues directly in the reducer.


Solution:
========
We can improve bulk load performance by moving KeyValue construction from the 
mapper to the reducer, so that the mapper just sends the raw text for each row 
to the reducer. The reducer then parses the records for each row, and creates 
and sorts the KeyValues before writing them to HFiles. 
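
The reducer-side step described above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: Cell is a hypothetical stand-in for an HBase KeyValue, buildSortedCells is a hypothetical helper rather than part of ImportTSV, and the ordering is simplified to row, family and qualifier on strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class ReduceSideKeyValueSketch {
    /** Hypothetical stand-in for a KeyValue; not the real HBase class. */
    static final class Cell {
        final String row, family, qualifier, value;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp + "=" + value;
        }
    }

    // Simplified ordering (assumption): row, then family, then qualifier.
    static final Comparator<Cell> ORDER = Comparator
            .comparing((Cell c) -> c.row)
            .thenComparing(c -> c.family)
            .thenComparing(c -> c.qualifier);

    /**
     * Parses raw tab-separated lines and returns the resulting cells in sorted order.
     * columns[i] is the "family:qualifier" spec for field i; the entry at rowKeyIndex
     * is skipped. Cells with the same row, family and qualifier collapse to the first seen.
     */
    static List<Cell> buildSortedCells(Iterable<String> rawLines, String[] columns,
                                       int rowKeyIndex, long timestamp) {
        TreeSet<Cell> sorted = new TreeSet<>(ORDER);
        for (String line : rawLines) {
            String[] fields = line.split("\t", -1); // keep trailing empty fields
            String row = fields[rowKeyIndex];
            for (int i = 0; i < fields.length && i < columns.length; i++) {
                if (i == rowKeyIndex) {
                    continue;
                }
                String[] fq = columns[i].split(":", 2);
                String qualifier = fq.length > 1 ? fq[1] : "";
                sorted.add(new Cell(row, fq[0], qualifier, timestamp, fields[i]));
            }
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        String[] columns = {"HBASE_ROW_KEY", "d:b", "d:a"};
        List<Cell> cells = buildSortedCells(Arrays.asList("row1\tv1\tv2"), columns, 0, 1L);
        cells.forEach(System.out::println);
    }
}
```

With this shape the mapper only needs to emit the row key and the raw line, and all per-cell objects are created once, in the reducer.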

Conclusion:
===========
The above suggestions will improve map phase performance by avoiding KeyValue 
construction there, and reduce phase performance by reducing the amount of 
data to be shuffled.


