[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Pal updated HBASE-24859:
--------------------------------
    Description: 
It has been observed that when a table has a very large number of regions, MR 
jobs consume a lot of memory in the client. This is because we keep the 
region-level information in memory, and the memory-heavy object is TableSplit 
because it carries the Scan object.

However, it looks like TableInputFormat for a single table doesn't need to 
store the scan object in the TableSplit: we do not use it, and all the splits 
are expected to have the exact same scan object. In TableInputFormat we use 
the scan object directly from the MR conf.
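
The idea above can be illustrated with a minimal, self-contained sketch. It 
does not use the HBase API: LeanSplit, HeavySplit, SCAN_KEY and the helper 
methods are hypothetical names, and a plain Map stands in for the MR conf. It 
only shows why holding one scan per split grows with the number of regions, 
while reading a single shared scan from the job configuration does not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitScanSketch {
    // Hypothetical configuration key; stands in for wherever the job stores its scan.
    static final String SCAN_KEY = "sketch.mapreduce.scan";

    /** Hypothetical split holding only region-level information. */
    static final class LeanSplit {
        final String tableName;
        final String startRow;
        final String endRow;

        LeanSplit(String tableName, String startRow, String endRow) {
            this.tableName = tableName;
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /** Hypothetical split that additionally holds its own copy of the serialized scan. */
    static final class HeavySplit {
        final LeanSplit region;
        final String serializedScan;

        HeavySplit(LeanSplit region, String serializedScan) {
            this.region = region;
            this.serializedScan = serializedScan;
        }
    }

    /** Builds one lean split per region, using made-up row boundaries. */
    static List<LeanSplit> leanSplits(int regions) {
        List<LeanSplit> splits = new ArrayList<>();
        for (int i = 0; i < regions; i++) {
            splits.add(new LeanSplit("t1",
                String.format("row-%05d", i), String.format("row-%05d", i + 1)));
        }
        return splits;
    }

    /** Builds one heavy split per region; each gets a distinct copy of the scan string. */
    static List<HeavySplit> heavySplits(int regions, String scan) {
        List<HeavySplit> splits = new ArrayList<>();
        for (LeanSplit region : leanSplits(regions)) {
            // new String(...) models every split holding its own scan instance.
            splits.add(new HeavySplit(region, new String(scan)));
        }
        return splits;
    }

    /** Total number of scan characters held across all heavy splits. */
    static long scanCharsHeld(List<HeavySplit> splits) {
        long total = 0;
        for (HeavySplit split : splits) {
            total += split.serializedScan.length();
        }
        return total;
    }

    /** All splits share one scan, so it is looked up in the job configuration. */
    static String scanFor(Map<String, String> conf, LeanSplit split) {
        return conf.get(SCAN_KEY);
    }

    public static void main(String[] args) {
        String scan = "columns=cf:a,cf:b;filter=PrefixFilter(row-)";
        Map<String, String> conf = new HashMap<>();
        conf.put(SCAN_KEY, scan);

        int regions = 10_000;
        List<HeavySplit> heavy = heavySplits(regions, scan);
        List<LeanSplit> lean = leanSplits(regions);

        System.out.println("scan chars held by heavy splits: " + scanCharsHeld(heavy));
        System.out.println("scan chars held once in the conf: " + scan.length());
        System.out.println("scan for first lean split: " + scanFor(conf, lean.get(0)));
    }
}
```

In this sketch the per-split cost of the scan scales linearly with the region 
count, whereas the lean variant keeps a single copy in the configuration no 
matter how many splits exist.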

  was:
It has been observed that when a table has a very large number of regions, MR 
jobs consume more memory in the client. This is because we keep the 
region-level information in memory, and the memory-heavy object is TableSplit 
because it carries the Scan object.
We can optimize the memory consumption by not loading the region-level 
information if the region is empty, based on a configuration setting.
The default configuration can keep all TableSplits in memory (no change from 
the current behavior), but the configuration can enable the map-reduce job to 
ignore the empty regions. The configuration can be set per MR job. 



> Improve the storage cost for HBase map reduce table splits
> ----------------------------------------------------------
>
>                 Key: HBASE-24859
>                 URL: https://issues.apache.org/jira/browse/HBASE-24859
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>         Attachments: Screen Shot 2020-08-26 at 8.44.34 AM.png, hbase-24859.png
>
>
> It has been observed that when a table has a very large number of regions, 
> MR jobs consume a lot of memory in the client. This is because we keep the 
> region-level information in memory, and the memory-heavy object is 
> TableSplit because it carries the Scan object.
> However, it looks like TableInputFormat for a single table doesn't need to 
> store the scan object in the TableSplit: we do not use it, and all the 
> splits are expected to have the exact same scan object. In TableInputFormat 
> we use the scan object directly from the MR conf.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
