[ 
https://issues.apache.org/jira/browse/HBASE-24859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185363#comment-17185363
 ] 

Andrew Kyle Purtell edited comment on HBASE-24859 at 8/26/20, 5:19 PM:
-----------------------------------------------------------------------

bq. It looks like > 99.99% ((432696 * 15210) / 6581387456) of the heap (from the 
ArrayList) is occupied by the serialized scan object. That seems fishy. Firstly, 
it seems to be used only in the "multi" table input format case, and I think we 
can get rid of it in the single-table case. Even in the multi-table input format, 
AFAICT, we don't need to serialize the entire scan object (essentially avoiding 
duplication of start/end keys etc.). We can just set a "mini" optimized scan 
object or amend the TableSplit object to add the required additional fields. WDYT?

This. ^

If we accept this analysis, then region location isn't the dominant use of heap 
anyway; optimization should focus on this unnecessary duplication and waste.
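
As a quick sanity check on the quoted figure, the arithmetic can be redone in a 
few lines. This is only a sketch: it assumes 432696 is an object count, 15210 is 
the size in bytes of each one, and 6581387456 is the total number of bytes 
retained by the list. The quoted comment does not label the numbers, and the 
class and method names below are made up for illustration.

```java
final class HeapShare {
    // Fraction of totalBytes taken up by `count` objects of `bytesEach` bytes each.
    // Assumption: the three figures from the quoted comment map onto these parameters.
    static double fraction(long count, long bytesEach, long totalBytes) {
        return (double) (count * bytesEach) / (double) totalBytes;
    }

    public static void main(String[] args) {
        // Figures taken verbatim from the quoted comment.
        double f = fraction(432696L, 15210L, 6581387456L);
        // Print the share as a percentage with four decimal places.
        System.out.printf("%.4f%%%n", f * 100.0);
    }
}
```

Under those assumptions the share comes out just above 99.99%, consistent with 
the quoted claim.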


was (Author: apurtell):
bq. It looks like > 99.99% ((432696 * 15210) / 6581387456) of the heap (from the 
ArrayList) is occupied by the serialized scan object. That seems fishy. Firstly, 
it seems to be used only in the "multi" table input format case, and I think we 
can get rid of it in the single-table case. Even in the multi-table input format, 
AFAICT, we don't need to serialize the entire scan object (essentially avoiding 
duplication of start/end keys etc.). We can just set a "mini" optimized scan 
object or amend the TableSplit object to add the required additional fields. WDYT?

This. ^

> Remove the empty regions from the hbase mapreduce splits
> --------------------------------------------------------
>
>                 Key: HBASE-24859
>                 URL: https://issues.apache.org/jira/browse/HBASE-24859
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>         Attachments: hbase-24859.png, screenshot-1.png
>
>
> It has been observed that when a table has too many regions, MR jobs 
> consume more memory in the client. This is because we keep region-level 
> information in memory, and the memory-heavy object is TableSplit because of 
> the Scan object it carries.
> We can reduce memory consumption by not loading the region-level 
> information for empty regions, controlled by a configuration setting.
> By default, all TableSplits would still be kept in memory (no change from 
> current behavior), but the configuration can enable the MapReduce job to 
> ignore empty regions. The configuration can be set per MR job.
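
The opt-in behavior described above could look roughly like the following. This 
is a minimal, self-contained sketch and not the HBase implementation: the Region 
class, the regionsToSplit method, and the configuration key are all hypothetical 
stand-ins, and a plain Map is used in place of a Hadoop job configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class EmptyRegionFilter {
    // Hypothetical key; the issue does not name the actual configuration property.
    static final String SKIP_EMPTY_KEY = "example.mapreduce.skip.empty.regions";

    // Hypothetical stand-in for the per-region information a split is built from.
    static final class Region {
        final String name;
        final long sizeBytes;

        Region(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Returns the regions that should get a split. With the flag unset or false,
    // every region is kept, which matches the "no change by default" requirement.
    static List<Region> regionsToSplit(List<Region> regions, Map<String, String> conf) {
        boolean skipEmpty = Boolean.parseBoolean(conf.getOrDefault(SKIP_EMPTY_KEY, "false"));
        List<Region> kept = new ArrayList<>();
        for (Region r : regions) {
            if (skipEmpty && r.sizeBytes == 0) {
                continue; // empty region: no split is created, so nothing is held for it
            }
            kept.add(r);
        }
        return kept;
    }
}
```

Keeping the default at "false" means existing jobs see the same splits as 
before, and only jobs that set the flag drop empty regions.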



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
