[ 
https://issues.apache.org/jira/browse/HBASE-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870441#action_12870441
 ] 

Andrew Purtell commented on HBASE-2468:
---------------------------------------

bq. For TIF, I don't see any reason to prefetch meta. Each mapper taking 
*input* from a table accesses only a small subset of regions, and scanning the 
whole thing is total overkill

Yes, good point. I was talking about Table*Format in general, which muddied 
the waters.

Your option B is the most interesting use case IMHO, but long-lived clients 
are not the only ones that would benefit: any client writing a reasonable 
amount of data with well-distributed keys would get off the ground faster 
while being more efficient about META accesses.

So the current patch 1) can prefetch ahead a few rows on a META miss and 2) can 
be reworked slightly to give clients an option for preloading region 
locations for a table from META. I suggest we get the current patch cleaned up 
and committed as something which can optionally be enabled for a table:

{code}
HTable table = new HTable("foo");
table.setLocationPrefetch(true);
{code}

Prefetch would scan ahead some small configurable number of rows upon a cache 
miss, as discussed in the comments above.
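
As a rough illustration of read-ahead on a miss, here is a toy model in plain 
Java. It is not HBase code: PrefetchSketch, Region, readAhead and metaScans are 
names made up for this sketch, and a sorted in-memory map stands in for META. 
On a miss the client does one scan starting at the region that contains the 
row, and caches that region plus up to readAhead of the regions that follow.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of read-ahead on a region-location cache miss (not HBase code). */
class PrefetchSketch {
    /** Covers [startKey, endKey); an empty endKey means "to the end of the table". */
    static final class Region {
        final String startKey, endKey, server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String row) {
            return row.compareTo(startKey) >= 0
                && (endKey.isEmpty() || row.compareTo(endKey) < 0);
        }
    }

    // Stand-in for META: region start key -> region, sorted by start key.
    private final TreeMap<String, Region> meta = new TreeMap<>();
    // Client-side location cache, same layout.
    private final TreeMap<String, Region> cache = new TreeMap<>();
    // Hypothetical setting: extra rows to fetch beyond the one that missed.
    private final int readAhead;
    private int metaScans = 0;

    PrefetchSketch(int readAhead) {
        this.readAhead = Math.max(0, readAhead);
    }

    void addRegion(String startKey, String endKey, String server) {
        meta.put(startKey, new Region(startKey, endKey, server));
    }

    /** Returns the server for the region containing row, scanning META only on a miss. */
    String locate(String row) {
        Map.Entry<String, Region> cached = cache.floorEntry(row);
        if (cached != null && cached.getValue().contains(row)) {
            return cached.getValue().server; // cache hit, no META access
        }
        String first = meta.floorKey(row);
        if (first == null) {
            throw new IllegalArgumentException("no region for row " + row);
        }
        metaScans++; // one scan covers the missed region and the read-ahead rows
        int remaining = 1 + readAhead;
        for (Region r : meta.tailMap(first, true).values()) {
            if (remaining-- == 0) {
                break;
            }
            cache.put(r.startKey, r);
        }
        return cache.get(first).server;
    }

    int metaScans() {
        return metaScans;
    }
}
```

With four regions and readAhead set to 2, a first lookup in the first region 
costs one META scan and also caches the next two regions, so lookups that fall 
in those regions are served from the cache.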

Additionally, we could add 
{{HTable#setRegionsInfo(Map<HRegionInfo,HServerAddress> regionMap)}}:

{code}
// getRegionsInfo does not update the region location cache for the table
// we don't want to change that
Map<HRegionInfo,HServerAddress> regionMap = table.getRegionsInfo();
// but we can use the result to do so at the client's option
table.setRegionsInfo(regionMap);
{code}

Regarding whether or not to prefetch all region info for a table when starting 
an MR job: I anticipate it would be a toggle. If TOF did this with a default of 
'true', I think it would help common cases, especially those encountered by 
newcomers. It could/should always be turned off if the operator is running jobs 
with high frequency/concurrency, so that only some limited readahead on miss 
would be active then. But I do think this is a better idea:

bq. So, for the MR case, I think we should provide the option to serialize META 
to disk, put it in the DistributedCache, and then prewarm the meta cache from 
there.

Serialize the result of HTable#getRegionsInfo when setting up the job (in 
{{TableMapReduceUtil}}). Load from DistributedCache and pass to 
HTable#setRegionsInfo in TOF if the jobconf indicates it is available.
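
The DistributedCache idea boils down to a save on the job-setup side and a 
load on the task side. A minimal sketch, assuming the region map can be 
flattened to "region name -> server address" strings. A local file stands in 
for the DistributedCache entry, and RegionMapFile with its save and load 
methods is a made-up helper, not part of HBase or Hadoop:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: ship a region map through a file (not HBase/Hadoop code). */
class RegionMapFile {
    /** Job-setup side: write the region map (region name -> server) to a file. */
    static void save(Map<String, String> regionMap, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            // Copy into a TreeMap so the written type is known to be Serializable.
            out.writeObject(new TreeMap<>(regionMap));
        }
    }

    /** Task side: read the map back so it can be used to prewarm the client cache. */
    @SuppressWarnings("unchecked")
    static Map<String, String> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, String>) in.readObject();
        }
    }
}
```

In TOF, the loaded map would then be handed to the proposed 
HTable#setRegionsInfo once it has been converted back to the real key and 
value types.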


> Improvements to prewarm META cache on clients
> ---------------------------------------------
>
>                 Key: HBASE-2468
>                 URL: https://issues.apache.org/jira/browse/HBASE-2468
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: client
>            Reporter: Todd Lipcon
>            Assignee: Mingjie Lai
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2468-trunk.patch
>
>
> A couple different use cases cause storms of reads to META during startup. 
> For example, a large MR job will cause each map task to hit meta since it 
> starts with an empty cache.
> A couple possible improvements have been proposed:
>  - MR jobs could ship a copy of META for the table in the DistributedCache
>  - Clients could prewarm cache by doing a large scan of all the meta for the 
> table instead of random reads for each miss
>  - Each miss could fetch ahead some number of rows in META

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
