[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285528#comment-16285528
 ] 

Jerry He commented on HBASE-15482:
----------------------------------

Hi, [~water]

In the 002 patch, you added 'numTopsAtMost' in getBestLocations.  Don't you 
need another 'break' in the loop?  That is, once numTopsAtMost is met, break 
out of the loop.

But again, the new code with 'numTopsAtMost' is probably unnecessary.  The 
comment on the method getBestLocations already explains that you are not very 
likely to get more than 3 hosts with at least 80% 
(hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block 
locality as the top host with the best locality.  So you will break out early 
anyway with the filterWeight check. 
Your first patch's logic is good enough.
The added comment is good.
{code}
// As hostAndWeights is in descending order,
// we could break the loop as long as we meet a weight which is less than
// filterWeight
{code}
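
For reference, here is a minimal standalone sketch of the early break described above. It is not the actual patch: HostWeight, bestLocations and the sample weights are hypothetical stand-ins for the HBase types, and the input list is assumed to be already sorted by weight in descending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BestLocationsSketch {

    // Hypothetical stand-in for a (host, block weight) pair.
    public static final class HostWeight {
        final String host;
        final long weight;

        public HostWeight(String host, long weight) {
            this.host = host;
            this.weight = weight;
        }
    }

    // Assumes sortedDesc is ordered by weight, highest first.
    public static List<String> bestLocations(List<HostWeight> sortedDesc,
                                             double cutoffMultiplier) {
        List<String> locations = new ArrayList<>();
        if (sortedDesc.isEmpty()) {
            return locations;
        }
        HostWeight top = sortedDesc.get(0);
        locations.add(top.host);
        double filterWeight = top.weight * cutoffMultiplier;
        for (int i = 1; i < sortedDesc.size(); i++) {
            HostWeight hw = sortedDesc.get(i);
            // Descending order: once one weight falls below filterWeight,
            // every later one does too, so stop scanning here.
            if (hw.weight < filterWeight) {
                break;
            }
            locations.add(hw.host);
        }
        return locations;
    }

    public static void main(String[] args) {
        List<HostWeight> hosts = Arrays.asList(
            new HostWeight("h1", 100),
            new HostWeight("h2", 85),
            new HostWeight("h3", 60),
            new HostWeight("h4", 10));
        // With a 0.8 multiplier, filterWeight is 80, so only h1 and h2 qualify.
        System.out.println(bestLocations(hosts, 0.8)); // prints [h1, h2]
    }
}
```

In this sketch the loop stops at h3 and never looks at h4, which is why an extra cap on the number of hosts adds little when only a few hosts come close to the top host's weight.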

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-15482
>                 URL: https://issues.apache.org/jira/browse/HBASE-15482
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>            Reporter: Liyin Tang
>            Assignee: Xiang Li
>            Priority: Minor
>             Fix For: 2.1.0
>
>         Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When an MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get the best locality. However, 
> this process may take a long time for large snapshots. 
> In some setups, the computing layer (Spark, Hive or Presto) could run outside 
> of the HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it would be great to have an option to skip calculating the block 
> locations for every job. That would be super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
