Re: use of inputSplit#getLocations()

Johannes Zillmann Tue, 03 May 2011 08:57:29 -0700

Hey Harsh,

Thanks for your answer - that's what I wanted to know!
I wasn't sure, after reading the discussion in 
https://issues.apache.org/jira/browse/HADOOP-3293, whether it makes sense to provide 
a) all locations sorted:
        hostA - 100% datalocality
        hostB - 80% datalocality
        hostC - 30% datalocality
b) only the top locations, sorted, with the very poor choices cut:
        hostA - 100% datalocality
        hostB - 80% datalocality
c) or only the number one:
        hostA - 100% datalocality


Based on your input and internal discussion, I would lean toward (b) or maybe 
even (c).
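
To make option (b) concrete, here is a minimal sketch of how the location list could be built. It is only an illustration under my own assumptions: the class SplitLocations, the method topLocations, and the 50% cutoff are hypothetical and not part of Hadoop. The idea is that the returned array is what a custom split would hand back from getLocations().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: picks the hosts a split
// would report from getLocations() under option (b).
final class SplitLocations {

    /**
     * @param localBytesPerHost bytes of the split stored locally on each host
     * @param splitLength       total length of the split in bytes
     * @param minLocality       cutoff as a fraction in [0, 1], e.g. 0.5
     * @return hosts at or above the cutoff, best locality first
     */
    static String[] topLocations(Map<String, Long> localBytesPerHost,
                                 long splitLength,
                                 double minLocality) {
        List<Map.Entry<String, Long>> kept = new ArrayList<>();
        for (Map.Entry<String, Long> e : localBytesPerHost.entrySet()) {
            double locality =
                splitLength == 0 ? 0.0 : (double) e.getValue() / splitLength;
            if (locality >= minLocality) {
                kept.add(e); // drop the very poor choices
            }
        }
        // Sorting is cosmetic if the scheduler ignores the order, as
        // suggested in this thread; it only keeps the list readable.
        kept.sort(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()));

        String[] hosts = new String[kept.size()];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = kept.get(i).getKey();
        }
        return hosts;
    }

    public static void main(String[] args) {
        // The example from above: hostA 100%, hostB 80%, hostC 30%.
        Map<String, Long> local = new LinkedHashMap<>();
        local.put("hostC", 30L);
        local.put("hostA", 100L);
        local.put("hostB", 80L);
        // prints "hostA, hostB"
        System.out.println(String.join(", ", topLocations(local, 100L, 0.5)));
    }
}
```

Raising the cutoff to 1.0 gives option (c), and lowering it to 0.0 gives option (a).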
best regards
Johannes


On Apr 21, 2011, at 3:58 PM, Harsh J wrote:

> Hello again,
> 
> The scheduler does take care of locality with a good algorithm.
> However, the String[] of locations is for one given logical split. It
> does not matter to the scheduler what your split really contains
> (single/multiple block or offsets), it only looks at 'where' to run
> the particular split favorably according to the location hostnames
> supplied. For every host supplied, it adds an entry to a cache set of
> <Node, Set<Local_TIPs>>.
> 
> For a split that's got 2 file offsets of 100% + 60%, giving the 100%
> local place alone would be the best idea.
> 
> Have I understood your question right here?
> 
> On Thu, Apr 21, 2011 at 6:54 PM, Johannes Zillmann
> <[email protected]> wrote:
>> So how does inputSplit#getLocations() influence the task distribution? I 
>> assume a split is preferentially assigned to a tasktracker which matches 
>> one of its locations!?
>> Assume we have a combined split with one location at 100% data-locality 
>> and one location at 60% data-locality.
>> If the task scheduler doesn't care about the order of the specified 
>> locations, wouldn't it be better to specify only the 100% location as the 
>> split location, so that it's more likely that the split is assigned to 
>> that host?
>> 
>> 
>> Johannes
>> 
>> On Apr 21, 2011, at 9:37 AM, Harsh J wrote:
>> 
>>> Hey Johannes,
>>> 
>>> On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann
>>> <[email protected]> wrote:
>>>> Should it contain all hosts which contain a replica of any of the blocks, 
>>>> sorted so that the hosts which contribute the most data come first?
>>>> Or should it contain only those hosts which were determined to be 
>>>> optimal regarding data-locality during the splitting process?
>>>> 
>>>> E.g. in case (a): should the location array contain only this one host, 
>>>> or should it contain all hosts, with the one host that has all the blocks 
>>>> simply in the first position?
>>> 
>>> It's better to send all locations for maximal locality, but the order
>>> is not considered AFAIK. It's the order of TT heartbeats at the JT that
>>> matters, instead.
>>> 
>>> --
>>> Harsh J
>> 
>> 
> 
> 
> 
> -- 
> Harsh J
