Hey Harsh, thanks for you answer - thats what i wanted to know! I wasn't sure after reading discussion in https://issues.apache.org/jira/browse/HADOOP-3293 if it makes sense to provide a) all locations sorted: hostA - 100% datalocality hostB - 80% datalocality hostC - 30% datalocality b) the only the top locations sorted and cut the very poor choices: hostA - 100% datalocality hostB - 80% datalocality c) or only the number one: hostA - 100% datalocality
Based on your input and internal discussion i would tend to (b) or maybe even (c). best regards Johannes On Apr 21, 2011, at 3:58 PM, Harsh J wrote: > Hello again, > > The scheduler does take care of locality with a good algorithm. > However, the String[] of locations is for one given logical split. It > does not matter to the scheduler what your split really contains > (single/multiple block or offsets), it only looks at 'where' to run > the particular split favorably according to the location hostnames > supplied. For every host supplied, it adds an entry to a cache set of > <Node, Set<Local_TIPs>>. > > For a split that's got 2 file offsets of 100% + 60%; giving the 100% > local place alone, would be the best idea. > > Have I understood your question right here? > > On Thu, Apr 21, 2011 at 6:54 PM, Johannes Zillmann > <[email protected]> wrote: >> So how does inputSplit#getLocations() influence the task distribution ? I >> assume a split is assigned in favor to a tasktracker which matches one of >> its location !? >> Assuming that we have a combined split and there is one location with 100% >> data-locality and one location with 60% data-locality. >> If the task-sceheduler doesn't care about the order of specified locations >> wouldn't it be better to specify as split-location only the 100% location so >> that its more likely that the split is assigned to that host ? >> >> >> Johannes >> >> On Apr 21, 2011, at 9:37 AM, Harsh J wrote: >> >>> Hey Johannes, >>> >>> On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann >>> <[email protected]> wrote: >>>> Should it contain all hosts which contains a replica of any of the blocks, >>>> sorted in a way the the hosts which contributes the most data come first ? >>>> Or should it contains only those host which were determined as most >>>> optimal regarding the data-locality during the splitting-process. >>>> >>>> F.e. in case (a). Should the location array only contain this one host, or >>>> should it contain all hosts but the one host with all the blocks should >>>> simply be on the first position ? >>> >>> Its better to send all locations for maximal locality, but the order >>> is not considered AFAIK. Its the order of TT heartbeats at the JT that >>> matters, instead. >>> >>> -- >>> Harsh J >> >> > > > > -- > Harsh J
