use of inputSplit#getLocations()

Johannes Zillmann Wed, 20 Apr 2011 03:08:33 -0700

Dear Folks,

I have a custom implementation of InputSplit which combines multiple blocks 
(similar to CombineFileInputFormat).
Each split can fall into a different "data-locality category":
a) host-local: there is one host which contains one replica of each block
b) rack-local: there is one rack which contains one replica of each block
c) mixed: the blocks come from different racks


I'm wondering what InputSplit#getLocations() should return in each of these 
cases so that Hadoop can make optimal use of data locality.

Should it contain all hosts which hold a replica of any of the blocks, 
sorted so that the hosts which contribute the most data come first?
Or should it contain only those hosts which were determined to be optimal 
with regard to data locality during the splitting process?

For example, in case (a): should the location array contain only this one 
host, or should it contain all hosts, with the one host that holds all the 
blocks simply in the first position?
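
To make the two options concrete, here is a minimal sketch in plain Java with 
no Hadoop dependency. The names Block, rankHostsByBytes and hostsWithAllBlocks 
are hypothetical and only for illustration; either result could be returned 
from getLocations(). The sketch only shows how each array would be built. It 
does not say which one the scheduler makes better use of.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitLocations {

    /** Hypothetical holder for one block: its length and the hosts holding a replica. */
    static final class Block {
        final long length;
        final String[] hosts;

        Block(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Option 1: every host with a replica of any block, most local bytes first. */
    static String[] rankHostsByBytes(List<Block> blocks) {
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                bytesPerHost.merge(host, block.length, Long::sum);
            }
        }
        List<String> hosts = new ArrayList<>(bytesPerHost.keySet());
        // Sort by descending byte count; break ties by host name for a stable result.
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(bytesPerHost.get(b), bytesPerHost.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return hosts.toArray(new String[0]);
    }

    /** Option 2 for case (a): only the hosts that hold a replica of every block. */
    static String[] hostsWithAllBlocks(List<Block> blocks) {
        Map<String, Integer> blocksPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                blocksPerHost.merge(host, 1, Integer::sum);
            }
        }
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : blocksPerHost.entrySet()) {
            // Assumes a host is listed at most once per block.
            if (entry.getValue() == blocks.size()) {
                hosts.add(entry.getKey());
            }
        }
        hosts.sort(String::compareTo);
        return hosts.toArray(new String[0]);
    }
}
```

With three blocks of 100, 50 and 25 bytes on hosts (h1, h2), (h1, h3) and 
(h1, h2), the first method puts h1 first, followed by h2 and h3, while the 
second returns only h1.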

best regards
Johannes
