Dear folks,

I have a custom implementation of InputSplit that combines multiple blocks (similar to CombineFileInputFormat). Each split can fall into a different "data-locality category":

a) host-local: there is one host that holds a replica of every block
b) rack-local: there is one rack that holds a replica of every block
c) mixed: the blocks come from different racks

I'm wondering what InputSplit#getLocations() should return in each of these cases so that Hadoop can make optimal use of data locality. Should it contain all hosts that hold a replica of any of the blocks, sorted so that the hosts contributing the most data come first? Or should it contain only those hosts that were determined to be optimal for data locality during the splitting process?

For example, in case (a): should the location array contain only this one host, or should it contain all hosts, with the one host that holds all the blocks simply in the first position?

Best regards,
Johannes
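
P.S. To make the first option concrete, here is a minimal sketch of the "all hosts, biggest contributor first" ordering. It is plain Java with no Hadoop dependency; the class SplitLocations, the Block holder and the method locationsByContribution are hypothetical names made up for illustration, and the sketch assumes each host appears at most once per block. It only shows how such an array could be built, not which of the two options Hadoop actually prefers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitLocations {

    /** Hypothetical holder for one block of the combined split. */
    public static final class Block {
        final long length;     // block length in bytes
        final String[] hosts;  // hosts that hold a replica of this block

        public Block(long length, String... hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    /**
     * Returns every host that holds a replica of any block, ordered by the
     * number of bytes it holds (descending). Ties are broken by host name
     * so the result is deterministic.
     */
    public static String[] locationsByContribution(List<Block> blocks) {
        // Sum up the bytes each host could serve locally.
        Map<String, Long> bytesPerHost = new HashMap<>();
        for (Block block : blocks) {
            for (String host : block.hosts) {
                bytesPerHost.merge(host, block.length, Long::sum);
            }
        }

        // Sort hosts: most local bytes first, then alphabetically.
        List<String> hosts = new ArrayList<>(bytesPerHost.keySet());
        hosts.sort((a, b) -> {
            int byBytes = Long.compare(bytesPerHost.get(b), bytesPerHost.get(a));
            return byBytes != 0 ? byBytes : a.compareTo(b);
        });
        return hosts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Case (a): host1 holds a replica of every block.
        List<Block> blocks = Arrays.asList(
            new Block(128, "host1", "host2", "host3"),
            new Block(128, "host1", "host4", "host5"),
            new Block(64, "host1", "host2", "host6"));
        // prints [host1, host2, host3, host4, host5, host6]
        System.out.println(Arrays.toString(locationsByContribution(blocks)));
    }
}
```

The second option would then amount to truncating this array, e.g. returning only host1 in case (a).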
