Hi everyone, I would like to use Hadoop to analyze tens of thousands of images. Ideally, each mapper would get a few hundred images to process, and I would have a few hundred mappers. However, I want each mapper function to run on the machine where its images are stored. How can I achieve that? With text data, creating splits and exploiting locality seems easy.
One option would be to make the input to the map function a text file in which each line contains the name of one image to be processed. This text file is the input to the mapper, so the mapper parses it, reads the image file names, and then opens those files. Unfortunately, one drawback of this scheme is that an image file might be stored on a different machine than the one running the mapper function, and copying the file over the network would be quite inefficient. Any help on this would be great.
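For concreteness, here is a minimal sketch of that scheme using the standard Hadoop mapreduce API (the `processImage` helper is a hypothetical placeholder for the actual analysis, and the output format is just for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper for the file-name-list scheme: with TextInputFormat, each
// call receives one line of the list, i.e. one image's HDFS path.
public class ImageNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Path imagePath = new Path(value.toString().trim());
        FileSystem fs = imagePath.getFileSystem(context.getConfiguration());

        // This is the drawback described above: locality was computed for
        // the text file's splits, not the images, so fs.open() may pull the
        // image's blocks over the network from another datanode.
        try (FSDataInputStream in = fs.open(imagePath)) {
            String result = processImage(in);
            context.write(new Text(imagePath.getName()), new Text(result));
        }
    }

    // Hypothetical placeholder for the real image analysis.
    private String processImage(FSDataInputStream in) throws IOException {
        return "";
    }
}
```

The comment in `map()` marks exactly where the scheme breaks down: the framework schedules the task near the blocks of the *text* file, while the image bytes can live anywhere in the cluster.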
