Hi, I have a question about how to efficiently access multiple files during the Reduce phase. The reducer receives <key, list of values> pairs, where each key identifies a different file and each value says where to look in that file. The files are .png images.
I have tried using the DistributedCache: I copy all the files to HDFS, and then during the Reduce phase I call Path[] localFiles = DistributedCache.getLocalCacheFiles(configuration); pick the path of the file I need from localFiles, and process it. However, copying the files to HDFS is taking a long time. Would it be better to leave the files on the local file system and open them directly during the Reduce phase? I don't know if this is possible, though. More generally, how can I efficiently access multiple files during the Map or Reduce phase? Is DistributedCache the best way? Thanks.
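To make the lookup step concrete, here is a rough sketch of how I pick the right local file for a key. It is only an illustration under some assumptions: the reducer key is taken to be the image's file name, CacheLookup, indexByName and resolve are hypothetical names of my own (not Hadoop APIs), and java.nio.file.Path stands in for Hadoop's Path so the snippet runs without Hadoop on the classpath. In the real job the array would come from DistributedCache.getLocalCacheFiles(configuration).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: maps reducer keys to local cache files.
final class CacheLookup {
    private CacheLookup() {}

    // Build a file-name -> local-path index once, instead of scanning
    // the localFiles array again for every key the reducer sees.
    static Map<String, Path> indexByName(Path[] localFiles) {
        Map<String, Path> index = new HashMap<>();
        if (localFiles == null) {
            return index; // tolerate a missing array
        }
        for (Path p : localFiles) {
            Path name = p.getFileName();
            if (name != null) {
                index.put(name.toString(), p);
            }
        }
        return index;
    }

    // Assumption: the reducer key is exactly the image's file name, e.g. "b.png".
    static Path resolve(Map<String, Path> index, String key) {
        Path p = index.get(key);
        if (p == null) {
            throw new IllegalArgumentException("No cached file for key: " + key);
        }
        return p;
    }
}
```

The idea would be to call indexByName once when the reducer starts and then resolve(index, key) for each key, so that finding the file is a single map lookup. This only addresses the lookup, not the time spent copying the files to HDFS.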