Hi, I have a large (~300 GB) zip of images that I need to process. My current workflow is to copy the zip to HDFS, use a custom input format to read the zip entries, do the processing in the map phase, and then generate a processing report in the reduce phase. I'm struggling to tune job parameters to make everything run smoothly on my cluster, but I'm also worried that I'm missing a better approach entirely.
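For context, here is a stripped-down sketch of the kind of record reader I mean (simplified and illustrative, not my exact code). The owning InputFormat has to mark the file as non-splittable, since a zip stream can only be read front to back, and that is exactly what serializes everything into one mapper:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits one (entry name, entry bytes) record per file in the archive.
public class ZipEntryRecordReader extends RecordReader<Text, BytesWritable> {
  private ZipInputStream zip;
  private final Text key = new Text();
  private final BytesWritable value = new BytesWritable();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException {
    Path path = ((FileSplit) split).getPath();
    FileSystem fs = path.getFileSystem(context.getConfiguration());
    zip = new ZipInputStream(fs.open(path));
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null) {
      if (entry.isDirectory()) {
        continue;
      }
      key.set(entry.getName());
      // Buffer the whole entry in memory; fine for images,
      // not for arbitrarily large archive members.
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[64 * 1024];
      int n;
      while ((n = zip.read(chunk)) > 0) {
        buf.write(chunk, 0, n);
      }
      byte[] bytes = buf.toByteArray();
      value.set(bytes, 0, bytes.length);
      return true;
    }
    return false;
  }

  @Override public Text getCurrentKey() { return key; }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return 0.0f; } // no cheap estimate
  @Override public void close() throws IOException {
    if (zip != null) {
      zip.close();
    }
  }
}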
Does anybody have suggestions for how to make processing a zip more parallel? The only other idea I had was uploading the zip as a sequence file, so that the entries could be split across mappers, but that proved incredibly slow (~30 hours on my 3-node cluster just to upload; see the sketch below).
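For reference, the conversion I tried looked roughly like this (again a simplified sketch, not my exact code; paths and class names are illustrative). In hindsight the bottleneck is plain: one ZipInputStream feeding one SequenceFile.Writer in a single thread, so the upload can't go faster than a single stream:

import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads a local zip and writes (entry name -> entry bytes) records
// into a SequenceFile on HDFS. Usage: ZipToSequenceFile <local.zip> <hdfs-out>
public class ZipToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    ZipInputStream zip = new ZipInputStream(new FileInputStream(args[0]));
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
    try {
      ZipEntry entry;
      byte[] chunk = new byte[64 * 1024];
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        // Decompress the entry fully, then append it as one record.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int n;
        while ((n = zip.read(chunk)) > 0) {
          buf.write(chunk, 0, n);
        }
        writer.append(new Text(entry.getName()),
                      new BytesWritable(buf.toByteArray()));
      }
    } finally {
      writer.close();
      zip.close();
    }
  }
}

Thanks in advance. -Andrew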
