Hi,

I have a large (~300 GB) zip of images that I need to process. My
current workflow is to copy the zip to HDFS, use a custom input format
to read the zip entries, do the processing in the map phase, and then
generate a processing report in the reduce phase. I'm struggling to
tune the job parameters to make everything run smoothly on my cluster,
but I'm also worried that I'm missing a better way of doing the
processing altogether.
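For concreteness, the record reader is essentially the sketch below
(simplified, not my exact code; it uses the standard
org.apache.hadoop.mapreduce API, and the owning InputFormat returns
false from isSplitable(), so a single mapper walks the whole archive
sequentially, which is exactly my parallelism problem):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits one record per zip entry: key = entry name, value = entry bytes.
public class ZipEntryRecordReader extends RecordReader<Text, BytesWritable> {
  private ZipInputStream zip;
  private final Text key = new Text();
  private final BytesWritable value = new BytesWritable();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException {
    Path path = ((FileSplit) split).getPath();
    FileSystem fs = path.getFileSystem(context.getConfiguration());
    zip = new ZipInputStream(fs.open(path));
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null) {
      if (entry.isDirectory()) {
        continue;
      }
      key.set(entry.getName());
      // Buffer the whole entry; fine for images, bad for huge entries.
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[64 * 1024];
      int n;
      while ((n = zip.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      value.set(buf.toByteArray(), 0, buf.size());
      return true;
    }
    return false;
  }

  @Override
  public Text getCurrentKey() { return key; }

  @Override
  public BytesWritable getCurrentValue() { return value; }

  @Override
  public float getProgress() { return 0.0f; }  // no cheap way to estimate

  @Override
  public void close() throws IOException {
    if (zip != null) {
      zip.close();
    }
  }
}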

Does anybody have suggestions for making the processing of the zip
more parallel? The only other idea I had was to convert the zip into a
SequenceFile on upload, but that proved incredibly slow (~30 hours to
upload on my 3-node cluster).
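In case it helps to see what I tried, the conversion was roughly the
following (again a simplified sketch, not my exact code; it reads the
local zip and appends one record per entry through a single writer,
which I suspect is part of why it was so slow):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Reads a local zip (args[0]) and writes a SequenceFile on HDFS
// (args[1]) with key = entry name, value = entry bytes.
public class ZipToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    ZipInputStream zip = new ZipInputStream(
        new BufferedInputStream(new FileInputStream(args[0])));
    try {
      ZipEntry entry;
      byte[] chunk = new byte[64 * 1024];
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int n;
        while ((n = zip.read(chunk)) != -1) {
          buf.write(chunk, 0, n);
        }
        writer.append(new Text(entry.getName()),
            new BytesWritable(buf.toByteArray()));
      }
    } finally {
      writer.close();
      zip.close();
    }
  }
}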

Thanks in advance.

-Andrew
