I just discovered some weird behavior on my cluster. If I start a MapReduce job with input files much smaller than my block size, each input file is turned into two input splits containing identical content, which in effect doubles every record I try to process. If I manually set mapred.min.split.size to my block size, I'm back to one split per file, as I expected.
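For reference, the workaround looks roughly like this (the jar path, input/output paths, and mapper are just placeholders for my actual setup; the only part that matters is the -D option):

    # 134217728 bytes = 128 MB, my block size
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.min.split.size=134217728 \
        -input /user/david/input \
        -output /user/david/output \
        -mapper my_mapper.py \
        -file my_mapper.py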
The input files are gzipped text, and I'm processing them with Hadoop Streaming. Any debugging suggestions?

Cheers,
David
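P.S. In case it helps, this is roughly how I've been confirming the doubling (paths and the identity-mapper setup are just illustrative):

    # line count of one gzipped input file
    zcat part-000.gz | wc -l

    # map-only identity pass over the same file
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/david/input/part-000.gz \
        -output /user/david/identity-out \
        -mapper cat

    # count the records the job actually saw -- comes out at exactly twice the input
    hadoop fs -cat /user/david/identity-out/part-* | wc -l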
