I just discovered some weird behavior on my cluster. If I start up a
mapreduce job with input files much smaller than my block size, each
input file is translated into two input splits, containing identical
content. This is, in effect, doubling every single record I try to
process. If I manually set mapred.min.split.size to my block size, I'm
back to one split per file, as I expected.

The input files are gzipped text, and I'm processing them with Hadoop Streaming.
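
For reference, the job is launched roughly like this (the streaming jar path, input/output paths, and mapper/reducer scripts below are placeholders, and the mapred.min.split.size value is just an example block size in bytes; it's the manual workaround I described above):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -D mapred.min.split.size=134217728 \
      -input /path/to/gzipped/input \
      -output /path/to/output \
      -mapper my_mapper.py \
      -reducer my_reducer.py

Without the -D option, each small .gz file shows up as two splits with identical content; with it, I get one split per file.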

Any debugging suggestions?

Cheers,
David
