Joydeep Sen Sarma wrote:
> Would be cool to get an option to reduce replication factor for reduce
> outputs.
This should be as simple as setting dfs.replication on a job. If that
does not work, it's a bug.
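A minimal sketch of what that per-job override might look like in the job's configuration (standard Hadoop property syntax; the value 2 here is arbitrary, and whether output replication honors this setting is exactly what would need verifying):

```xml
<!-- In the job configuration (e.g. via JobConf.set or a -jobconf option):
     ask for only 2 replicas of this job's output files instead of the
     filesystem default. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```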
> Hard to buy the argument that there's gonna be no performance win with
> direct streaming between jobs. Currently reduce jobs start reading map
> outputs before all maps are complete - and I am sure this results in
> significant speedup.
That's not exactly true. No reduce can be performed until all maps are
complete. However, the shuffle (the transfer and sort of intermediate
data) happens in parallel with mapping.
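The win from that overlap can be illustrated with a back-of-the-envelope timing model (illustrative numbers only, not Hadoop code): fetching each map's output as soon as that map finishes hides most of the copy time behind the remaining maps.

```python
# Toy timing model of the shuffle overlapping with mapping.
# Assumed numbers: four map waves of 10s each, and 4s to copy
# each wave's output to the reduce side.
map_times = [10, 10, 10, 10]
copy_per_map = 4

# Sequential model: copying starts only after every map has finished.
sequential = sum(map_times) + copy_per_map * len(map_times)

# Overlapped model: each wave's output is copied while later waves
# still run; only the final wave's copy happens after all maps end.
finish = 0          # time the current map wave completes
last_copy_done = 0  # time its output finishes copying
for t in map_times:
    finish += t
    last_copy_done = max(last_copy_done, finish) + copy_per_map
overlapped = last_copy_done

print(sequential, overlapped)  # → 56 44
```

Only the last copy (4s here) remains on the critical path, which is why the overlap matters even though no reduce can run until the shuffle completes.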
> Using the same logic, streaming reduce outputs to
> the next map and reduce steps (before the first reduce is complete)
> should also provide speedup.
Perhaps, but the bookkeeping required in the jobtracker might be onerous.
The failure modes are more complex, complicating recovery.
> If the streaming option were available, the programmer would have a
> clear choice: excellent best case/poor worst case performance with
> streaming or good best case/good worst case performance with hdfs based
> checkpointing. I think this is a choice that the job-writer is competent
> enough to make.
Please feel free to try to implement this. If you can develop a patch
which implements this reliably in maintainable code, then it would
probably be committed.
Doug