Hi.
I believe someone posted about this a while back, but it's worth
mentioning again.
I just ran a job on our 10-node cluster where the input data was
~70 empty sequence files. With our default settings this launched
~200 mappers and ~70 reducers.
The job took almost exactly two minutes to finish, even though there
was no data to process.
How can we reduce this overhead?
* Pick the number of mappers and reducers more dynamically, depending
on the size of the input? (see the first sketch below)
* JVM reuse: one JVM per job instead of one per task? (see the second
sketch below)
Any other ideas?
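
For the first idea, here's a minimal sketch against the old
org.apache.hadoop.mapred API: it sizes the reducer count from the total
input bytes. The BYTES_PER_REDUCER target is a made-up number for
illustration, not anything Hadoop ships with.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerSizing {
        // Hypothetical target: roughly one reducer per GB of input.
        private static final long BYTES_PER_REDUCER = 1024L * 1024 * 1024;

        public static void setReducersByInputSize(JobConf conf, Path input)
                throws IOException {
            FileSystem fs = input.getFileSystem(conf);
            // Total bytes under the input path.
            long totalBytes = fs.getContentSummary(input).getLength();
            // Never go below one reducer, even for empty input.
            int reducers = (int) Math.max(1, totalBytes / BYTES_PER_REDUCER);
            conf.setNumReduceTasks(reducers);
        }
    }

The mapper count is mostly driven by the number of input files/splits,
so for lots of tiny files an input format that packs several files into
one split (e.g. MultiFileInputFormat, if your version has it) should
help more than any setNumMapTasks hint.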
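
For the JVM reuse idea, a sketch assuming a Hadoop version that
supports the mapred.job.reuse.jvm.num.tasks property (I don't think
it's in every release, so treat this as version-dependent):

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuse {
        public static void enableJvmReuse(JobConf conf) {
            // -1 lets a task JVM be reused for an unlimited number of
            // tasks of the same job (run sequentially on a node),
            // instead of forking a fresh JVM per task.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        }
    }

Note that this reuses JVMs across tasks of the same job on a given
node; it isn't literally one JVM per job, but it should kill most of
the per-task startup cost.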
/Johan