We have relatively heavyweight objects that we pass around the cluster
for our map/reduce tasks.
We have noticed that when we are using the multithreaded mapper, we
don't get very high CPU or disk utilization.
On investigating, we discovered that the entirety of next(key, value)
and the entirety of write(key, value) are synchronized on the file
object.
This causes all the threads to back up on serialization/deserialization.
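To illustrate, the write path has roughly this shape (a simplified
sketch of the pattern we're seeing, not the actual Hadoop source; the
class and field names are made up):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.hadoop.io.Writable;

    // Simplified sketch of the shape we are seeing -- not the actual
    // Hadoop source.  The lock spans the expensive serialization, so
    // only one thread makes progress at a time.
    public class WholeCallLocked {
        private final DataOutputStream file;  // the shared file object

        public WholeCallLocked(OutputStream raw) {
            this.file = new DataOutputStream(raw);
        }

        public void write(Writable key, Writable value) throws IOException {
            synchronized (file) {  // serialization runs under the lock
                key.write(file);
                value.write(file);
            }
        }
    }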
Before we start coding, are there any current patches floating around
to shrink this critical section? It is pretty straightforward for
write, but not so simple for next.
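For write, what we have in mind is roughly the following (a rough
sketch under our assumptions; NarrowLockWriter and appendRaw are
hypothetical names, not existing Hadoop API): serialize into
per-thread buffers with no lock held, then take the lock only for the
raw byte copy.

    import java.io.IOException;

    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.Writable;

    // Hypothetical sketch of the write-side fix: serialize each record
    // into a per-thread buffer outside the lock, then hold the lock
    // only long enough to copy raw bytes to the file.
    public class NarrowLockWriter {

        // Per-thread scratch buffer so serialization needs no shared state.
        private static final ThreadLocal<DataOutputBuffer> BUF =
            new ThreadLocal<DataOutputBuffer>() {
                protected DataOutputBuffer initialValue() {
                    return new DataOutputBuffer();
                }
            };

        private final Object file = new Object();  // stands in for the file object

        public void write(Writable key, Writable value) throws IOException {
            DataOutputBuffer buf = BUF.get();
            buf.reset();
            key.write(buf);                 // expensive part, no lock held
            int keyLen = buf.getLength();
            value.write(buf);

            synchronized (file) {           // critical section is just a byte copy
                appendRaw(buf.getData(), keyLen, buf.getLength() - keyLen);
            }
        }

        // Hypothetical hook that appends pre-serialized key/value bytes
        // (plus length prefixes) to the file; a patch would have to add
        // something like this to the writer.
        private void appendRaw(byte[] data, int keyLen, int valLen)
                throws IOException {
            // write length prefixes + raw bytes to the underlying stream
        }
    }

For next, the analogous trick would mean reading the raw record bytes
under the lock and deserializing them outside it, which requires
knowing the record boundaries before deserialization -- that is where
it gets less simple.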
We run multithreaded mappers because we have more CPUs than disk arms
on our cluster machines, and some of our tasks are inherently
threaded, so we can't just raise the maximum number of tasks per node
instead.
Thanks -- Jason