Digging some more, it looks like we do the in RAM merges, but don't do
any merges with the data on disk until the map phase finishes.
Sameer Paranjpye wrote:
The reduce phase does do merges as it's shuffling. It does a round of
in-memory merges because individual map outputs tend to be small enough
that several of them can be kept in RAM (if they're too large they're
spilt to disk). The results of the in-memory merges are spilt to disk
and merged in their turn. The fan-in to the merge is configurable and
determines how many merges happen.
This is how it *ought* to work. Have you observed anything different? We
may have a bug or 3 to fix here.
Joydeep Sen Sarma wrote:
Hi folks,
I searched around JIRA and didn't find anything that resembled this. Is
this something on the roadmap?
For normal aggregations, this is never an issue. But in some cases
(typically joins) - map phase can emit lot of data and take quite a bit
of time doing it. Meanwhile the reducers seem to sit around copying data
slowly where they could be merging the map-outputs that are already
copied over.
Curious whether I have an outlier application or is this generally
useful/doable ..
Thx,
Joydeep