Re: starting merges before shuffle completion

Sameer Paranjpye Tue, 20 Nov 2007 14:13:11 -0800

Digging some more, it looks like we do the in-RAM merges, but don't do any merges with the data on disk until the map phase finishes.


Sameer Paranjpye wrote:

The reduce phase does do merges as it's shuffling. It does a round of in-memory merges because individual map outputs tend to be small enough that several of them can be kept in RAM (if they're too large they're spilt to disk). The results of the in-memory merges are spilt to disk and merged in their turn. The fan-in to the merge is configurable and determines how many merges happen.
This is how it *ought* to work. Have you observed anything different? We may have a bug or 3 to fix here.
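[Editor's sketch] To make the relationship between fan-in and the number of merges concrete, here is a minimal Python sketch. It is not Hadoop's implementation: the function name merge_passes and its parameters are hypothetical, and it assumes each map output is an already-sorted list held in memory. It merges at most fan_in sorted runs at a time and repeats until a single run is left, counting the merges performed.

```python
import heapq


def merge_passes(runs, fan_in):
    """Merge sorted runs, at most fan_in at a time, until one run remains.

    Hypothetical helper for illustration only.
    Returns a tuple (merged_run, number_of_merges).
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [list(r) for r in runs]
    merges = 0
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]
            if len(group) == 1:
                # A lone leftover run has nothing to merge with; carry it forward.
                next_runs.append(group[0])
            else:
                # heapq.merge performs a k-way merge of already-sorted inputs.
                next_runs.append(list(heapq.merge(*group)))
                merges += 1
        runs = next_runs
    return (runs[0] if runs else []), merges


if __name__ == "__main__":
    map_outputs = [[1, 4], [2, 5], [3, 6], [0, 7]]
    for fan_in in (2, 3, 4):
        merged, count = merge_passes(map_outputs, fan_in)
        print(fan_in, count, merged)
```

With four sorted runs, a fan-in of 2 needs three merges in this sketch, while a fan-in of 4 needs one: a larger fan-in means fewer merge rounds, at the cost of reading more runs at once.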
Joydeep Sen Sarma wrote:
Hi folks,
I searched around JIRA and didn't find anything that resembled this. Is
this something on the roadmap?
For normal aggregations, this is never an issue. But in some cases
(typically joins) - the map phase can emit a lot of data and take quite a bit
of time doing it. Meanwhile the reducers seem to sit around copying data
slowly where they could be merging the map-outputs that are already
copied over.
Curious whether I have an outlier application or is this generally
useful/doable ..
Thx,
Joydeep
