in this case the map data is large enough that in-memory merges proably had no 
effect (but thanks for pointing that out). (the map.out files were about 
256-512MB in size - block compressed sequencefiles).

if we could initiate the on-disk merges - that would be awesome. 

i am curious whether people think this will help the sort benchmark as well?


-----Original Message-----
From: Sameer Paranjpye [mailto:[EMAIL PROTECTED]
Sent: Tue 11/20/2007 2:12 PM
To: [email protected]
Subject: Re: starting merges before shuffle completion
 
Digging some more, it looks like we do the in RAM merges, but don't do 
any merges with the data on disk until the map phase finishes.


Sameer Paranjpye wrote:
> The reduce phase does do merges as it's shuffling. It does a round of 
> in-memory merges because individual map outputs tend to be small enough 
> that several of them can be kept in RAM (if they're too large they're 
> spilt to disk). The results of the in-memory merges are spilt to disk 
> and merged in their turn. The fan-in to the merge is configurable and 
> determines how many merges happen.
> 
> This is how it *ought* to work. Have you observed anything different? We 
> may have a bug or 3 to fix here.
> 
> 
> Joydeep Sen Sarma wrote:
>> Hi folks,
>>
>>  
>>
>> I searched around JIRA and didn't find anything that resembled this. Is
>> this something on the roadmap?
>>
>>  
>>
>> For normal aggregations, this is never an issue. But in some cases
>> (typically joins) - map phase can emit lot of data and take quite a bit
>> of time doing it. Meanwhile the reducers seem to sit around copying data
>> slowly where they could be merging the map-outputs that are already
>> copied over.
>>  
>>
>> Curious whether I have an outlier application or is this generally
>> useful/doable ..
>>
>>  
>>
>> Thx,
>>
>>  
>>
>> Joydeep
>>
>>  
>>
>>
> 



Reply via email to