That's correct. That is why teragen, the program that generates the data to be
sorted by terasort, is an MR program :-)

- Milind
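
To make the placement rule discussed in this thread concrete, here is a
minimal Python sketch. It is an illustration only, not HDFS code: the node
names and helper functions are hypothetical, and combining the 12.5 GB file
with the 2 GB block size mentioned further down the thread is an assumption
made for the arithmetic.

```python
import random

GB = 1024 ** 3
MB = 1024 ** 2


def block_count(file_bytes, block_bytes):
    """Number of blocks needed for a file (integer ceiling division)."""
    return -(-file_bytes // block_bytes)


def first_replica_node(writer, datanodes, rng=random):
    """Simplified first-replica rule as described in the thread:
    the writer's own node if the writer is a datanode, otherwise
    a randomly chosen datanode."""
    if writer in datanodes:
        return writer
    return rng.choice(sorted(datanodes))


def first_replica_hosts(writer, datanodes, n_blocks, rng=random):
    """Host of the first replica for every block of a single file."""
    return [first_replica_node(writer, datanodes, rng) for _ in range(n_blocks)]


# Hypothetical four-node cluster.
datanodes = {"dn1", "dn2", "dn3", "dn4"}

# 12.5 GB file, assuming 2 GB blocks.
n_blocks = block_count(int(12.5 * GB), 2 * GB)

# Writer runs on a datanode (e.g. a map task): every first replica is local.
local = first_replica_hosts("dn2", datanodes, n_blocks)

# Writer is outside the cluster: first replicas are scattered at random.
remote = first_replica_hosts("outside-client", datanodes, n_blocks, random.Random(0))
```

In the first case a single node holds one replica of every block, which is
what makes a single placement hint useful. In the second case no such node
is guaranteed to exist.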

On Oct 21, 2010, at 9:47 PM, elton sky wrote:

> Milind,
> 
> You are right. But that only happens when your client is one of the data
> nodes in HDFS. Otherwise, a random node will be picked for the first
> replica.
> 
> On Fri, Oct 22, 2010 at 3:37 PM, Milind A Bhandarkar
> <[email protected]> wrote:
> 
>> If a file of, say, 12.5 GB were produced by a single task with replication
>> 3, the default replication policy would ensure that the first replica of each
>> block is created on the local datanode. So there will be one datanode in
>> the cluster that contains one replica of every block of that file. The map
>> placement hint specifies that node.
>> 
>> It's evil, I know :-)
>> 
>> - Milind
>> 
>> On Oct 21, 2010, at 1:30 PM, Alex Kozlov wrote:
>> 
>>> Hmm, this is interesting: how did it manage to keep the blocks local?
>>> Why was performance better?
>>> 
>>> On Thu, Oct 21, 2010 at 11:43 AM, Owen O'Malley <[email protected]> wrote:
>>> 
>>>> The block sizes were 2G. The input format made splits that were more
>>>> than a block because that led to better performance.
>>>> 
>>>> -- Owen
>>>> 
>> 
>> --
>> Milind Bhandarkar
>> (mailto:[email protected])
>> (phone: 408-203-5213 W)
>> 
>> 
>> 
