That's correct. That is why teragen, the program that generates the data to be sorted in terasort, is an MR program :-)

- Milind

On Oct 21, 2010, at 9:47 PM, elton sky wrote:

> Milind,
>
> You are right. But that only happens when your client is one of the data
> nodes in HDFS. otherwise a random node will be picked up for the first
> replica.
>
> On Fri, Oct 22, 2010 at 3:37 PM, Milind A Bhandarkar
> <[email protected]> wrote:
>
>> If a file of say, 12.5 GB were produced by a single task with replication
>> 3, the default replication policy will ensure that the first replica of each
>> block will be created on local datanode. So, there will be one datanode in
>> the cluster that contains one replica of all blocks of that file. Map
>> placement hint specifies that node.
>>
>> It's evil, I know :-)
>>
>> - Milind
>>
>> On Oct 21, 2010, at 1:30 PM, Alex Kozlov wrote:
>>
>>> Hmm, this is interesting: how did it manage to keep the blocks local? Why
>>> performance was better?
>>>
>>> On Thu, Oct 21, 2010 at 11:43 AM, Owen O'Malley <[email protected]> wrote:
>>>
>>>> The block sizes were 2G. The input format made splits that were more than a
>>>> block because that led to better performance.
>>>>
>>>> -- Owen
>>>>
>>
>> --
>> Milind Bhandarkar
>> (mailto:[email protected])
>> (phone: 408-203-5213 W)
>>
>>

--
Milind Bhandarkar
(mailto:[email protected])
(phone: 408-203-5213 W)
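For readers following the thread, the first-replica behaviour discussed above can be illustrated with a small simulation. This is only a sketch of the rule as described in these messages (local datanode if the writer runs on one, otherwise a random node); it is not HDFS code. The node names, the block count of 7, and the helper functions `first_replica_node` and `write_file` are all made up for illustration.

```python
import random
from collections import Counter


def first_replica_node(client, datanodes, rng):
    """Pick the node for a block's first replica (simplified model)."""
    # Writer runs on a datanode: the first replica stays local.
    if client in datanodes:
        return client
    # Writer is outside the cluster: a random datanode is picked.
    return rng.choice(datanodes)


def write_file(num_blocks, client, datanodes, rng):
    """Return the first-replica node chosen for each block of one file."""
    return [first_replica_node(client, datanodes, rng) for _ in range(num_blocks)]


datanodes = [f"dn{i}" for i in range(10)]  # hypothetical node names
rng = random.Random(0)                     # fixed seed, for repeatability only

# A single task on dn3 writes the whole file: dn3 holds a replica of every block.
local_file = write_file(7, "dn3", datanodes, rng)

# A client outside the cluster writes the same file: first replicas scatter.
remote_file = write_file(7, "edge-host", datanodes, rng)

# MR-style generation: one output file per task, each task on its own datanode,
# so every file ends up with one node that holds all of its blocks.
per_task = {dn: write_file(7, dn, datanodes, rng) for dn in datanodes}

print("single task on dn3:", Counter(local_file))
print("off-cluster client:", Counter(remote_file))
```

In this model, generating the input with one task per datanode gives each output file a single node holding a replica of all its blocks, which is the property the map placement hint relies on.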
