More to the specific point, yes, all 100 nodes will wind up storing data for large files, because block placement is essentially random across the cluster.
The exception is files that originate on a datanode. There, the local node gets one copy of each block. Replica blocks follow the random rule, however, so you wind up in the same place in the end.

On 12/10/07 1:10 PM, "dhruba Borthakur" <[EMAIL PROTECTED]> wrote:

> The replication factor should be such that it can provide some level of
> availability and performance. HDFS attempts to distribute replicas of a
> block so that they reside across multiple racks. HDFS block replication
> is *purely* block-based and file-agnostic; i.e. blocks belonging to the
> same file are handled precisely the same way as blocks belonging to
> different files.
>
> Hope this helps,
> dhruba
>
> Also, are there any metrics or best practices around what the
> replication factor should be based on the number of nodes in the grid?
> Does HDFS attempt to involve all nodes in the grid in replication? In
> other words, if I have 100 nodes in my grid, and a replication factor of
> 6, will all 100 nodes wind up storing data for a given file, assuming the
> file is large enough?
>
> Thanks,
> C G
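
If it helps, here is a minimal sketch of how a client can control the replication factor per file through the Hadoop FileSystem API. The path, the factor of 6, and the class name are just illustrative assumptions; the cluster-wide default normally lives in hdfs-site.xml as dfs.replication.

    // Minimal sketch (not from the thread): setting a per-file replication
    // factor via the Hadoop FileSystem API. Path and factors are assumptions.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default for new files (normally set in hdfs-site.xml).
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/cg/big-data.txt");  // hypothetical path

            // Create a file with an explicit replication factor of 6; the
            // namenode spreads the 6 replicas of each block across datanodes
            // (and racks), independently of which file the block belongs to.
            FSDataOutputStream out = fs.create(file, (short) 6);
            out.writeUTF("example payload");
            out.close();

            // Replication can also be changed after the fact; the namenode
            // schedules re-replication of existing blocks asynchronously.
            fs.setReplication(file, (short) 4);

            fs.close();
        }
    }

From the command line, "hadoop dfs -setrep 6 /user/cg/big-data.txt" does the same kind of after-the-fact change.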