Choose the replication factor according to the level of availability and read performance you need. HDFS attempts to distribute the replicas of each block so that they reside across multiple racks. HDFS block replication is *purely* block-based and file-agnostic; i.e., blocks belonging to the same file are handled in precisely the same way as blocks belonging to different files. Because placement is decided per block rather than per file, the blocks of a sufficiently large file can end up spread over many of the nodes in the grid.
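To put rough numbers on the 100-node, replication-6 question quoted below, here is a minimal sketch of how many distinct nodes a file might touch as its block count grows. It assumes each block's replicas land on nodes chosen uniformly at random, which is a simplification: it ignores HDFS's rack-aware placement and any preference for the writer's own node. The function names and the block counts are hypothetical and only for illustration.

```python
import random


def expected_nodes_touched(num_nodes, replication, num_blocks):
    """Expected number of distinct nodes holding at least one replica.

    Assumption: every block independently picks `replication` distinct
    nodes uniformly at random (not the real rack-aware policy).
    """
    # Probability that a given node receives no replica of one block.
    p_miss_one_block = 1 - replication / num_nodes
    # The node is untouched only if it misses every block of the file.
    return num_nodes * (1 - p_miss_one_block ** num_blocks)


def simulate_nodes_touched(num_nodes, replication, num_blocks, seed=0):
    """Monte Carlo version of the same simplified placement model."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_blocks):
        # One block: `replication` distinct nodes chosen at random.
        touched.update(rng.sample(range(num_nodes), replication))
    return len(touched)


if __name__ == "__main__":
    for blocks in (1, 10, 50, 200):
        estimate = expected_nodes_touched(100, 6, blocks)
        sampled = simulate_nodes_touched(100, 6, blocks)
        print(f"{blocks:4d} blocks: expected {estimate:6.1f}, simulated {sampled}")
```

Under this simplified model a single-block file lives on exactly 6 nodes, while a file with a few hundred blocks touches practically every node. The real placement policy will shift the exact figures, but the per-block behaviour is what drives the spread.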
Hope this helps,
dhruba

Also, are there any metrics or best practices around what the replication factor should be based on the number of nodes in the grid? Does HDFS attempt to involve all nodes in the grid in replication? In other words, if I have 100 nodes in my grid and a replication factor of 6, will all 100 nodes wind up storing data for a given file, assuming the file is large enough?

Thanks,
C G