"Full" replication is a good idea, but I suggest we file it as a new bug/enhancement.

Actually placing a copy of a file on every node is probably rarely the right thing to do for "full" replication. One copy per switch would be my preferred default on our clusters (gigabit switches), and for .JAR files sqrt(numNodes) is probably the right answer.
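Something like this back-of-the-envelope sketch is what I have in mind (the class and method names are just illustrative, not anything in the patch):

    // Hypothetical helper: cap replication for widely-read files such as
    // job.jar at sqrt(numNodes) instead of placing a copy on every node.
    public class ReplicationHeuristic {

      /** Suggested replication count for hot, read-mostly files. */
      public static short forJarFiles(int numNodes) {
        int r = (int) Math.ceil(Math.sqrt(numNodes));
        return (short) Math.max(1, r);
      }

      public static void main(String[] args) {
        // e.g. a 900-node cluster gets 30 copies rather than 900
        System.out.println(forJarFiles(900));
      }
    }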

e14

On Apr 8, 2006, at 12:16 PM, Bryan Pendleton (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-51?page=comments#action_12373745 ]

Bryan Pendleton commented on HADOOP-51:
---------------------------------------

Great!

A few comments from reading the patch (haven't tested it yet):
1) The <description> for dfs.replication.min is wrong
2) This is a wider concern, but on coding style: the idiom of conf.getType("config.value", defaultValue) is good for user-defined values, but shouldn't the default be skipped for things that are defined in hadoop-default.xml, in general? It takes away the value of hadoop-default, and it also means changing that value might or might not have the desired system-wide results.
3) Wouldn't it be better to log, at a severe level, replications that are set below minReplication or above maxReplication, and just set the replication to the nearest bound (roughly as in the sketch after this list)? Replication is set per-file by the application, but min and max are probably set by the administrator of the hadoop cluster. Throwing an IOException causes failure where degraded performance would be preferable.
4) I may be dense, but I didn't see any way to specify that replication be "full", ie, a copy per datanode. I got the feeling this was something that was desired of this functionality (ie, for job.jar files, job configs, and lookup data used widely in a job). Using a short means that, if we ever scale to > 32k nodes, there'd be no way to manually specify this, and just using Short.MAX_VALUE means getting a lot of errors about not being able to replicate as fully as desired.
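To illustrate point 3, here is a rough sketch of the clamping I have in mind (the dfs.replication.max name and the logger setup are my assumptions, not code from the patch):

    import java.util.logging.Level;
    import java.util.logging.Logger;

    class ReplicationBounds {
      private static final Logger LOG = Logger.getLogger("org.apache.hadoop.dfs");

      /** Clamp an application-supplied replication count to the admin-set
       *  bounds and log at a severe level, rather than throwing IOException. */
      static short clamp(short requested, short minReplication, short maxReplication) {
        if (requested < minReplication) {
          LOG.log(Level.SEVERE, "Replication " + requested
              + " is below dfs.replication.min=" + minReplication + "; using the minimum.");
          return minReplication;
        }
        if (requested > maxReplication) {
          LOG.log(Level.SEVERE, "Replication " + requested
              + " exceeds dfs.replication.max=" + maxReplication + "; using the maximum.");
          return maxReplication;
        }
        return requested;
      }
    }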

Otherwise, this looks like a wonderful patch!

per-file replication counts
---------------------------

         Key: HADOOP-51
         URL: http://issues.apache.org/jira/browse/HADOOP-51
     Project: Hadoop
        Type: New Feature

  Components: dfs
    Versions: 0.2
    Reporter: Doug Cutting
    Assignee: Konstantin Shvachko
     Fix For: 0.2
 Attachments: Replication.patch

It should be possible to specify different replication counts for different files. Perhaps an option when creating a new file should be the desired replication count. MapReduce should take advantage of this feature so that job.xml and job.jar files, which are frequently accessed by lots of machines, are more highly replicated than large data files.
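As a sketch of what such an option might look like at create time (the interface and signatures below are purely hypothetical, not an existing Hadoop API):

    import java.io.IOException;
    import java.io.OutputStream;

    // Hypothetical client-facing interface, only to illustrate the proposal.
    interface DfsCreateSketch {
      /** Create a file with the cluster-wide default replication. */
      OutputStream create(String path) throws IOException;

      /** Create a file with an explicit per-file replication count, e.g. a
       *  higher count for job.xml / job.jar than for large data files. */
      OutputStream create(String path, short replication) throws IOException;
    }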

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

