[ http://issues.apache.org/jira/browse/HADOOP-620?page=all ]
Sameer Paranjpye updated HADOOP-620:
------------------------------------

    Component/s: dfs
    Description:

Currently 'dfs -report' calculates the replication factor as follows:
(totalCapacity - totalDiskRemaining) / (total size of dfs files in the namespace).

The problem is that this includes disk space used by non-dfs files (e.g. map/reduce jobs) on the datanode. On my single-node test, I get a replication factor of 100 because I have a 1 GB dfs file without replication and there is 99 GB of unrelated data on the same volume.

Ideally the namenode should calculate it as:
(total size of all the blocks known to it) / (total size of files in the namespace).

The initial proposal for keeping 'total size of all the blocks' up to date is to track it in the datanode descriptor, update it when the namenode receives block reports from the datanode, and subtract it when the datanode is removed.

> replication factor should be calculated based on actual dfs block sizes at the NameNode.
> -----------------------------------------------------------------------------------------
>
>          Key: HADOOP-620
>          URL: http://issues.apache.org/jira/browse/HADOOP-620
>      Project: Hadoop
>   Issue Type: Bug
>   Components: dfs
>     Reporter: Raghu Angadi
>  Assigned To: Raghu Angadi
>     Priority: Minor
>
> Currently 'dfs -report' calculates the replication factor as follows:
> (totalCapacity - totalDiskRemaining) / (total size of dfs files in the namespace).
> The problem is that this includes disk space used by non-dfs files (e.g. map/reduce jobs) on the datanode. On my single-node test, I get a replication factor of 100 because I have a 1 GB dfs file without replication and there is 99 GB of unrelated data on the same volume.
> Ideally the namenode should calculate it as: (total size of all the blocks known to it) / (total size of files in the namespace).
> The initial proposal for keeping 'total size of all the blocks' up to date is to track it in the datanode descriptor, update it when the namenode receives block reports from the datanode, and subtract it when the datanode is removed.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
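
For illustration only, here is a minimal Java sketch of the two calculations described in the issue, plugged with the numbers from the single-node example (a 1 GB unreplicated dfs file and 99 GB of unrelated data on a 100 GB volume). The class and method names are hypothetical and are not taken from the Hadoop source; the sketch only shows how the disk-usage-based estimate diverges from the block-report-based one.

    // Hypothetical sketch; names are illustrative, not actual NameNode code.
    public class ReplicationFactorSketch {

        /** Current approach: infers used space from raw disk numbers, so any
         *  non-dfs data on the datanode volumes inflates the result. */
        static double fromDiskUsage(long totalCapacity, long totalDiskRemaining,
                                    long totalNamespaceBytes) {
            return (double) (totalCapacity - totalDiskRemaining) / totalNamespaceBytes;
        }

        /** Proposed approach: divide the total size of blocks the namenode
         *  actually knows about (kept up to date from block reports, reduced
         *  when a datanode is removed) by the logical size of files in the
         *  namespace. */
        static double fromBlockReports(long totalBlockBytes, long totalNamespaceBytes) {
            return (double) totalBlockBytes / totalNamespaceBytes;
        }

        public static void main(String[] args) {
            // Numbers from the single-node example in the issue description.
            long gb = 1L << 30;
            long capacity = 100 * gb;        // 100 GB volume
            long remaining = 0;              // 1 GB dfs file + 99 GB unrelated data
            long namespaceBytes = 1 * gb;    // logical size of dfs files
            long blockBytes = 1 * gb;        // blocks reported to the namenode

            System.out.println(fromDiskUsage(capacity, remaining, namespaceBytes)); // ~100.0
            System.out.println(fromBlockReports(blockBytes, namespaceBytes));       // 1.0
        }
    }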