[ http://issues.apache.org/jira/browse/HADOOP-64?page=comments#action_12426764 ]

Milind Bhandarkar commented on HADOOP-64:
-----------------------------------------
Thanks for your inputs Yoram, Konstantin, Bryan, and Sameer. Here is my modified proposal:

1. The config parameter dfs.data.dir may contain a list of directories separated by commas.

2. Another config parameter (client.buffer.dir) will contain a comma-separated list of directories for buffering blocks until they are sent to the datanode. The DFS client will manage the in-memory map of blocks to these directories.

3. The datanode will maintain an in-memory map of block IDs to storage locations.

4. The datanode will choose an appropriate location to write a block based on a separate block-to-volume placement strategy. Information about the volumes will be made available to this strategy via DF. (A rough sketch of one possible strategy is attached after the quoted issue text below.)

5. The datanode will try to report the correct available disk space by appropriately taking into account the space reported by DF on each volume. If more than one volume shares the same mount point, the available disk space will not be counted twice.

6. The Storage-ID will be unique per datanode and will be stored at the top level of each volume.

7. Each volume will further be divided into a shallow directory hierarchy, with a maximum of N blocks per directory. This block-to-directory mapping will also be maintained in a hashtable by the datanode. As a directory fills up, a new directory will be created as a sibling, up to a maximum of N siblings; then a second level of directories will start. The parameter N can be specified via the config variable "dfs.data.numdir". (See the second sketch at the end for one reading of this layout.)

8. Only if all the volumes specified in dfs.data.dir are read-only will the datanode shut down. Otherwise, it will log the read-only directories and treat them as if they had never been specified in the dfs.data.dir list. This behavior is consistent with the current implementation.

If there are any other issues to think about, please comment.

> DataNode should be capable of managing multiple volumes
> --------------------------------------------------------
>
>                 Key: HADOOP-64
>                 URL: http://issues.apache.org/jira/browse/HADOOP-64
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.2.0
>            Reporter: Sameer Paranjpye
>         Assigned To: Milind Bhandarkar
>            Priority: Minor
>             Fix For: 0.6.0
>
>
> The dfs Datanode can only store data on a single filesystem volume. When a
> node runs its disks JBOD this means running a Datanode per disk on the
> machine. While the scheme works reasonably well on small clusters, on larger
> installations (several hundred nodes) it implies a very large number of
> Datanodes with associated management overhead in the Namenode.
> The Datanode should be enhanced to be able to handle multiple volumes on a
> single machine.
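For illustration only, here is a minimal sketch of points 1, 4, and 8: parsing a comma-separated dfs.data.dir value into volumes, skipping read-only directories, and one possible block-to-volume placement strategy (pick the volume with the most free space). The class and method names (FSVolume, VolumeChooser, chooseVolume) are hypothetical and not from any patch, and File.getUsableSpace() is used here only as a stand-in for the DF information mentioned in point 4.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: one FSVolume per directory listed in dfs.data.dir.
    class FSVolume {
        private final File dir;

        FSVolume(File dir) {
            this.dir = dir;
        }

        // Stand-in for what DF would report as available space on this volume.
        long getAvailable() {
            return dir.getUsableSpace();
        }

        File getDir() {
            return dir;
        }
    }

    class VolumeChooser {
        private final List<FSVolume> volumes = new ArrayList<FSVolume>();

        // Parse the comma-separated dfs.data.dir value (point 1),
        // skipping read-only or missing directories (point 8).
        VolumeChooser(String dfsDataDir) {
            for (String path : dfsDataDir.split(",")) {
                File dir = new File(path.trim());
                if (dir.isDirectory() && dir.canWrite()) {
                    volumes.add(new FSVolume(dir));
                } else {
                    System.err.println("Skipping read-only or missing dir: " + path);
                }
            }
            if (volumes.isEmpty()) {
                // Only if every configured volume is unusable does the datanode give up.
                throw new IllegalStateException("No writable volumes in dfs.data.dir");
            }
        }

        // One possible block-to-volume placement strategy (point 4):
        // choose the volume with the most free space that can hold the block.
        FSVolume chooseVolume(long blockSize) {
            FSVolume best = null;
            for (FSVolume v : volumes) {
                if (v.getAvailable() >= blockSize
                        && (best == null || v.getAvailable() > best.getAvailable())) {
                    best = v;
                }
            }
            if (best == null) {
                throw new IllegalStateException("No volume has " + blockSize + " bytes free");
            }
            return best;
        }
    }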

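And a second sketch, showing one reading of the shallow directory hierarchy in point 7: at most N blocks per directory, up to N sibling directories at a level, then a second level is started. N would come from the proposed "dfs.data.numdir" config variable. BlockDirectoryManager and the "subdir" naming are hypothetical; the actual layout and naming would be decided in the implementation.

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of point 7: place at most N blocks per directory,
    // creating up to N sibling directories before descending a level.
    class BlockDirectoryManager {
        private final int maxPerDir;                  // N, from "dfs.data.numdir"
        private final Map<Long, File> blockToDir =
                new HashMap<Long, File>();            // hashtable kept by the datanode
        private File currentDir;
        private int blocksInCurrentDir = 0;
        private int siblingsAtCurrentLevel = 1;

        BlockDirectoryManager(File volumeRoot, int maxPerDir) {
            this.maxPerDir = maxPerDir;
            this.currentDir = new File(volumeRoot, "subdir0");
            this.currentDir.mkdirs();
        }

        // Returns the directory in which the given block should be written.
        synchronized File directoryFor(long blockId) {
            if (blocksInCurrentDir >= maxPerDir) {
                if (siblingsAtCurrentLevel < maxPerDir) {
                    // Current directory is full: create the next sibling.
                    currentDir = new File(currentDir.getParentFile(),
                                          "subdir" + siblingsAtCurrentLevel);
                    siblingsAtCurrentLevel++;
                } else {
                    // All N siblings are full: start a second level of directories.
                    currentDir = new File(currentDir, "subdir0");
                    siblingsAtCurrentLevel = 1;
                }
                currentDir.mkdirs();
                blocksInCurrentDir = 0;
            }
            blocksInCurrentDir++;
            blockToDir.put(blockId, currentDir);      // block-to-directory mapping (point 7)
            return currentDir;
        }
    }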