[ http://issues.apache.org/jira/browse/HADOOP-64?page=comments#action_12426338 ]

Milind Bhandarkar commented on HADOOP-64:
-----------------------------------------

Proposal:

In the configuration (e.g. hadoop-site.xml), the site admin can specify a 
comma-separated list of volumes as the value of the key "dfs.data.dir". These 
volumes are assumed to be mounted on different disks, so the total disk 
capacity of the datanode is taken to be the sum of the disk capacities of 
these volumes, after taking into account the /dev/sda* or /dev/hda* mapping 
of these volumes (i.e. not counting the same /dev/* device twice).
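
For concreteness, a minimal sketch of turning the comma-separated 
"dfs.data.dir" value into volume directories. This is not the actual patch; 
the class and method names are made up for illustration, and trimming 
whitespace around entries is an assumption:

    import java.io.File;

    public class VolumeConfig {
      // Split the comma-separated "dfs.data.dir" value into one File per
      // volume. Whitespace-trimming around entries is an assumption.
      public static File[] parseDataDirs(String dfsDataDir) {
        String[] parts = dfsDataDir.split(",");
        File[] volumes = new File[parts.length];
        for (int i = 0; i < parts.length; i++) {
          volumes[i] = new File(parts[i].trim());
        }
        return volumes;
      }
    }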

New blocks are created round-robin across these volumes. The block-allocation 
policy is controlled by a separable piece of code, so that different policies 
can be substituted at runtime later. The mapping of data blocks to volume-ids 
is kept in the datanode's memory. When the datanode comes up again, it 
rediscovers this mapping by reading the specified volumes. Later, when the 
datanode is also periodically checkpointed, this mapping will be stored in 
the checkpoint as well.
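
To illustrate the "separable piece of code" for the allocation policy, here 
is a hypothetical interface with a round-robin implementation. All names in 
this sketch are made up; the real signatures are open until implementation:

    import java.io.File;

    // Hypothetical policy hook: a different implementation (e.g. one
    // preferring the volume with the most free space) could be
    // substituted later.
    interface BlockVolumeChoosingPolicy {
      File chooseVolume(File[] volumes);
    }

    // Round-robin policy as proposed above; synchronized so that
    // concurrent block creations rotate through the volumes in order.
    class RoundRobinPolicy implements BlockVolumeChoosingPolicy {
      private int next = 0;
      public synchronized File chooseVolume(File[] volumes) {
        File chosen = volumes[next];
        next = (next + 1) % volumes.length;
        return chosen;
      }
    }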

Each volume is further automatically split into multiple subdirectories. The 
number of these directories is configurable and should be a power of 2, so 
that the last x bits of a block-id determine which subdirectory the block is 
stored in. This is the scheme used in Mike's patch for HADOOP-50.
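
The bit-mask computation that scheme implies would look roughly like this (a 
sketch, not code from the HADOOP-50 patch):

    class SubdirMapper {
      // With numSubdirs a power of two, masking with (numSubdirs - 1)
      // keeps exactly the last x bits of the block id; the result is
      // non-negative even when the block id itself is negative.
      static int subdirIndex(long blockId, int numSubdirs) {
        return (int) (blockId & (numSubdirs - 1));
      }
    }

For example, with numSubdirs = 64, a block with id 1000003 lands in 
subdirectory 1000003 & 63 = 3.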

If a datanode is re-configured with a different number (or different 
locations) of volumes for dfs.data.dir, the blocks stored in the earlier 
locations are considered lost by the datanode (when, in the future, the 
datanode is checkpointed, it will try to recover those "lost" blocks). If one 
of the volumes is read-only, the datanode will currently be considered dead 
only with respect to that volume: it will continue to store blocks on the 
read-write volumes, but blocks on the read-only volume will be considered 
lost, since they cannot be deleted.

Please comment on this proposal ASAP, so that I can go ahead with the 
implementation.


> DataNode should be capable of managing multiple volumes
> -------------------------------------------------------
>
>                 Key: HADOOP-64
>                 URL: http://issues.apache.org/jira/browse/HADOOP-64
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.2.0
>            Reporter: Sameer Paranjpye
>         Assigned To: Milind Bhandarkar
>            Priority: Minor
>             Fix For: 0.6.0
>
>
> The dfs Datanode can only store data on a single filesystem volume. When a 
> node runs its disks as JBOD, this means running one Datanode per disk on 
> the machine. While this scheme works reasonably well on small clusters, on 
> larger installations (several hundred nodes) it implies a very large number 
> of Datanodes, with associated management overhead in the Namenode.
> The Datanode should be enhanced to be able to handle multiple volumes on a 
> single machine.
