[
https://issues.apache.org/jira/browse/HDFS-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007330#comment-13007330
]
Wang Xu commented on HDFS-1362:
-------------------------------
@Sanjay,
Most SATA controllers support hotswap, and all SATA devices support it (see the
libata wiki at https://ata.wiki.kernel.org ).
As for the operational issue: many servers have a per-disk status LED, and some
of these can be programmed, so the management system can use them to identify
the failed disk. Without such a status indication, it is indeed hard for
maintainers to find the right disk.
My assumed workflow is:
# Manually replace the disk.
# Find the new device and enable it, then make a local filesystem on it, mount
it, and create the essential directories (see the sketch after this list). This
step could be done by an external management system or manually.
# Re-enable the disk in Hadoop.
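For step 2, here is a minimal sketch of what an external management tool might do
once the new filesystem is mounted. The mount point /data/disk3 and the dfs/data
subdirectory are illustrative assumptions, not paths taken from the patch:
{code:java}
import java.io.File;
import java.io.IOException;

// Prepares a freshly mounted volume so the datanode can later adopt it.
// All paths here are illustrative assumptions, not from the patch.
public class PrepareVolume {
  public static void main(String[] args) throws IOException {
    File mountPoint = new File("/data/disk3");        // hypothetical mount point of the new disk
    if (!mountPoint.isDirectory()) {
      throw new IOException(mountPoint + " is not mounted");
    }
    File dataDir = new File(mountPoint, "dfs/data");  // directory that would be listed in dfs.data.dir
    if (!dataDir.exists() && !dataDir.mkdirs()) {
      throw new IOException("could not create " + dataDir);
    }
    // The datanode user needs read/write access; ownership changes
    // (e.g. chown to the hdfs user) are left to the admin or tooling.
    if (!dataDir.canWrite()) {
      throw new IOException(dataDir + " is not writable by this user");
    }
    System.out.println("volume ready: " + dataDir);
  }
}
{code}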
@Bharath,
Thanks for the code review. recoverTransitionRead and
recoverTransitionAdditionalRead are almost the same except for the writeAll()
call at the end; when we add additional disks, we should not writeAll(). Should
we split recoverTransitionRead into different parts and reuse them? A rough
sketch of such a split follows.
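One possible shape for the split, as a sketch only: the helper name
doRecoverTransition and the parameter lists below are simplified placeholders,
not the actual DataStorage signatures from the patch.
{code:java}
// Common recovery logic factored out; only normal startup persists the
// VERSION files with writeAll().
void recoverTransitionRead(NamespaceInfo nsInfo, Collection<File> dataDirs,
                           StartupOption startOpt) throws IOException {
  doRecoverTransition(nsInfo, dataDirs, startOpt);
  this.writeAll();   // persist VERSION files on normal startup
}

void recoverTransitionAdditionalRead(NamespaceInfo nsInfo, Collection<File> dataDirs,
                                     StartupOption startOpt) throws IOException {
  doRecoverTransition(nsInfo, dataDirs, startOpt);
  // no writeAll() here: adding a volume should not rewrite the VERSION
  // files of the volumes that are already active
}

// Shared part: analyze each storage directory, recover from an interrupted
// upgrade/rollback, apply the required transition, and register the directory.
private void doRecoverTransition(NamespaceInfo nsInfo, Collection<File> dataDirs,
                                 StartupOption startOpt) throws IOException {
  for (File dir : dataDirs) {
    StorageDirectory sd = new StorageDirectory(dir);
    StorageState curState = sd.analyzeStorage(startOpt);
    sd.doRecover(curState);
    doTransition(sd, nsInfo, startOpt);
    addStorageDir(sd);
  }
}
{code}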
> Provide volume management functionality for DataNode
> ----------------------------------------------------
>
> Key: HDFS-1362
> URL: https://issues.apache.org/jira/browse/HDFS-1362
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: data-node
> Affects Versions: 0.23.0
> Reporter: Wang Xu
> Assignee: Wang Xu
> Fix For: 0.23.0
>
> Attachments: DataNode Volume Refreshment in HDFS-1362.pdf,
> HDFS-1362.4_w7001.txt, HDFS-1362.5.patch, HDFS-1362.6.patch,
> HDFS-1362.7.patch, HDFS-1362.txt, Provide_volume_management_for_DN_v1.pdf
>
>
> The current management unit in Hadoop is a node, i.e. if a node fails, it
> will be kicked out and all the data on the node will be re-replicated.
> As almost all SATA controllers support hotplug, we add a new command line
> interface to the datanode, so it can list, add or remove a volume online, which
> means we can change a disk without decommissioning the node. Moreover, if the
> failed disk is still readable and the node has enough space, it can migrate the
> data on that disk to other disks in the same node.
> A more detailed design document will be attached.
> The original version in our lab is implemented against the 0.20 datanode
> directly; would it be better to implement it in contrib? Or is there any other
> suggestion?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira