[
https://issues.apache.org/jira/browse/HDFS-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tao Jie updated HDFS-11638:
---------------------------
Attachment: HDFS-11638-001.patch
> Support marking a datanode dead by DFSAdmin
> -------------------------------------------
>
> Key: HDFS-11638
> URL: https://issues.apache.org/jira/browse/HDFS-11638
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Jie
> Attachments: HDFS-11638-001.patch
>
>
> We ran into the following situation:
> A kernel error occurred on one slave node, with a message like
> {code}
> Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s!
> [java:19096]
> Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse
> autofs4 bonding ipv6 uinput iTCO_wdt iTCO_vendor_support microcode
> power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core joydev
> i2c_i801 i2c_core lpc_ich mfd_core sg ses enclosure ixgbe dca ptp pps_core
> mdio ext4 jbd2 mbcache sd_mod crc_t10dif ahci megaraid_sas dm_mirror
> dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
> {code}
> The datanode process was still alive and continued to send heartbeats to the
> namenode, but the node could not respond to any command, and reading or
> writing blocks on this datanode would fail. As a result, requests to HDFS
> became slower because of the many read/write timeouts.
> We try to work around this case by adding a dfsadmin command that marks such
> an abnormal datanode as dead by force until it gets restarted. If this
> situation happens again, clients would then avoid accessing the faulty
> datanode.
> Any thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]