Tao Jie created HDFS-11638:
------------------------------
Summary: Support marking a datanode dead by DFSAdmin
Key: HDFS-11638
URL: https://issues.apache.org/jira/browse/HDFS-11638
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Tao Jie
We have encountered the following situation: a kernel error occurred on one
slave node, with a message like:
{code}
Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s!
[java:19096]
Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse autofs4
bonding ipv6 uinput iTCO_wdt iTCO_vendor_support microcode power_meter
acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core joydev i2c_i801 i2c_core
lpc_ich mfd_core sg ses enclosure ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache
sd_mod crc_t10dif ahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[last unloaded: speedstep_lib]
{code}
The datanode process was still alive and continued to send heartbeats to the
namenode, but the node could not respond to any command, and reading or
writing blocks on this datanode would fail. As a result, requests to HDFS
became slower because of the many read/write timeouts.
We would like to work around this case by adding a dfsadmin command that marks
such an abnormal datanode as dead by force until it gets restarted. When this
case happens again, it would prevent clients from accessing the faulty
datanode.
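To make the proposal concrete, below is a rough sketch of what the
client-side handler for such a subcommand could look like. Everything new
here is hypothetical: the subcommand name -markDeadDatanode, the RPC method
markDatanodeDead(), and the MarkDeadProtocol interface do not exist in HDFS
today, and the real wiring (protobuf definition, NameNodeRpcServer,
DatanodeManager changes) is omitted.
{code}
import java.io.IOException;

public class MarkDeadDatanodeCommand {

  /** Hypothetical slice of the NameNode RPC surface this issue would add. */
  interface MarkDeadProtocol {
    /** Force the given datanode to be treated as dead until it re-registers. */
    void markDatanodeDead(String datanodeHostPort) throws IOException;
  }

  private final MarkDeadProtocol namenode;

  MarkDeadDatanodeCommand(MarkDeadProtocol namenode) {
    this.namenode = namenode;
  }

  /** Mirrors the usual DFSAdmin pattern: parse the argument, call the RPC. */
  int run(String[] args) {
    if (args.length != 1) {
      System.err.println("Usage: dfsadmin -markDeadDatanode <host:port>");
      return -1;
    }
    try {
      namenode.markDatanodeDead(args[0]);
      System.out.println("Marked " + args[0] + " as dead.");
      return 0;
    } catch (IOException e) {
      System.err.println("Failed to mark " + args[0] + " as dead: " + e.getMessage());
      return -1;
    }
  }
}
{code}
On the NameNode side, one plausible way to implement such an RPC is to flag
the datanode's descriptor so that the dead-node check treats it as expired
until the datanode re-registers, which matches the "dead until restarted"
semantics described above.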
Any thoughts?