Tao Jie created HDFS-11638:
------------------------------

             Summary: Support marking a datanode dead by DFSAdmin
                 Key: HDFS-11638
                 URL: https://issues.apache.org/jira/browse/HDFS-11638
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Tao Jie


We have run into the following situation:
A kernel error occurred on one slave node, with messages like
{code}
Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s! 
[java:19096]
Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse autofs4 
bonding ipv6 uinput iTCO_wdt iTCO_vendor_support microcode power_meter 
acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core joydev i2c_i801 i2c_core 
lpc_ich mfd_core sg ses enclosure ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache 
sd_mod crc_t10dif ahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod 
[last unloaded: speedstep_lib]
{code}
The datanode process was still alive and continued to send heartbeats to the 
namenode, but the node could not respond to any command, and reading or 
writing blocks on this datanode would fail. As a result, requests to HDFS 
became slow because of the many read/write timeouts.
We would like to work around this case by adding a dfsadmin command that 
forcibly marks such an abnormal datanode as dead until it is restarted. When 
this situation happens again, clients would then avoid accessing the faulty 
datanode.
Any thoughts?
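To make the idea concrete, here is a minimal sketch of the namenode-side state such a command could maintain. All names here (ForcedDeadTracker, markDead, onRegister, isAlive) are hypothetical, not existing HDFS APIs: the point is that a force-marked node is treated as dead even while heartbeats still arrive, and the mark is cleared when the datanode re-registers after a restart.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; names do not match actual HDFS classes.
public class ForcedDeadTracker {
    private final Set<String> forcedDead = ConcurrentHashMap.newKeySet();

    // Entry point for the proposed dfsadmin command.
    public void markDead(String datanodeId) {
        forcedDead.add(datanodeId);
    }

    // Called when the datanode re-registers after a restart,
    // clearing the forced-dead mark.
    public void onRegister(String datanodeId) {
        forcedDead.remove(datanodeId);
    }

    // Liveness check consulted for block placement and client reads:
    // a force-marked node is dead even if its heartbeats are fresh.
    public boolean isAlive(String datanodeId, boolean heartbeatFresh) {
        return heartbeatFresh && !forcedDead.contains(datanodeId);
    }
}
{code}
A real implementation would need to wire this through DatanodeManager and the dfsadmin RPC path, but the core state is just a set checked alongside the heartbeat-based liveness test.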




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
