You probably want this. https://issues.apache.org/jira/browse/HDFS-457
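For reference, the HDFS-457 line of work lets a datanode tolerate a limited number of failed volumes instead of shutting down outright. A sketch of the relevant hdfs-site.xml, assuming a release that includes this feature — the exact property name (dfs.datanode.failed.volumes.tolerated) landed in follow-up work to that ticket, so check the docs for your version:

```xml
<!-- hdfs-site.xml (sketch; property assumes a post-HDFS-457 release) -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <!-- Number of volumes allowed to fail before the datanode shuts
       itself down; the default of 0 reproduces the behavior seen
       in this thread. -->
  <value>1</value>
</property>
```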
Koji

> As we can see here, the datanode shut down. Should one disk entering a
> read-only state really shut down the entire datanode? The datanode
> happily restarts once the disk is unmounted. Just wondering.

On 8/7/09 12:11 PM, "Edward Capriolo" <edlinuxg...@gmail.com> wrote:

> I have a Hadoop 18.3 cluster. Today I got two Nagios alerts. Actually,
> I was excited by one of them because I never had a way to test
> check_hpacucli. Worked first time. NIIIIIIICCEEEEEEEE. (Borat voice.)
>
> -------
> Notification Type: PROBLEM
>
> Service: check_remote_datanode
> Host: nyhadoopdata10.ops.jointhegrid.com
> Address: 10.12.9.20
> State: CRITICAL
>
> Date/Time: Fri Aug 7 18:24:58 GMT 2009
>
> Additional Info:
>
> Connection refused
> -------
>
> Notification Type: PROBLEM
>
> Service: check_hpacucli
> Host: nyhadoopdata10.ops.jointhegrid.com
> Address: 10.12.9.20
> State: CRITICAL
>
> Date/Time: Fri Aug 7 18:23:18 GMT 2009
>
> Additional Info:
>
> CRITICAL Smart Array P400 in Unknown Slot OK/OK/-
> (LD 1: OK [(1I:1:1 OK)] LD 2: OK [(1I:1:2 OK)] LD 3: OK [(1I:1:3 OK)]
> LD 4: OK [(1I:1:4 OK)] LD 5: OK [(2I:1:5 OK)] LD 6: OK [(2I:1:6 OK)]
> LD 7: Failed [(2I:1:7 Failed)] LD 8: OK [(2I:1:8 OK)])
>
> /usr/sbin/hpacucli ctrl all show config detail
>
> physicaldrive 2I:1:7
>   Port: 2I
>   Box: 1
>   Bay: 7
>   Status: Failed
>   Drive Type: Data Drive
>   Interface Type: SAS
>   Size: 1TB
>   Rotational Speed: 7200
>   Firmware Revision: HPD1
>   Serial Number: XXXXXXXXXXXXXXXXXXXXXXXX
>   Model: HP DB1000BABFF
>   PHY Count: 2
>   PHY Transfer Rate: 3.0GBPS, Unknown
>
> The drive was labeled as failed. It was in a read-only state, and
> sections of the drive would produce an I/O error while being read.
>
> The datanode logged:
>
> 2009-08-07 14:20:03,153 WARN org.apache.hadoop.dfs.DataNode: DataNode is shutting down. directory is not writable: /mnt/disk6/dfs/data/current
> 2009-08-07 14:20:03,244 INFO org.apache.hadoop.dfs.DataBlockScanner: Exiting DataBlockScanner thread.
> 2009-08-07 14:20:03,430 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-2520177395705282298_448274 received exception java.io.IOException: Read-only file system
> 2009-08-07 14:20:03,430 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.12.9.20:50010, storageID=DS-520493036-10.12.9.20-50010-1239749144772, infoPort=50075, ipcPort=50020):DataXceiver:
> java.io.IOException: Read-only file system
>     at java.io.UnixFileSystem.createFileExclusively(Native Method)
>     at java.io.File.createNewFile(File.java:883)
>     at org.apache.hadoop.dfs.FSDataset$FSVolume.createTmpFile(FSDataset.java:388)
>     at org.apache.hadoop.dfs.FSDataset$FSVolume.createTmpFile(FSDataset.java:359)
>     at org.apache.hadoop.dfs.FSDataset.createTmpFile(FSDataset.java:1050)
>     at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:983)
>     at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:2382)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1234)
>     at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1092)
>     at java.lang.Thread.run(Thread.java:619)
>
> 2009-08-07 14:20:32,974 INFO org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.12.9.20:50010, storageID=DS-520493036-10.12.9.20-50010-1239749144772, infoPort=50075, ipcPort=50020):Finishing DataNode in: FSDataset{dirpath='/mnt/disk0/dfs/data/current,/mnt/disk1/dfs/data/current,/mnt/disk2/dfs/data/current,/mnt/disk3/dfs/data/current,/mnt/disk4/dfs/data/current,/mnt/disk5/dfs/data/current,/mnt/disk6/dfs/data/current,/mnt/disk7/dfs/data/current'}
> 2009-08-07 14:20:32,974 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50075]
> 2009-08-07 14:20:32,977 INFO org.mortbay.http.SocketListener: Stopped SocketListener on 0.0.0.0:50075
> 2009-08-07 14:20:33,155 INFO org.mortbay.util.Container: Stopped HttpContext[/static,/static]
> 2009-08-07 14:20:33,293 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs]
> 2009-08-07 14:20:33,293 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.servlet.webapplicationhand...@7444f787
> 2009-08-07 14:20:33,432 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/]
> 2009-08-07 14:20:33,432 INFO org.mortbay.util.Container: Stopped org.mortbay.jetty.ser...@b035079
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: Stopping server on 50020
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.dfs.DataNode: Waiting for threadgroup to exit, active threads is 0
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: exiting
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 50020
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: exiting
> 2009-08-07 14:20:33,432 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: exiting
> 2009-08-07 14:20:33,434 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down DataNode at nyhadoopdata10/10.12.9.20
> ************************************************************/
>
> As we can see here, the datanode shut down. Should one disk entering a
> read-only state really shut down the entire datanode? The datanode
> happily restarts once the disk is unmounted. Just wondering.
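Since the workaround in this thread was to spot and unmount the read-only volume, one cheap monitoring angle is to scan the kernel's mount table for volumes that have been remounted read-only. A minimal sketch (mount points like /mnt/disk6 are from this cluster's layout; adapt the check to your own):

```shell
# Print mount points whose option list contains the "ro" flag.
# In /proc/mounts, field 2 is the mount point and field 4 the
# comma-separated mount options; match "ro" only as a whole option
# so "errors=remount-ro" does not trigger a false positive.
awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }' /proc/mounts
```

On the node above this should have flagged /mnt/disk6 before the datanode died, and it is easy to wrap as another Nagios check alongside check_hpacucli.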