Ming Ma created HADOOP-11305:
--------------------------------

             Summary: RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
                 Key: HADOOP-11305
                 URL: https://issues.apache.org/jira/browse/HADOOP-11305
             Project: Hadoop Common
          Issue Type: Bug
            Reporter: Ming Ma


This might be a known issue. Given that FileSystemRMStateStore isn't used in HA 
scenarios, it might not be that important, unless there is something we need to 
fix at the RM layer to make it more tolerant of state store issues.

When the RM machine was hard shut down, the OS might not have had a chance to 
persist blocks to disk. Some of the stored application data ended up with size 
zero after reboot, and the RM could not recover from those files.

{noformat}
ls -al /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351
total 156
drwxr-xr-x.    2 x y   4096 Nov 13 16:45 .
drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 ..
-rw-r--r--.    1 x y      0 Nov 13 16:45 appattempt_1412702189634_324351_000001
-rw-r--r--.    1 x y      0 Nov 13 16:45 .appattempt_1412702189634_324351_000001.crc
-rw-r--r--.    1 x y      0 Nov 13 16:45 application_1412702189634_324351
-rw-r--r--.    1 x y      0 Nov 13 16:45 .application_1412702189634_324351.crc
{noformat}
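
One way to make zero-length files like these less likely is to force the data to disk before the file becomes visible under its final name. The following is a minimal sketch using only java.nio, not the FileSystemRMStateStore code; the class name DurableStateWriter, the method writeDurably, and the ".tmp" suffix are hypothetical names chosen for illustration. It does not sync the parent directory, so the rename itself may still be lost on a hard shutdown.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Hadoop.
class DurableStateWriter {

    // Writes data to a temporary sibling file, forces it to the storage
    // device, then renames it onto the target path.
    static void writeDurably(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Ask the OS to flush file content and metadata to the device.
            ch.force(true);
        }
        // Publish the file under its final name only after the data is synced.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-demo");
        Path file = dir.resolve("application_demo");
        writeDurably(file, "state".getBytes());
        System.out.println(file + " size=" + Files.size(file));
    }
}
```

With this pattern a reader sees either the previous file or the fully written new one under the final name, at the cost of an extra sync per write.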


When the RM starts up, it fails to load the stored state:

{noformat}

2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351.  Ignoring exception:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501)

...

2014-11-13 17:40:48,876 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306)
        at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834)

{noformat}
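
As a rough illustration of what "more tolerant" could mean at the loading side, the sketch below skips zero-length state files with a warning instead of passing them on to recovery. It uses plain java.nio rather than the Hadoop FileSystem API, and the class TolerantStateLoader and method loadableFiles are hypothetical names, not existing RM code. Whether silently dropping such an application is acceptable is a separate design question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
class TolerantStateLoader {

    // Returns the state files in appDir that are safe to parse, skipping
    // checksum sidecar files and any file with length zero.
    static List<Path> loadableFiles(Path appDir) throws IOException {
        List<Path> usable = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(appDir)) {
            for (Path p : entries) {
                String name = p.getFileName().toString();
                if (name.startsWith(".") && name.endsWith(".crc")) {
                    continue; // checksum sidecar, not application state
                }
                if (!Files.isRegularFile(p)) {
                    continue;
                }
                if (Files.size(p) == 0) {
                    // A hard shutdown can leave the file created but empty.
                    System.err.println("Skipping zero-length state file: " + p);
                    continue;
                }
                usable.add(p);
            }
        }
        return usable;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rmstore-demo");
        Files.createFile(dir.resolve("application_empty"));
        Files.write(dir.resolve("application_ok"), new byte[] {1});
        System.out.println(loadableFiles(dir));
    }
}
```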



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
