GitHub user LiShuMing opened a pull request:

    https://github.com/apache/spark/pull/18905

    [SPARK-21660] [YARN] [Shuffle] Yarn ShuffleService failed to start when the 
chosen dir…

    
    ## What changes were proposed in this pull request?
    
    See [SPARK-21660](https://issues.apache.org/jira/browse/SPARK-21660). This 
PR adds a simple check that the chosen disk is writable, so that a read-only 
disk is not selected.
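
A writability check of this kind could look roughly like the sketch below. It is not the code from this PR: the class `RecoveryDirChooser` and the methods `isWritable` and `firstWritable` are hypothetical names chosen for illustration. The idea is to probe each candidate directory by actually creating and deleting a small file, rather than trusting permission bits alone.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Spark or of this PR.
public class RecoveryDirChooser {

  /** Returns true only if a file can really be created and deleted in dir. */
  static boolean isWritable(File dir) {
    if (!dir.isDirectory()) {
      return false;
    }
    try {
      // Probe with a real write: a read-only mount or directory fails here.
      Path probe = Files.createTempFile(dir.toPath(), "probe", ".tmp");
      Files.delete(probe);
      return true;
    } catch (IOException | SecurityException e) {
      return false;
    }
  }

  /** Returns the first candidate that passes the probe, if any. */
  static Optional<File> firstWritable(List<File> candidates) {
    return candidates.stream().filter(RecoveryDirChooser::isWritable).findFirst();
  }

  public static void main(String[] args) {
    // Candidate directories are taken from the command line in this sketch.
    List<File> candidates = new ArrayList<>();
    for (String arg : args) {
      candidates.add(new File(arg));
    }
    Optional<File> chosen = firstWritable(candidates);
    System.out.println(
        chosen.map(f -> "Using " + f).orElse("No writable directory found"));
  }
}
```

Probing with a real write also catches a filesystem that has been remounted read-only, which a plain permission check may miss.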
    
    ## How was this patch tested?
    
    #### How to simulate a corrupted disk?
    > Make the recovery path read-only: 
    > sudo chmod -R 400 /var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle
    
    Before this PR, starting the NodeManager fails with the exception below:
    
    > 2017-08-10 16:30:08,112 INFO  yarn.YarnShuffleService 
(YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for 
Spark
    2017-08-10 16:30:08,112 INFO  containermanager.AuxServices 
(AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, 
"spark_shuffle"
    2017-08-10 16:30:08,218 ERROR util.LevelDBProvider 
(LevelDBProvider.java:initLevelDB(61)) - error opening leveldb file 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb.
  Creating new file, will not be able to recover state for existing applications
    org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK:
 Permission denied
            at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
            at 
org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
            at 
org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
            at 
org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:48)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
            at 
org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
    2017-08-10 16:30:08,220 WARN  util.LevelDBProvider 
(LevelDBProvider.java:initLevelDB(71)) - error deleting 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb
    2017-08-10 16:30:08,220 INFO  service.AbstractService 
(AbstractService.java:noteFailure(272)) - Service spark_shuffle failed in state 
INITED; cause: java.io.IOException: Unable to create state store
    java.io.IOException: Unable to create state store
            at 
org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:77)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:116)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:94)
            at 
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:66)
            at 
org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:167)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:245)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:261)
            at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:495)
            at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
    Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO 
error: 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle/registeredExecutors.ldb/LOCK:
 Permission denied
            at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
            at 
org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
            at 
org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
            at 
org.apache.spark.network.util.LevelDBProvider.initLevelDB(LevelDBProvider.java:75)
            ... 15 more
    
    
    After this PR, the recovery path is detected as unavailable and the service starts:
    
        
    
    > 2017-08-10 16:36:49,101 INFO  yarn.YarnShuffleService 
(YarnShuffleService.java:<init>(136)) - Initializing YARN shuffle service for 
Spark
    2017-08-10 16:36:49,101 INFO  containermanager.AuxServices 
(AuxServices.java:addService(72)) - Adding auxiliary service spark_shuffle, 
"spark_shuffle"
    2017-08-10 16:36:49,102 INFO  yarn.YarnShuffleService 
(YarnShuffleService.java:initRecoveryDb(359)) - Recovery path 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle 
ldb available: false.
    2017-08-10 16:36:49,102 WARN  yarn.YarnShuffleService 
(YarnShuffleService.java:initRecoveryDb(367)) - Recovery path 
/var/log/hadoop-yarn/nodemanager/recovery-state/nm-aux-services/spark_shuffle 
unavailable: set it to null
    2017-08-10 16:36:49,180 INFO  util.LevelDBProvider 
(LevelDBProvider.java:initLevelDB(51)) - Creating state database at 
/mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb
    2017-08-10 16:36:49,317 INFO  util.LevelDBProvider$LevelDBLogger 
(LevelDBProvider.java:log(93)) - Delete type=3 #1
    2017-08-10 16:36:49,548 INFO  yarn.YarnShuffleService 
(YarnShuffleService.java:serviceInit(186)) - Started YARN shuffle service for 
Spark on port 7337. Authentication is not enabled.  Registered executor file is 
/mnt/dfs/0/hadoop/yarn/local/registeredExecutors.ldb

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/LiShuMing/spark SPARK-21660

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18905
    
----
commit d62405dfbbea6ce1e7604721ab1234e5fde5b651
Author: lishuming <alemmont...@126.com>
Date:   2017-08-09T02:45:28Z

    [SPARK-21660] Yarn ShuffleService failed to start when the chosen directory 
become read-only

commit 2077537c52b43c6df050a7afe23a453d09e38db6
Author: lishuming <alemmont...@126.com>
Date:   2017-08-10T08:45:41Z

    Recovery path had already existed but unavailable, set it to null

----

