[
https://issues.apache.org/jira/browse/SPARK-13514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208387#comment-15208387
]
Satish Kolli commented on SPARK-13514:
--------------------------------------
I see an exception, but also a message that "Started YARN shuffle service for
Spark on port 7337". The directory "/data/drive01/hadoop/yarn/nm" does exist on
the node managers, and the 'yarn' user can write to it.
{code}
2016-03-22 16:09:29,100 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing YARN shuffle service for Spark
2016-03-22 16:09:29,100 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service spark_shuffle, "spark_shuffle"
2016-03-22 16:09:29,255 ERROR org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error opening leveldb file file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb. Creating new file, will not be able to recover state for existing applications
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /opt/hadoop-2.5.1/file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb/LOCK: No such file or directory
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:100)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:81)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:56)
    at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:223)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:234)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:425)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:472)
2016-03-22 16:09:29,256 WARN org.apache.spark.network.shuffle.ExternalShuffleBlockResolver: error deleting file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb
2016-03-22 16:09:29,256 ERROR org.apache.spark.network.yarn.YarnShuffleService: Failed to initialize external shuffle service
java.io.IOException: Unable to create state store
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:129)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:81)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.<init>(ExternalShuffleBlockHandler.java:56)
    at org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:128)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:223)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:234)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:425)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:472)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /opt/hadoop-2.5.1/file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb/LOCK: No such file or directory
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.<init>(ExternalShuffleBlockResolver.java:127)
    ... 14 more
2016-03-22 16:09:29,386 INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port 7337. Authentication is not enabled. Registered executor file is file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb
{code}
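A guess at what the doubled path in the error means (the class name below is mine; the path is taken from the log above): the registered-executor file is apparently being configured as a {{file:}} URI rather than a plain filesystem path. {{java.io.File}} does not interpret URI syntax, so a string that does not start with "/" is treated as a relative path and resolved against the process working directory, which would produce exactly the {{/opt/hadoop-2.5.1/file:/data/...}} path leveldbjni fails on. A minimal sketch:

```java
import java.io.File;

public class ShufflePathDemo {
    public static void main(String[] args) {
        // The recovery path from the log above, as a "file:" URI string.
        // java.io.File treats it as a relative path (it does not start
        // with "/"), so it is resolved against the working directory --
        // on this cluster /opt/hadoop-2.5.1 -- yielding the doubled path
        // seen in the LOCK error.
        File f = new File("file:/data/drive01/hadoop/yarn/nm/registeredExecutors.ldb");
        System.out.println(f.isAbsolute());       // false
        System.out.println(f.getAbsolutePath());  // <working dir>/file:/data/...
    }
}
```

If that is what is happening, stripping the scheme (e.g. via {{new File(java.net.URI.create(path))}}) before handing the path to leveldbjni would avoid the relative-path resolution.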
> Spark Shuffle Service 1.6.0 issue in Yarn
> ------------------------------------------
>
> Key: SPARK-13514
> URL: https://issues.apache.org/jira/browse/SPARK-13514
> Project: Spark
> Issue Type: Bug
> Reporter: Satish Kolli
>
> Spark shuffle service 1.6.0 in YARN fails with an unknown exception. When I
> replace the Spark shuffle jar with the version 1.5.2 jar file, the following
> succeeds without any issues.
> Hadoop Version: 2.5.1 (Kerberos Enabled)
> Spark Version: 1.6.0
> Java Version: 1.7.0_79
> {code}
> $SPARK_HOME/bin/spark-shell \
> --master yarn \
> --deploy-mode client \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.minExecutors=5 \
> --conf spark.yarn.executor.memoryOverhead=2048 \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.scheduler.mode=FAIR \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --executor-memory 6G \
> --driver-memory 8G
> {code}
> {code}
> scala> val df = sc.parallelize(1 to 50).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int]
> scala> df.show(50)
> {code}
> {code}
> 16/02/26 08:20:53 INFO spark.SparkContext: Starting job: show at <console>:30
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Got job 0 (show at <console>:30) with 1 output partitions
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (show at <console>:30)
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Missing parents: List()
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at show at <console>:30), which has no missing parents
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.2 KB, free 2.2 KB)
> 16/02/26 08:20:53 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1411.0 B, free 3.6 KB)
> 16/02/26 08:20:53 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.5.76.106:46683 (size: 1411.0 B, free: 5.5 GB)
> 16/02/26 08:20:53 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
> 16/02/26 08:20:53 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at show at <console>:30)
> 16/02/26 08:20:53 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/02/26 08:20:53 INFO scheduler.FairSchedulableBuilder: Added task set TaskSet_0 tasks to pool default
> 16/02/26 08:20:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, XXXXXXXXXXXXXXXXXXXXXXXX, partition 0,PROCESS_LOCAL, 2031 bytes)
> 16/02/26 08:20:53 INFO cluster.YarnClientSchedulerBackend: Disabling executor 2.
> 16/02/26 08:20:54 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 0)
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
> 16/02/26 08:20:54 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, XXXXXXXXXXXXXXXXXXXXXXXX, 48113)
> 16/02/26 08:20:54 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
> 16/02/26 08:20:54 ERROR cluster.YarnScheduler: Lost executor 2 on XXXXXXXXXXXXXXXXXXXXXXXX: Container marked as failed: container_1456492687549_0001_01_000003 on host: XXXXXXXXXXXXXXXXXXXXXXXX. Exit status: 1. Diagnostics: Exception from container-launch:
> ExitCodeException exitCode=1:
> ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]