Hi Yucai,

Thanks for the info. I have explored the container logs but did not get a lot of information from them.
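For reference, I pulled the aggregated container logs with the standard YARN CLI (this assumes log aggregation is enabled on the cluster; the application id is the one from the YARN error quoted at the bottom of this thread):

    yarn logs -applicationId application_1459747472046_1618 | less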
I have seen this error log for a few containers, but I am not sure what is causing it:

1. java.lang.NullPointerException (DiskBlockManager.scala:167)
2. java.lang.ClassCastException: RegisterExecutorFailed

Attaching the log for reference.

16/04/07 13:05:43 INFO storage.MemoryStore: MemoryStore started with capacity 2.6 GB
16/04/07 13:05:43 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
16/04/07 13:05:43 ERROR executor.CoarseGrainedExecutorBackend: Cannot register with driver: akka.tcp://sparkDriver@10.65.224.199:44692/user/CoarseGrainedScheduler
java.lang.ClassCastException: Cannot cast org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisterExecutorFailed to org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RegisteredExecutor$
        at java.lang.Class.cast(Class.java:3186)
        at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
        at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
        at scala.util.Try$.apply(Try.scala:161)
        at scala.util.Success.map(Try.scala:206)
        at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
        at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
        at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
        at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$KeptPromise.onComplete(Promise.scala:333)
        at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:254)
        at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
        at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
        at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
        at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:935)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16/04/07 13:05:44 INFO storage.DiskBlockManager: Shutdown hook called
16/04/07 13:05:44 ERROR util.Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
        at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:167)
        at org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:149)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
16/04/07 13:05:44 INFO util.ShutdownHookManager: Shutdown hook called

On Mon, Apr 11, 2016 at 1:10 PM, Yu, Yucai <yucai...@intel.com> wrote:

> Hi Yash,
>
> How about checking the executor (YARN container) logs? Most of the time they show more details. We are using CDH; the logs are at:
>
> [yucai@sr483 container_1457699919227_0094_01_000014]$ pwd
> /mnt/DP_disk1/yucai/yarn/logs/application_1457699919227_0094/container_1457699919227_0094_01_000014
> [yucai@sr483 container_1457699919227_0094_01_000014]$ ls -tlr
> total 408
> -rw-r--r-- 1 yucai DP 382676 Mar 13 18:04 stderr
> -rw-r--r-- 1 yucai DP 22302 Mar 13 18:04 stdout
>
> Please note: you had better check the first failed container.
>
> Thanks,
> Yucai
>
> From: Yash Sharma [mailto:yash...@gmail.com]
> Sent: Monday, April 11, 2016 10:46 AM
> To: dev@spark.apache.org
> Subject: Spark Sql on large number of files (~500Megs each) fails after couple of hours
>
> Hi All,
>
> I am trying Spark Sql on a dataset of ~16 TB with a large number of files (~50K). Each file is roughly 400-500 MB.
>
> I am issuing a fairly simple Hive query on the dataset with just filters (no groupBys and no joins), and the job is very, very slow. It runs for 7-8 hours and processes about 80-100 GB on a 12-node cluster.
>
> I have experimented with different values of spark.sql.shuffle.partitions, from 20 to 4000, but haven't seen much difference.
> From the logs, I have the YARN error attached at the end [1]. I am using the Spark configs below [2] for the job.
>
> Is there any other tuning I need to look into? Any tips would be appreciated.
>
> Thanks
>
> 2. Spark config:
>
> spark-submit
> --master yarn-client
> --driver-memory 1G
> --executor-memory 10G
> --executor-cores 5
> --conf spark.dynamicAllocation.enabled=true
> --conf spark.shuffle.service.enabled=true
> --conf spark.dynamicAllocation.initialExecutors=2
> --conf spark.dynamicAllocation.minExecutors=2
>
> 1. Yarn Error:
>
> 16/04/07 13:05:37 INFO yarn.YarnAllocator: Container marked as failed: container_1459747472046_1618_02_000003. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1459747472046_1618_02_000003
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Container exited with a non-zero exit code 1
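PS: for anyone following along, the sketch below is how I have been varying the shuffle partitions between runs. It is the same submit command as in [2] above with one extra --conf; the class and jar names at the end are placeholders, not the real job. Worth noting that spark.sql.shuffle.partitions only controls post-shuffle parallelism, so a filter-only query with no joins or groupBys would largely be unaffected by it.

    # Same submit command as [2] above, plus an explicit shuffle-partition
    # override (values from 20 to 4000 were tried across runs). The class
    # and jar names are placeholders for the real job.
    spark-submit \
      --master yarn-client \
      --driver-memory 1G \
      --executor-memory 10G \
      --executor-cores 5 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.initialExecutors=2 \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.sql.shuffle.partitions=2000 \
      --class com.example.FilterQuery \
      filter-query.jar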