[
https://issues.apache.org/jira/browse/SPARK-22458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-22458.
-------------------------------
Resolution: Not A Problem
Please read http://spark.apache.org/contributing.html
This just means you're out of YARN container memory. You need to increase the
overhead size. This is not a bug.
> OutOfDirectMemoryError with Spark 2.2
> -------------------------------------
>
> Key: SPARK-22458
> URL: https://issues.apache.org/jira/browse/SPARK-22458
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, SQL, YARN
> Affects Versions: 2.2.0
> Reporter: Kaushal Prajapati
> Priority: Blocker
>
> We were using Spark 2.1 from last 6 months to execute multiple spark jobs
> that is running 15 hour long for 50+ TB of source data with below
> configurations successfully.
> {quote}spark.master yarn
> spark.driver.cores 10
> spark.driver.maxResultSize 5g
> spark.driver.memory 20g
> spark.executor.cores 5
> spark.executor.extraJavaOptions *-XX:+UseG1GC
> -Dio.netty.maxDirectMemory=1024* -XX:MaxGCPauseMillis=60000
> *-XX:MaxDirectMemorySize=2048m*
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.driver.extraJavaOptions
> *-Dio.netty.maxDirectMemory=2048 -XX:MaxDirectMemorySize=2048m*
> -Dlog4j.configuration=file:///conf/log4j.properties -Dhdp.version=2.5.3.0-37
> spark.executor.instances 30
> spark.executor.memory 30g
> *spark.kryoserializer.buffer.max 512m*
> spark.network.timeout 12000s
> spark.serializer
> org.apache.spark.serializer.KryoSerializer
> spark.shuffle.io.preferDirectBufs false
> spark.sql.catalogImplementation hive
> spark.sql.shuffle.partitions 5000
> spark.yarn.driver.memoryOverhead 1536
> spark.yarn.executor.memoryOverhead 4096
> spark.core.connection.ack.wait.timeout 600s
> spark.scheduler.maxRegisteredResourcesWaitingTime 15s
> spark.sql.hive.filesourcePartitionFileCacheSize 524288000
> spark.dynamicAllocation.executorIdleTimeout 30000s
> spark.dynamicAllocation.enabled true
> spark.hadoop.yarn.timeline-service.enabled false
> spark.shuffle.service.enabled true
> spark.yarn.am.extraJavaOptions *-Dhdp.version=2.5.3.0-37
> -Dio.netty.maxDirectMemory=1024 -XX:MaxDirectMemorySize=1024m*{quote}
> Recently we tried to upgrade from Spark 2.1 to Spark 2.2 to get some fixes
> using latest version. But we started facing DirectBuffer outOfMemory error
> and exceeding memory limits for executor memoryOverhead issue. To fix that we
> started tweaking multiple properties but still issue persists. Relevant
> information is shared below
> Please let me any other details is requried,
>
> Snapshot for DirectMemory Error Stacktrace :-
> {code:java}
> 10:48:26.417 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in
> stage 5.3 (TID 25022, dedwdprshc070.de.xxxxxxx.com, executor 615):
> FetchFailed(BlockManagerId(465, dedwdprshc061.de.xxxxxxx.com, 7337, None),
> shuffleId=7, mapId=141, reduceId=3372, message=
> org.apache.spark.shuffle.FetchFailedException: failed to allocate 65536
> byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)
> at
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:59)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:414)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate
> 65536 byte(s) of direct memory (used: 1073699840, max: 1073741824)
> at
> io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:530)
> at
> io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:484)
> at
> io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
> at
> io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:67)
> at
> io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
> at
> io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:425)
> at
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:299)
> at
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> ... 1 more
> {code}
> if i removed above netty configuration, getting below error
> Snapshot for Excedding memory overhead Stacktrace :-
>
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3372 in stage 5.0 failed 4 times, most recent failure: Lost task 3372.3
> in stage 5.0 (TID 19534, dedwfprshd006.de.xxxxxxx.com, executor 125):
> ExecutorLostFailure (executor 125 exited caused by one of the running tasks)
> Reason: Container killed by YARN for exceeding memory limits. 37.1 GB of 34
> GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
> at
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
> ... 49 more
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]