[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784475#comment-16784475 ]

Martin Studer commented on SPARK-20415:
---------------------------------------

In fact, I believe the underlying issue is actually SPARK-21657. To be 
confirmed.

> SPARK job hangs while writing DataFrame to HDFS
> -----------------------------------------------
>
>                 Key: SPARK-20415
>                 URL: https://issues.apache.org/jira/browse/SPARK-20415
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 2.1.0
>         Environment: EMR 5.4.0
>            Reporter: P K
>            Priority: Major
>
> We are in the POC phase with Spark. One of the steps reads compressed JSON 
> files that come from sources, "explodes" them into tabular format, and then 
> writes them to HDFS. This worked for about three weeks until, a few days ago, 
> the writer just started hanging for a particular dataset. I logged in to the 
> worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 tid=0x00007f6210352800 nid=0x4542 runnable [0x00007f61f52b3000]
>    java.lang.Thread.State: RUNNABLE
>     at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
>     at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
>     at org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
>     at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>     at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> The last messages printed to stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 170.5 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 65.2 KB, free 2.2 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xxx.xx:37907 (size: 65.2 KB, free: 2.2 GB)
> 17/04/18 01:41:14 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:996
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting 9 missing tasks from ResultStage 4 (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO YarnScheduler: Adding task set 4.0 with 9 tasks
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 72, xxx.xxx.xx.xx.xx, executor 12, partition 1, NODE_LOCAL, 8184 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 73, xxx.xxx.xx.xx.xx, executor 13, partition 0, NODE_LOCAL, 7967 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID 74, xxx.xxx.xx.xx.xx, executor 14, partition 2, NODE_LOCAL, 8181 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 6.0 in stage 4.0 (TID 75, xxx.xxx.xx.xx.xx, executor 16, partition 6, NODE_LOCAL, 8400 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 7.0 in stage 4.0 (TID 76, xxx.xxx.xx.xx.xx, executor 10, partition 7, NODE_LOCAL, 8398 bytes)
> 17/04/18 01:41:14 INFO TaskSetManager: Starting task 3.0 in stage 4.0 (TID 77, xxx.xxx.xx.xx.xx, executor 11, partition 3, NODE_LOCAL, 8182 bytes)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:14 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46030 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO ExecutorAllocationManager: Requesting 9 new executors because tasks are backlogged (new desired total will be 9)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34157 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:40494 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:35861 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:46053 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:15 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43202 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 4.0 in stage 4.0 (TID 78, xxx.xxx.xx.xx.xx, executor 15, partition 4, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO TaskSetManager: Starting task 5.0 in stage 4.0 (TID 79, xxx.xxx.xx.xx.xx, executor 9, partition 5, RACK_LOCAL, 8400 bytes)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:17 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 65.2 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:34045 (size: 28.0 KB, free: 4.0 GB)
> 17/04/18 01:41:18 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on xxx.xxx.xx.xx.xx:43887 (size: 28.0 KB, free: 4.0 GB)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
