[
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711079#comment-16711079
]
qian han commented on SPARK-26265:
----------------------------------
# There are hundreds of thousands of applications running on our cluster per day, and this deadlock has happened only once, so it cannot be reproduced easily.
# I ran this Spark SQL query:
{code:sql}
INSERT OVERWRITE TABLE dm_abtest.rpt_live_tag_metric_daily PARTITION(date='20181129_bak')
SELECT vid, tag_name, tag_value,
       count(*) impr_user,
       avg(impr) impr_per_u, stddev_pop(impr) var_impr_per_u,
       avg(read) read_per_u, stddev_pop(read) var_read_per_u,
       avg(stay) stay_per_u, stddev_pop(stay) var_stay_per_u,
       sum(stay)/sum(read) stay_per_r,
       sum(read)/sum(impr) read_per_i,
       avg(finish) finish_per_u, stddev_pop(finish) var_finish_per_u
FROM (
  SELECT vid, user_uid, user_uid_type, tag_name, tag_value,
         sum(impr) impr, sum(read) read, sum(stay) stay,
         sum(stay_count) stay_count, 0 finish
  FROM (
    SELECT transform(vid, user_uid, user_uid_type, tags, impr, read, stay, stay_count)
           USING 'python transform.py 111111'
           AS (vids, user_uid, user_uid_type, tag_name, tag_value, impr, read, stay, stay_count)
    FROM (
      SELECT vid, user_uid, user_uid_type, tags,
             count(*) impr, sum(all_read) read, sum(video_stay) stay,
             sum(if(video_stay > 0, 1, 0)) stay_count
      FROM dm_abtest.stg_live_impression_stats_daily
      WHERE date = '20181129' AND vid <> ''
      GROUP BY vid, user_uid, user_uid_type, tags
    ) t
    DISTRIBUTE BY vids, user_uid, user_uid_type, tag_name, tag_value
  ) t LATERAL VIEW explode(split(vids, ',')) b AS vid
  GROUP BY vid, user_uid, user_uid_type, tag_name, tag_value
) t
GROUP BY vid, tag_name, tag_value
{code}
# When the deadlock happens, the executor hangs and does nothing.
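For anyone hitting a similar hang: a deadlock like this can be confirmed from inside the JVM with the standard java.lang.management API (plain JDK, nothing Spark-specific), before pulling the full jstack quoted below. A minimal sketch, e.g. run from a watchdog thread on the stuck executor:
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal deadlock check: findDeadlockedThreads() returns the IDs of threads
// blocked in a monitor cycle, or null if there is none.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No deadlock detected.");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            // Prints the same facts jstack reports: which thread is blocked
            // on which lock, and which thread currently owns that lock.
            System.out.printf("\"%s\" waiting on %s held by \"%s\"%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}
{code}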
> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> ------------------------------------------------------------------
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: qian han
> Priority: Major
>
> The application runs on a cluster with 72,000 cores and 182,000 GB of memory.
> Environment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
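> For reference, the same settings expressed programmatically via SparkConf (a sketch only; the configuration keys and values are exactly those in the table above, however they are actually supplied in the cluster):
> {code:java}
> import org.apache.spark.SparkConf;
>
> // Sketch: builds a SparkConf carrying the environment listed above.
> public class EnvConf {
>     public static void main(String[] args) {
>         SparkConf conf = new SparkConf()
>                 .set("spark.dynamicAllocation.minExecutors", "5")
>                 .set("spark.dynamicAllocation.initialExecutors", "30")
>                 .set("spark.dynamicAllocation.maxExecutors", "400")
>                 .set("spark.executor.cores", "4")
>                 .set("spark.executor.memory", "20g");
>         System.out.println(conf.toDebugString());
>     }
> }
> {code}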
>
>
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> jstack information as follows:
> Found one Java-level deadlock:
> =============================
> "Thread-ScriptTransformation-Feed":
>   waiting to lock monitor 0x0000000000e0cb18 (object 0x00000002f1641538, a org.apache.spark.memory.TaskMemoryManager),
>   which is held by "Executor task launch worker for task 18899"
> "Executor task launch worker for task 18899":
>   waiting to lock monitor 0x0000000000e09788 (object 0x0000000302faa3b0, a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator),
>   which is held by "Thread-ScriptTransformation-Feed"
>
> Java stack information for the threads listed above:
> ===================================================
> "Thread-ScriptTransformation-Feed":
>   at org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
>   - waiting to lock <0x00000002f1641538> (a org.apache.spark.memory.TaskMemoryManager)
>   at org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130)
>   at org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
>   at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
>   - locked <0x0000000302faa3b0> (a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator)
>   at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313)
>   at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:281)
>   at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270)
>   at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1995)
>   at org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270)
> "Executor task launch worker for task 18899":
>   at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.spill(BytesToBytesMap.java:345)
>   - waiting to lock <0x0000000302faa3b0> (a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator)
>   at org.apache.spark.unsafe.map.BytesToBytesMap.spill(BytesToBytesMap.java:772)
>   at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:180)
>   - locked <0x00000002f1641538> (a org.apache.spark.memory.TaskMemoryManager)
>   at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:283)
>   at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:117)
>   at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:371)
>   at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:394)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:267)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
>
> Found 1 deadlock.
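> The two stacks above show a classic lock-order inversion: the feed thread holds the BytesToBytesMap$MapIterator monitor (locked in advanceToNextPage) and is waiting for the TaskMemoryManager monitor (in freePage), while the task thread holds the TaskMemoryManager monitor (locked in acquireExecutionMemory) and is waiting for the MapIterator monitor (in spill). A minimal standalone sketch of that inversion, using plain Objects as hypothetical stand-ins for the two Spark monitors:
> {code:java}
> // Stand-ins for the two monitors in the jstack output; not Spark classes.
> public class LockOrderInversion {
>     private static final Object mapIterator = new Object();
>     private static final Object taskMemoryManager = new Object();
>
>     public static void main(String[] args) {
>         // Mirrors Thread-ScriptTransformation-Feed: advanceToNextPage()
>         // holds the MapIterator monitor, then freePage() needs the
>         // TaskMemoryManager monitor.
>         Thread feed = new Thread(() -> {
>             synchronized (mapIterator) {
>                 pause();
>                 synchronized (taskMemoryManager) { }
>             }
>         }, "Thread-ScriptTransformation-Feed");
>
>         // Mirrors the task thread: acquireExecutionMemory() holds the
>         // TaskMemoryManager monitor, then spill() needs the MapIterator.
>         Thread worker = new Thread(() -> {
>             synchronized (taskMemoryManager) {
>                 pause();
>                 synchronized (mapIterator) { }
>             }
>         }, "Executor task launch worker");
>
>         feed.start();
>         worker.start(); // both threads now block forever, as in the report
>     }
>
>     private static void pause() {
>         // Widen the race window so the demo deadlocks reliably.
>         try { Thread.sleep(100); } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>         }
>     }
> }
> {code}
> The general fix for this pattern is to make both code paths acquire the two monitors in the same order, or to release the iterator monitor before calling back into the memory manager.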