[ https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16417519#comment-16417519 ]

Nathan Kleyn edited comment on SPARK-23801 at 3/28/18 3:14 PM:
---------------------------------------------------------------

[~kiszk] Sure, we'll do our best to narrow the job down to something that 
fails - many thanks for the speedy reply!
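
For what it's worth, one narrowing step would be to re-run the exact same job with 
whole-stage codegen disabled, since the crashing frame sits over a 
GeneratedIteratorForCodegenStage class; if the segfault goes away, that points at the 
generated-code path rather than the job logic. A minimal sketch only - the object name 
is a placeholder and the actual job body is elided:

{code:scala}
// Minimal sketch (object name and job body are placeholders, not from the real job):
// run the identical pipeline with whole-stage codegen turned off for the session.
import org.apache.spark.sql.SparkSession

object NoCodegenRun {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-23801-no-wholestage-codegen")
      // Disables whole-stage codegen for this session; the physical plan then runs
      // through the interpreted iterator path instead of GeneratedIteratorForCodegenStageN.
      .config("spark.sql.codegen.wholeStage", "false")
      .getOrCreate()

    // ... same job logic, same inputs and options as before ...

    spark.stop()
  }
}
{code}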

Since upgrading to Spark v2.3.0, a lot of the stages are now listed as 
"ThreadPoolExecutor", which makes it really difficult to see what they actually 
correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?
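
In the meantime, a minimal sketch of how the crashing generated class might be tied 
back to the plan (using a stand-in Dataset, not the real job's): in Spark 2.3 the 
"*(N)" markers in explain() output carry the same codegen stage ID as the 
GeneratedIteratorForCodegenStageN class names in the crash dump, and debugCodegen() 
dumps the generated source for each of those subtrees.

{code:scala}
// Minimal sketch (the Dataset below is a stand-in, not from the real job):
// correlate whole-stage codegen IDs with the generated classes seen in the coredump.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Dataset

object CodegenStageInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-23801-codegen-inspection")
      .getOrCreate()

    // Stand-in for an intermediate Dataset from the failing job.
    val df = spark.range(0L, 1000L)
      .selectExpr("id", "id % 7 AS bucket")
      .groupBy("bucket")
      .count()

    // Physical plan; each "*(N)" prefix is a whole-stage codegen ID, matching
    // GeneratedIteratorForCodegenStageN in the crash dump.
    df.explain()

    // Dumps the generated Java source for every whole-stage codegen subtree.
    df.debugCodegen()

    spark.stop()
  }
}
{code}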


was (Author: nathankleyn):
[~kiszk] We'll do our best to narrow the job down to something that fails. 
Since upgrading to Spark v2.3.0, a lot of the stages are now listed as 
"ThreadPoolExecutor", which makes it really difficult to see what they actually 
correspond to in the source. As this job is failing around the 60th stage, 
that's a lot of things it could be! Is there any advice on the best way to 
correlate these "ThreadPoolExecutor" stages with the source code now?

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --------------------------------------------------
>
>                 Key: SPARK-23801
>                 URL: https://issues.apache.org/jira/browse/SPARK-23801
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>         Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>            Reporter: Nathan Kleyn
>            Priority: Major
>         Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16-core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f1467427fdc, pid=1315, tid=0x00007f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---------------  T H R E A D  ---------------
> Current thread (0x00007f146005b000):  GCTaskThread [stack: 
> 0x00007f1464e2d000,0x00007f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x0000000000000000
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x00007ef9c035f8c8, RCX=0x00007f1464f2c9f0, 
> RDX=0x0000000000000000
> RSP=0x00007f1464f2c1a0, RBP=0x00007f1464f2c210, RSI=0x0000000000000068, 
> RDI=0x00007ef7bc30bda8
> R8 =0x00007f1464f2c3d0, R9 =0x0000000000001741, R10=0x00007f1467a52819, 
> R11=0x00007f14671240e0
> R12=0x00007f130912c998, R13=0x17e907feccbc6d20, R14=0x0000000000000002, 
> R15=0x000000000000000d
> RIP=0x00007f1467427fdc, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, 
> ERR=0x0000000000000000
>   TRAPNO=0x000000000000000d
> Top of Stack: (sp=0x00007f1464f2c1a0)
> 0x00007f1464f2c1a0:   00007f146005b000 0000000000000001
> 0x00007f1464f2c1b0:   0000000000000004 00007f14600bb640
> 0x00007f1464f2c1c0:   00007f1464f2c210 00007f14673aeed6
> 0x00007f1464f2c1d0:   00007f1464f2c2c0 00007f1464f2c250
> 0x00007f1464f2c1e0:   00007f11bde31b70 00007ef9c035f8c8
> 0x00007f1464f2c1f0:   00007ef8a80a7060 0000000000001741
> 0x00007f1464f2c200:   0000000000000002 00000000ffffffff
> 0x00007f1464f2c210:   00007f1464f2c230 00007f146742b005
> 0x00007f1464f2c220:   00007ef8a80a7050 0000000000001741
> 0x00007f1464f2c230:   00007f1464f2c2d0 00007f14673ae9fb
> 0x00007f1464f2c240:   00007f1467a5d880 00007f14673ad9a0
> 0x00007f1464f2c250:   00007f1464f2c9f0 00007f1464f2c3d0
> 0x00007f1464f2c260:   00007f1464f2c3a0 00007f146005b620
> 0x00007f1464f2c270:   00007ef8b843d7c8 ffff000200000006
> 0x00007f1464f2c280:   00007f1464f2c340 00007f14600bb640
> 0x00007f1464f2c290:   17417f1453fb9cec 00007f1453fbffff
> 0x00007f1464f2c2a0:   00007f1453fb819e 00007f1464f2c3a0
> 0x00007f1464f2c2b0:   0000000000000001 0000000000000000
> 0x00007f1464f2c2c0:   00007f1464f2c3d0 00007f1464f2c9d0
> 0x00007f1464f2c2d0:   00007f1464f2c340 00007f1467025f22
> 0x00007f1464f2c2e0:   00007f145427cb5c 00007f1464f2c3a0
> 0x00007f1464f2c2f0:   00007f1464f2c370 00007f146005b000
> 0x00007f1464f2c300:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c310:   00007f1464f2c9f0 00007f1464f2c3a0
> 0x00007f1464f2c320:   00007f1464f2c3d0 00007f146005b000
> 0x00007f1464f2c330:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c340:   00007f1464f2c9c0 00007f1467508191
> 0x00007f1464f2c350:   00007ef9c16f7890 00007f1464f2c370
> 0x00007f1464f2c360:   00007f1464f2c9d0 0000000000000000
> 0x00007f1464f2c370:   00007ef9c035f8c0 00007f145427cb5c
> 0x00007f1464f2c380:   00007f145427ba90 00007ef900000000
> 0x00007f1464f2c390:   0000000000000078 00007ef9c035f8c0 
> Instructions: (pc=0x00007f1467427fdc)
> 0x00007f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x00007f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x00007f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 0x00007f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 
> Register to memory mapping:
> RAX=0x17e907feccbc6d20 is an unknown value
> RBX=0x00007ef9c035f8c8 is pointing into the stack for thread: 
> 0x00007ef850009800
> RCX=0x00007f1464f2c9f0 is an unknown value
> RDX=0x0000000000000000 is an unknown value
> RSP=0x00007f1464f2c1a0 is an unknown value
> RBP=0x00007f1464f2c210 is an unknown value
> RSI=0x0000000000000068 is an unknown value
> RDI=0x00007ef7bc30bda8 is pointing into metadata
> R8 =0x00007f1464f2c3d0 is an unknown value
> R9 =0x0000000000001741 is an unknown value
> R10=0x00007f1467a52819: <offset 0xfc0819> in 
> /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 
> 0x00007f1466a92000
> R11=0x00007f14671240e0: <offset 0x6920e0> in 
> /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 
> 0x00007f1466a92000
> R12=0x00007f130912c998 is an oop
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50
>  
>  - klass: 
> 'org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage50'
> R13=0x17e907feccbc6d20 is an unknown value
> R14=0x0000000000000002 is an unknown value
> R15=0x000000000000000d is an unknown value
> Stack: [0x00007f1464e2d000,0x00007f1464f2e000],  sp=0x00007f1464f2c1a0,  free 
> space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> V  [libjvm.so+0x999005]  PSRootsClosure<false>::do_oop(oopDesc**)+0x35
> V  [libjvm.so+0x91c9fb]  OopMapSet::all_do(frame const*, RegisterMap const*, 
> OopClosure*, void (*)(oopDesc**, oopDesc**), OopClosure*)+0x2fb
> V  [libjvm.so+0x593f22]  frame::oops_do_internal(OopClosure*, CLDClosure*, 
> CodeBlobClosure*, RegisterMap*, bool)+0xa2
> V  [libjvm.so+0xa76191]  JavaThread::oops_do(OopClosure*, CLDClosure*, 
> CodeBlobClosure*)+0x161
> V  [libjvm.so+0x99926f]  ThreadRootsTask::do_it(GCTaskManager*, unsigned 
> int)+0x6f
> V  [libjvm.so+0x5dbfef]  GCTaskThread::run()+0x12f
> V  [libjvm.so+0x92da28]  java_start(Thread*)+0x108
> JavaThread 0x00007ef850009800 (nid = 1558) was being processed
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 2336  sun.misc.Unsafe.putLong(Ljava/lang/Object;JJ)V (0 bytes) @ 
> 0x00007f14518c70cc [0x00007f14518c7080+0x4c]
> J 20102 C2 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50.processNext()V
>  (1030 bytes) @ 0x00007f145427cb5c [0x00007f145427c020+0xb3c]
> J 9304 C2 scala.collection.Iterator$$anon$11.hasNext()Z (10 bytes) @ 
> 0x00007f145280da10 [0x00007f145280d460+0x5b0]
> J 15346 C2 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V
>  (117 bytes) @ 0x00007f145227172c [0x00007f1452271680+0xac]
> J 16755 C1 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;
>  (293 bytes) @ 0x00007f14534a1dbc [0x00007f145349f820+0x259c]
> J 16754 C1 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;
>  (6 bytes) @ 0x00007f14536cf5cc [0x00007f14536cf540+0x8c]
> J 15858 C1 
> org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;
>  (399 bytes) @ 0x00007f1452eccd44 [0x00007f1452eca8a0+0x24a4]
> J 16786 C1 org.apache.spark.executor.Executor$TaskRunner.run()V (2984 bytes) 
> @ 0x00007f1453a4c97c [0x00007f1453a495e0+0x339c]
> J 18919 C1 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x00007f1453fb91cc [0x00007f1453fb81c0+0x100c]
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub{code}
> Unfortunately, this job is so large that it's pretty much impossible for us to 
> narrow it down to a reproducible test case. What I can say though is that:
>  * We are running on Mesos using coarse grained scheduling.
>  * We can make it fail every time, consistently.
>  * It only happened after we upgraded to v2.3.0.
>  * All inputs and options to the job are _exactly_ the same before as after.
> Please let me know if we can provide any other information!


