[ https://issues.apache.org/jira/browse/SPARK-16549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niels Becker updated SPARK-16549:
---------------------------------
    Description:
I'm submitting my application via spark-submit. It runs a long-lived context with many jobs and tasks.
For a lot of tasks I get an error message:
{quote}
16/07/13 19:46:12 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1387674 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
{quote}

After a while I got errors like:
{quote}
16/07/13 19:45:43 ERROR Utils: Uncaught exception in thread task-result-getter-4
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getConstructor(Class.java:1825)
	at com.esotericsoftware.kryo.Kryo.newSerializer(Kryo.java:322)
	at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:303)
	at com.esotericsoftware.kryo.Kryo.register(Kryo.java:351)
	at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:140)
	at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
	at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
	at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
	at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:96)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{quote}

Finally, the entire JVM crashed:
{quote}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f576f13c7d3, pid=1152, tid=140007008368384
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-8u91-b14-1~bpo8+1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6967d3]
#
# Core dump written. Default location: /home/notebook/nbdata/core or core.1152
#
# An error report file with more information is saved as:
# /home/notebook/nbdata/hs_err_pid1152.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
{quote}

Inside my application I have a HiveContext and repeatedly run {{sqlContext.read.json(...).groupBy(...).count.collect}}, which produces around 10 results from 200 million raw JSON input records. On my 20-node cluster this spins up ~42000 tasks for each run.
My code does not store enough data to make a driver with 8 GB of memory run out of memory.
So I assume that either something inside Spark does not clean up finished tasks correctly, or that the strange error about an ignored update causes the memory leak.
{code}
val conf = new SparkConf().setAppName("Benchmark ")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val write = new java.io.PrintWriter(new java.io.FileOutputStream(outFile, true))

for (i <- 1 to count) {
  val startLoadTime = System.nanoTime()
  val df = sqlContext.read.json(...)
  val startExecTime = System.nanoTime()
  df.groupBy(...).count.collect
  val endTime = System.nanoTime()
  val str = s"${timediff(startLoadTime, startExecTime)}, ${timediff(startExecTime,endTime)}"
  println(f"[$i%03d/${count}%03d] $str%s")
  write.println(str)
  write.flush()
}
write.close()
{code}

I can upload the core dump, error log, and app code if needed.

  was:
I'm submitting my application via spark-submit. It runs a long-lived context with many jobs and tasks.
For a lot of tasks I get an error message:
{quote}
16/07/13 19:46:12 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1387674 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
{quote}

After a while I got errors like:
{quote}
16/07/13 19:45:43 ERROR Utils: Uncaught exception in thread task-result-getter-4
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at java.lang.Class.getConstructor0(Class.java:3082)
	at java.lang.Class.getConstructor(Class.java:1825)
	at com.esotericsoftware.kryo.Kryo.newSerializer(Kryo.java:322)
	at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:303)
	at com.esotericsoftware.kryo.Kryo.register(Kryo.java:351)
	at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:140)
	at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
	at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
	at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
	at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:96)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{quote}

Finally, the entire JVM crashed:
{quote}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f576f13c7d3, pid=1152, tid=140007008368384
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-8u91-b14-1~bpo8+1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6967d3]
#
# Core dump written. Default location: /home/notebook/nbdata/core or core.1152
#
# An error report file with more information is saved as:
# /home/notebook/nbdata/hs_err_pid1152.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
{quote}

Inside my application I have a HiveContext and repeatedly run {{sqlContext.read.json(...).groupBy(...).count.collect}}, which produces around 10 results from 200 million raw JSON input records. On my 20-node cluster this spins up ~42000 tasks for each run.
My code does not store enough data to make a driver with 8 GB of memory run out of memory.
So I assume something inside Spark does not clean up finished tasks correctly.
{code}
val conf = new SparkConf().setAppName("Benchmark ")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val write = new java.io.PrintWriter(new java.io.FileOutputStream(outFile, true))

for (i <- 1 to count) {
  val startLoadTime = System.nanoTime()
  val df = sqlContext.read.json(...)
  val startExecTime = System.nanoTime()
  df.groupBy(...).count.collect
  val endTime = System.nanoTime()
  val str = s"${timediff(startLoadTime, startExecTime)}, ${timediff(startExecTime,endTime)}"
  println(f"[$i%03d/${count}%03d] $str%s")
  write.println(str)
  write.flush()
}
write.close()
{code}

I can upload the core dump, error log, and app code if needed.


> GC Overhead Limit Reached and Core Dump
> ---------------------------------------
>
>                 Key: SPARK-16549
>                 URL: https://issues.apache.org/jira/browse/SPARK-16549
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: Mesos, Docker
>            Reporter: Niels Becker
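
The benchmark loop in the description calls a {{timediff}} helper whose definition is not included in the report. The following is a minimal sketch of what such a helper might look like, assuming it converts two {{System.nanoTime()}} readings into elapsed seconds. The object name {{TimeDiffSketch}}, the parameter names, and the choice of seconds as the unit are assumptions for illustration and are not taken from the reporter's code.

```scala
// Hypothetical stand-in for the timediff helper referenced in the report.
// Assumption: both arguments are System.nanoTime() readings and the result
// is the elapsed time in seconds.
object TimeDiffSketch {
  def timediff(startNanos: Long, endNanos: Long): Double =
    (endNanos - startNanos) / 1e9

  def main(args: Array[String]): Unit = {
    val start = System.nanoTime()
    Thread.sleep(50) // stand-in for the load or query step being timed
    val end = System.nanoTime()
    // Same interpolation style as the benchmark loop in the report.
    println(s"${timediff(start, end)}")
  }
}
```

Returning a {{Double}} keeps the value usable inside the {{s"..."}} interpolation shown in the report without further conversion.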