[ https://issues.apache.org/jira/browse/SPARK-16549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niels Becker updated SPARK-16549:
---------------------------------
    Description: 
I'm submitting my application via spark-submit. It runs a long-lived 
context with many jobs and tasks.
My 20-node cluster is managed by Mesos 0.28.2, and Spark runs inside Docker 
containers.

For many tasks I get an error message:
{quote}
16/07/13 19:46:12 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1387674 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
{quote}
This message shows up around 15 times per second.
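
A rate like this can be checked against a saved driver log. The following is only a minimal sketch: the object name `IgnoredUpdateRate` and the log file path passed as the first argument are assumptions, and it expects each log entry on a single line with the timestamp format shown above.

```scala
// Hypothetical helper (not part of Spark): counts the
// "Ignoring update" errors per one-second log timestamp.
object IgnoredUpdateRate {
  // Captures the timestamp prefix, e.g. "16/07/13 19:46:12".
  private val Line =
    """^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR TaskSchedulerImpl: Ignoring update.*""".r

  // Returns a map from timestamp (one-second resolution) to message count.
  def perSecond(lines: Iterator[String]): Map[String, Int] =
    lines.foldLeft(Map.empty[String, Int]) { (counts, line) =>
      line match {
        case Line(ts) => counts.updated(ts, counts.getOrElse(ts, 0) + 1)
        case _        => counts
      }
    }

  def main(args: Array[String]): Unit = {
    // args(0) is assumed to be the path of the driver log file.
    val source = scala.io.Source.fromFile(args(0))
    try {
      perSecond(source.getLines()).toSeq.sortBy(_._1).foreach {
        case (ts, n) => println(s"$ts $n")
      }
    } finally source.close()
  }
}
```

Each output line is a timestamp followed by the number of matching messages logged in that second.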

After a while I got errors like:
{quote}
16/07/13 19:45:43 ERROR Utils: Uncaught exception in thread task-result-getter-4
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at java.lang.Class.getConstructor0(Class.java:3082)
        at java.lang.Class.getConstructor(Class.java:1825)
        at com.esotericsoftware.kryo.Kryo.newSerializer(Kryo.java:322)
        at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:303)
        at com.esotericsoftware.kryo.Kryo.register(Kryo.java:351)
        at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:140)
        at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
        at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
        at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
        at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:96)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{quote}

Finally, the entire JVM crashed:
{quote}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f576f13c7d3, pid=1152, tid=140007008368384
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-8u91-b14-1~bpo8+1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6967d3]
#
# Core dump written. Default location: /home/notebook/nbdata/core or core.1152
#
# An error report file with more information is saved as:
# /home/notebook/nbdata/hs_err_pid1152.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
{quote}

Inside my application I have a HiveContext and repeatedly run 
{{sqlContext.read.json(...).groupBy(...).count.collect}}, which returns around 10 
results from 200 million raw JSON input records. This spins up ~42000 tasks for 
each iteration. 
My code does not hold enough data to make a driver with 8 GB of memory 
run out of memory. 
So I assume that either something inside Spark does not clean up finished tasks 
correctly, or this strange error, in which the TaskScheduler ignores an update, 
causes the memory leak.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// outFile, count and timediff are not shown in this snippet.
val conf = new SparkConf().setAppName("Benchmark ")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Append one timing line per iteration to outFile.
val write = new java.io.PrintWriter(new java.io.FileOutputStream(outFile, true))
for (i <- 1 to count) {
    val startLoadTime = System.nanoTime()
    val df = sqlContext.read.json(...)
    val startExecTime = System.nanoTime()
    // Runs the full job; only the small aggregated result reaches the driver.
    df.groupBy(...).count.collect
    val endTime = System.nanoTime()
    val str = s"${timediff(startLoadTime, startExecTime)}, ${timediff(startExecTime, endTime)}"
    println(f"[$i%03d/${count}%03d] $str%s")
    write.println(str)
    write.flush()
}
write.close()
{code}
I can upload core dump, error log and app code if needed.
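
The loop above calls a `timediff` helper that is not included in the snippet. A hypothetical stand-in, assuming it simply converts two `System.nanoTime()` readings into elapsed seconds (the real helper may format its result differently), could look like this:

```scala
object Timing {
  // Hypothetical stand-in for the `timediff` helper used in the loop;
  // returns the elapsed time between two nanoTime readings in seconds.
  def timediff(startNanos: Long, endNanos: Long): Double =
    (endNanos - startNanos) / 1e9
}
```

With this version, each line written to the output file would contain the load time and the execution time in seconds, separated by a comma.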

EDIT: I can more or less reproduce this. It always crashes, but only after a 
random number of iterations, between 20 and 50.
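
Since the crash only appears after tens of iterations, logging the driver's heap usage once per iteration is one way to see whether memory grows steadily before the failure. This is a minimal sketch using only the standard JVM `Runtime` API; the object name `HeapLog` and the log format are assumptions, not part of the application.

```scala
object HeapLog {
  // Heap currently in use by this JVM, in megabytes
  // (allocated heap minus the free part of it).
  def usedHeapMb(): Long = {
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
  }

  // Formats one log line for a given iteration number.
  def line(iteration: Int, usedMb: Long): String =
    f"[$iteration%03d] driver heap used: $usedMb%d MB"
}
```

Calling `println(HeapLog.line(i, HeapLog.usedHeapMb()))` at the end of each loop iteration would give one reading per run. The value includes garbage that has not been collected yet, so only a trend across many iterations is meaningful.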

  was:
I'm submitting my application via spark-submit. It runs a long-lived 
context with many jobs and tasks.

For many tasks I get an error message:
{quote}
16/07/13 19:46:12 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1387674 because its task set is gone (this is likely the result of receiving duplicate task finished status updates)
{quote}

After a while I got errors like:
{quote}
16/07/13 19:45:43 ERROR Utils: Uncaught exception in thread task-result-getter-4
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at java.lang.Class.getConstructor0(Class.java:3082)
        at java.lang.Class.getConstructor(Class.java:1825)
        at com.esotericsoftware.kryo.Kryo.newSerializer(Kryo.java:322)
        at com.esotericsoftware.kryo.Kryo.getDefaultSerializer(Kryo.java:303)
        at com.esotericsoftware.kryo.Kryo.register(Kryo.java:351)
        at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:140)
        at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
        at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
        at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
        at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:96)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
        at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{quote}

Finally, the entire JVM crashed:
{quote}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f576f13c7d3, pid=1152, tid=140007008368384
#
# JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-8u91-b14-1~bpo8+1-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x6967d3]
#
# Core dump written. Default location: /home/notebook/nbdata/core or core.1152
#
# An error report file with more information is saved as:
# /home/notebook/nbdata/hs_err_pid1152.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Aborted (core dumped)
{quote}

Inside my application I have a HiveContext and repeatedly run 
{{sqlContext.read.json(...).groupBy(...).count.collect}}, which returns around 10 
results from 200 million raw JSON input records. On my 20-node cluster this 
spins up ~42000 tasks for each run. 
My code does not hold enough data to make a driver with 8 GB of memory 
run out of memory. 
So I assume that either something inside Spark does not clean up finished tasks 
correctly, or this strange error about an ignored update causes the memory leak.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// outFile, count and timediff are not shown in this snippet.
val conf = new SparkConf().setAppName("Benchmark ")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Append one timing line per iteration to outFile.
val write = new java.io.PrintWriter(new java.io.FileOutputStream(outFile, true))
for (i <- 1 to count) {
    val startLoadTime = System.nanoTime()
    val df = sqlContext.read.json(...)
    val startExecTime = System.nanoTime()
    // Runs the full job; only the small aggregated result reaches the driver.
    df.groupBy(...).count.collect
    val endTime = System.nanoTime()
    val str = s"${timediff(startLoadTime, startExecTime)}, ${timediff(startExecTime, endTime)}"
    println(f"[$i%03d/${count}%03d] $str%s")
    write.println(str)
    write.flush()
}
write.close()
{code}
I can upload core dump, error log and app code if needed.


> GC Overhead Limit Reached and Core Dump
> ---------------------------------------
>
>                 Key: SPARK-16549
>                 URL: https://issues.apache.org/jira/browse/SPARK-16549
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>         Environment: Mesos, Docker
>            Reporter: Niels Becker
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
