[ https://issues.apache.org/jira/browse/SPARK-13909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakub Liska updated SPARK-13909:
--------------------------------
Description:
Hey, I migrated to 1.6.0, and suddenly `persist` behaves as if the storage level
were `MEMORY_ONLY` instead of `DISK_ONLY`, so the job eventually fails with an OOME.
If I remove `persist`, it works fine. I'm calling this snippet from a
Zeppelin notebook:
{code}
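import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

// `schema` is a StructType defined elsewhere in the notebook (not shown here)
val coreRdd = sc.textFile("s3n://gwiq-views-p/external/core/tsv/*.tsv")
  .map(_.split("\t"))
  .map(fields => Row(fields: _*))
val coreDataFrame = sqlContext.createDataFrame(coreRdd, schema)
coreDataFrame.registerTempTable("core")
coreDataFrame.persist(StorageLevel.DISK_ONLY)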
{code}
{code}
SELECT COUNT(*) FROM core
{code}
{code}
------ Create new SparkContext spark://master:7077 -------
Exception in thread "pool-1-thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:66)
    at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:69)
    at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
    at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:188)
    at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:146)
    at com.google.gson.Gson.fromJson(Gson.java:791)
    at com.google.gson.Gson.fromJson(Gson.java:757)
    at com.google.gson.Gson.fromJson(Gson.java:706)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.convert(RemoteInterpreterServer.java:417)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.getProgress(RemoteInterpreterServer.java:384)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1376)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1361)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{code}
I'm using the https://github.com/gettyimages/docker-spark setup with a
Zeppelin Docker container, configured as follows:
{code}
ZEPPELIN_JAVA_OPTS: -Dspark.executor.memory=16g
  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
  -Dspark.app.id=zeppelin
SPARK_SUBMIT_OPTIONS: --driver-memory 1g
  --repositories https://oss.sonatype.org/content/repositories/snapshots
  --packages com.viagraphs:spark-extensions_2.10:1.04-SNAPSHOT
  --jars=file:/usr/spark-1.6.0-bin-hadoop2.6/lib/aws-java-sdk-1.7.14.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/hadoop-aws-2.6.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/google-collections-1.0.jar,file:/usr/spark-1.6.0-bin-hadoop2.6/lib/joda-time-2.8.2.jar
SPARK_WORKER_CORES: 8
SPARK_WORKER_MEMORY: 16g
{code}
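One way to check which storage level was actually applied is to list the RDDs that Spark has registered as persistent. The following is only a sketch: it assumes the `sc` and `sqlContext` instances that Zeppelin injects and the `core` temp table from the snippet above, and it has to run after the `persist` call.

```scala
// Sketch only: assumes `sc` and `sqlContext` are the Zeppelin-provided instances
// and that the "core" temp table was registered and persisted as shown above.

// True if Spark SQL's cache manager holds an entry for the table.
val coreIsCached = sqlContext.isCached("core")
println(s"core cached: $coreIsCached")

// Print the storage level of every RDD currently marked as persistent.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: ${rdd.getStorageLevel.description}")
}
```

If the reported level involves memory rather than disk only, the `persist` call did not take effect as requested. The Storage tab of the Spark UI shows the same information.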
> DataFrames DISK_ONLY persistence leads to OOME
> ----------------------------------------------
>
> Key: SPARK-13909
> URL: https://issues.apache.org/jira/browse/SPARK-13909
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Environment: debian:jessie, java 1.8, hadoop 2.6.0, current zeppelin
> snapshot
> Reporter: Jakub Liska
> Labels: dataframe