[ 
https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018384#comment-16018384
 ] 

Sean Owen commented on SPARK-20809:
-----------------------------------

You're setting driver memory in your program, but that happens after the 
driver JVM has already launched, so the setting has no effect. You need to 
look at the driver memory that was actually allocated at launch, which is 
probably only the default (1g in Spark 2.x unless overridden).
Also, it's not clear that the data is really only 1.2g. How big are the 
sentences?
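
One way to act on this, as a minimal sketch rather than a definitive fix: 
pass the driver heap size to spark-submit with its --driver-memory option, 
so the value is known before the driver JVM starts. The helper name 
build_submit_command and the bare "writeTest.py" path are hypothetical and 
only for illustration; the 16g value is taken from the script in the issue.

```python
import shlex


def build_submit_command(script, driver_memory="16g", spark_submit="spark-submit"):
    """Return the argv list for launching `script` with an explicit driver heap size."""
    # --driver-memory is read by spark-submit before the driver JVM is started,
    # unlike SparkConf.set("spark.driver.memory", ...) called inside the script.
    return [spark_submit, "--driver-memory", driver_memory, script]


if __name__ == "__main__":
    # Hypothetical script path; substitute the real location of the job.
    cmd = build_submit_command("writeTest.py")
    # Print a copy-pasteable shell command; nothing is executed here.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The same value could instead go into the default properties file as 
spark.driver.memory; either way it has to be set outside the program.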

> PySpark: Java heap space issue despite apparently being within memory limits
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20809
>                 URL: https://issues.apache.org/jira/browse/SPARK-20809
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1
>         Environment: Linux x86_64
>            Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
>     .set("spark.driver.memory", "16g") \
>     .set("spark.executor.memory", "16g") \
>     .set("spark.executor.memory_overhead", "16g") \
>     .set("spark.driver.maxResultSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, next(j),
>          ' '.join(map(lambda x: x[2], loremipsum.generate_sentences(600))))
>         for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> <system path>/spark-2.1.1-bin-hadoop2.7/bin/spark-submit <home directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
>         at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
>         at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>         at py4j.Gateway.invoke(Gateway.java:280)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:214)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each 
> tuple should be around 25 KB, and there are 50,000 tuples overall, so I 
> estimate around 1.2 GB of data in total. 
> Why then does it fail? Every part of the system should have enough memory.
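
The size estimate in the report can be checked rather than guessed. The 
sketch below is only a rough approach, assuming that the pickled size of each 
row is a fair proxy for what has to be shipped to the JVM; the helper name 
estimate_rows_bytes and the stand-in strings (plain repeated characters in 
place of the loremipsum output) are hypothetical.

```python
import pickle


def estimate_rows_bytes(rows, repeat=1):
    """Rough estimate: summed pickled size of the rows, times a repeat factor."""
    # Each row is pickled on its own, so shared objects are counted once per row.
    return sum(len(pickle.dumps(row, protocol=2)) for row in rows) * repeat


if __name__ == "__main__":
    # Stand-in data: 500 rows whose third element is a 25,000-character string,
    # mirroring the shape (not the content) of the rows in the report.
    sample = [(i, i % 8, "x" * 25000) for i in range(500)]
    total = estimate_rows_bytes(sample, repeat=100)
    print("approx. %.2f GB" % (total / 1e9))
```

Running the same function on the real rows list would show how far the 
actual sentences are from the assumed 25 KB each.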



