[ https://issues.apache.org/jira/browse/SPARK-10362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732537#comment-14732537 ]
Alexey Grishchenko commented on SPARK-10362:
--------------------------------------------

_createDataFrame()_ in Python, when called on a local collection, first calls _parallelize()_ on your data. In Python, _parallelize()_ works as follows: it creates a temporary file, dumps all your data into it, and then loads that data on the Java side. What happens here is that the JVM does not have enough memory to load the data, so it raises _java.lang.OutOfMemoryError: Java heap space_. Since all of this happens on the driver, I recommend increasing driver memory with _spark.driver.memory_ or _--driver-memory_.

> Cannot create DataFrame from large pandas.DataFrame
> ---------------------------------------------------
>
>                 Key: SPARK-10362
>                 URL: https://issues.apache.org/jira/browse/SPARK-10362
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1
>         Environment: Ubuntu 14.04
>                      Spark 1.4.1
>            Reporter: Hsueh-Min Chen
>            Priority: Minor
>
> I tried to convert a pandas.DataFrame object to pyspark's DataFrame. It works
> for small size of pandas.DataFrame (~10000), but fails for larger size.
>
> >>> sqlc = pyspark.sql.SQLContext(sc)
> >>> log = sqlc.createDataFrame(logs.head(10000000))
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     325                 # data could be list, tuple, generator ...
> --> 326                 rdd = self._sc.parallelize(data)
>     327             except Exception:
>
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py in parallelize(self, c, numSlices)
>     395         readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
> --> 396         jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
>     397         return RDD(jrdd, self, serializer)
>
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539
>
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
>     at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:389)
>     at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
>
> During handling of the above exception, another exception occurred:
>
> TypeError                                 Traceback (most recent call last)
> <ipython-input-12-32fb25f5be64> in <module>()
> ----> 1 log = sqlc.createDataFrame(logs.head(10000000))
>
> /home/elsdrm/spark-1.4.1-bin-hadoop2.6/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
>     326                 rdd = self._sc.parallelize(data)
>     327             except Exception:
> --> 328                 raise TypeError("cannot create an RDD from type: %s" % type(data))
>     329         else:
>     330             rdd = data
>
> TypeError: cannot create an RDD from type: <class 'list'>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
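
A minimal sketch of the recommendation in the comment above, assuming the session is started from a plain Python process (for example a notebook) rather than through spark-submit. PYSPARK_SUBMIT_ARGS is the environment variable PySpark reads when it launches the driver JVM, so it has to be set before the first SparkContext is created. The helper name driver_memory_submit_args and the 4g value are illustrative assumptions, not taken from the report; pick a heap size that fits the data and the machine.

```python
import os


def driver_memory_submit_args(memory="4g"):
    """Build a PYSPARK_SUBMIT_ARGS value that raises the driver JVM heap.

    Hypothetical helper for illustration. The trailing "pyspark-shell"
    token is expected by PySpark 1.4 and later.
    """
    return "--driver-memory {} pyspark-shell".format(memory)


# Set this before any SparkContext exists; once the driver JVM is running,
# its heap size can no longer be changed from Python.
os.environ["PYSPARK_SUBMIT_ARGS"] = driver_memory_submit_args("4g")

# Afterwards, in the same process (requires a Spark installation):
#   import pyspark
#   sc = pyspark.SparkContext()
#   sqlc = pyspark.sql.SQLContext(sc)
#   log = sqlc.createDataFrame(logs.head(10000000))
```

When launching through the command line instead, the same effect comes from passing --driver-memory directly, or from setting spark.driver.memory in conf/spark-defaults.conf.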