Hello,

Similar to the thread below [1], when I tried to create an RDD from a ~4 GB pandas DataFrame I encountered this error:

    TypeError: cannot create an RDD from type: <type 'list'>

However, looking into the code shows this is raised from a generic "except Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1), which masks the real failure. A debugging session reveals the true error: SPARK_LOCAL_DIRS ran out of space.

    -> rdd = self._sc.parallelize(data)
    (Pdb)
    IOError: (28, 'No space left on device')
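For reference, a minimal sketch of the call path that triggers this for us; the context setup, shape, and data below are illustrative stand-ins, not our actual job:

    import numpy as np
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local[4]", "repro")   # illustrative local context
    sqlContext = SQLContext(sc)

    # A large pandas DataFrame (~3 GB of float64). createDataFrame converts it
    # to records, and parallelize() spills the serialized batches to a temp
    # file under SPARK_LOCAL_DIRS before reading them back on the JVM side.
    pdf = pd.DataFrame(np.random.rand(50000000, 8))
    df = sqlContext.createDataFrame(pdf)   # raises the masked TypeError once tmp fills up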
In this case, creating an RDD from a large matrix (~50 million rows) is a requirement for us. I'm a bit concerned about Spark's process here:

  a. turning the DataFrame into records (data.to_records),
  b. writing them to a temporary directory on disk, and
  c. reading them back again in Scala.

Is there a better way? The intention is to operate on slices of this large DataFrame using numpy operations, via Spark's transformations and actions.

Thanks,
FDS

1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
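P.S. In case it helps the discussion, the workaround I'm currently experimenting with is to avoid pushing the whole matrix through the driver: parallelize only the slice boundaries and let executors memory-map the data from shared storage. This is just a sketch; the path, chunk size, and some_numpy_op are placeholders, and it assumes a filesystem all executors can read.

    import numpy as np

    # Assumes `matrix` is the underlying numpy array and `sc` the live
    # SparkContext; '/shared/matrix.npy' stands in for any executor-visible path.
    np.save('/shared/matrix.npy', matrix)

    n_rows, chunk = matrix.shape[0], 1000000
    bounds = [(i, min(i + chunk, n_rows)) for i in xrange(0, n_rows, chunk)]

    def process(bound):
        start, stop = bound
        # Memory-map so each task touches only its own slice of the file.
        m = np.load('/shared/matrix.npy', mmap_mode='r')
        return some_numpy_op(m[start:stop])   # placeholder for the real computation

    results = sc.parallelize(bounds, len(bounds)).map(process).collect()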