Hello,

Similar to the thread below [1], when I tried to create an RDD from a ~4 GB
pandas DataFrame I encountered the error

    TypeError: cannot create an RDD from type: <type 'list'>

However, looking into the code shows this is raised from a generic "except
Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A debugging
session reveals the true error is that SPARK_LOCAL_DIRS ran out of
space:

    -> rdd = self._sc.parallelize(data)
    (Pdb)
    IOError: (28, 'No space left on device')
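
For illustration, here is a minimal sketch of the masking pattern (the
function name is hypothetical; this shows the shape of the problem, not the
actual context.py source):

    def _create_rdd(sc, data):
        try:
            # Raises IOError(28) when SPARK_LOCAL_DIRS runs out of space
            return sc.parallelize(data)
        except Exception:
            # The broad handler swallows the original IOError and raises
            # a misleading TypeError in its place.
            raise TypeError("cannot create an RDD from type: %s" % type(data))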

In this case, creating an RDD from a large matrix (~50 million rows) is a
requirement for us. I'm a bit concerned about Spark's process here:

   a. turning the DataFrame into records (data.to_records),
   b. writing them to a temporary file under SPARK_LOCAL_DIRS (this is
      where the disk fills up; see the sketch below), and
   c. reading them back in Scala.
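
As a stopgap, since step (b) is what exhausts the scratch space, pointing
Spark's scratch directory at a larger volume should avoid the immediate
failure (the path below is only an example):

    from pyspark import SparkConf, SparkContext

    # Point Spark's scratch space at a larger volume before the context
    # is created; equivalently, export SPARK_LOCAL_DIRS in the shell.
    conf = SparkConf().set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
    sc = SparkContext(conf=conf)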

Is there a better way? The intention is to operate on slices of this large
DataFrame using numpy operations via Spark's transformations and
actions.
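
One direction we are considering (a sketch only; num_chunks is a made-up
tuning knob, and this still pays the serialize-to-tmp cost, just with less
per-record overhead) is to parallelize numpy blocks rather than individual
records:

    import numpy as np

    # df is the large pandas DataFrame; sc is an existing SparkContext.
    # Split the underlying ndarray into row blocks so that each RDD
    # element is a numpy slice workers can process with vectorized ops.
    num_chunks = 512  # assumption: tune to cluster cores / memory
    blocks = np.array_split(df.values, num_chunks)
    rdd = sc.parallelize(blocks, numSlices=num_chunks)

    # Example: vectorized per-block column sums, combined on the driver.
    col_sums = rdd.map(lambda b: b.sum(axis=0)).sum()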

Thanks,
FDS
 
[1] https://www.mail-archive.com/user@spark.apache.org/msg35139.html

