Re: IOError on createDataFrame
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?

Thanks
Best Regards

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti <fsacerd...@jumptrading.com> wrote:

> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe I encountered the error
>
>     TypeError: cannot create an RDD from type: <type 'list'>
>
> However, looking into the code shows this is raised from a generic
> "except Exception:" predicate (pyspark/sql/context.py:238 in spark-1.4.1).
> A debugging session reveals the true error is that SPARK_LOCAL_DIRS ran
> out of space:
>
>     -> rdd = self._sc.parallelize(data)
>     (Pdb)
>     IOError: (28, 'No space left on device')
>
> In this case, creating an RDD from a large matrix (~50 million rows) is
> required for us. I'm a bit concerned about Spark's process here:
>
>   a. turning the dataframe into records (data.to_records)
>   b. writing it to tmp
>   c. reading it back again in Scala.
>
> Is there a better way? The intention would be to operate on slices of this
> large dataframe using numpy operations via Spark's transformations and
> actions.
>
> Thanks,
> FDS
>
> 1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
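A minimal sketch of that suggestion, assuming the larger disk is mounted at a hypothetical path (a directory under /tmp stands in here so the snippet is self-contained):

```shell
# Create a scratch directory on the larger volume and point Spark at it.
# "/tmp/big-disk" is a stand-in for a real mount point such as "/data".
BIG_DISK="/tmp/big-disk"
mkdir -p "$BIG_DISK/spark-tmp"
export SPARK_LOCAL_DIRS="$BIG_DISK/spark-tmp"
echo "Spark scratch space: $SPARK_LOCAL_DIRS"
```

Note that SPARK_LOCAL_DIRS is read when the worker/executor processes start, so already-running daemons would need a restart to pick it up; the related spark.local.dir property can also be set in spark-defaults.conf.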
Re: IOError on createDataFrame
There are two issues here:

1. Suppression of the true reason for failure. The Spark runtime reports
   "TypeError", but that is not why the operation failed.
2. The low performance of loading a pandas dataframe.

DISCUSSION

Number (1) is easily fixed, and it is the primary purpose of my post.
Number (2) is harder, and may lead us to abandon Spark.

To answer Akhil: the process is too slow. Yes, it will work, but with large
dense datasets the line

    data = [r.tolist() for r in data.to_records(index=False)]

is basically a brick wall. It will take longer to load the RDD than to do
all operations on it, by a large margin.

Any help or guidance (should we write some custom loader?) would be
appreciated.

FDS

--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888p13912.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
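One shape such a custom loader might take is chunking: instead of converting all ~50 million rows to per-row Python lists on the driver, ship the dataframe in a few large contiguous slices. This is only a sketch under stated assumptions (the helper below is hypothetical, not a Spark API; the commented Spark calls assume a SparkContext `sc` and pandas DataFrame `df`):

```python
# Sketch of a chunk-based loader: split the row range into a small number
# of contiguous slices so only len(chunks) serialized blobs cross the
# driver/JVM boundary, instead of one Python list per row.

def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) row ranges covering n_rows."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# With a real SparkContext `sc` and pandas DataFrame `df`, one might do:
#   ranges = list(iter_chunks(len(df), 1000000))
#   payload = [df.iloc[a:b].to_records(index=False).tobytes()
#              for a, b in ranges]
#   rdd = sc.parallelize(payload, len(payload))
# and deserialize each blob back into a numpy array inside the tasks.

print(list(iter_chunks(10, 4)))  # -> [(0, 4), (4, 8), (8, 10)]
```

Whether this beats the built-in path depends on how cheaply the executors can reconstruct the arrays, but it avoids the per-row tolist() cost on the driver.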
IOError on createDataFrame
Hello,

Similar to the thread below [1], when I tried to create an RDD from a 4GB
pandas dataframe I encountered the error

    TypeError: cannot create an RDD from type: <type 'list'>

However, looking into the code shows this is raised from a generic
"except Exception:" predicate (pyspark/sql/context.py:238 in spark-1.4.1).
A debugging session reveals the true error is that SPARK_LOCAL_DIRS ran
out of space:

    -> rdd = self._sc.parallelize(data)
    (Pdb)
    IOError: (28, 'No space left on device')

In this case, creating an RDD from a large matrix (~50 million rows) is
required for us. I'm a bit concerned about Spark's process here:

  a. turning the dataframe into records (data.to_records)
  b. writing it to tmp
  c. reading it back again in Scala.

Is there a better way? The intention would be to operate on slices of this
large dataframe using numpy operations via Spark's transformations and
actions.

Thanks,
FDS

1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html

--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
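The masking described above can be reproduced with a small, self-contained sketch. The function names are hypothetical; only the broad-except pattern mirrors the one in pyspark/sql/context.py. The fix is to chain the original exception so the IOError stays visible in the traceback:

```python
# A stand-in for the call that actually fails with a full scratch disk.
def parallelize(data):
    raise IOError(28, "No space left on device")

def create_rdd_masking(data):
    try:
        return parallelize(data)
    except Exception:
        # The broad except swallows the IOError and reports a
        # misleading error about the input type instead.
        raise TypeError("cannot create an RDD from type: %s"
                        % type(data).__name__)

def create_rdd_chained(data):
    try:
        return parallelize(data)
    except Exception as e:
        # Python 3: "raise ... from e" records the IOError as the
        # explicit cause, so both errors appear in the traceback.
        raise TypeError("cannot create an RDD from type: %s"
                        % type(data).__name__) from e

try:
    create_rdd_chained([1, 2, 3])
except TypeError as e:
    # IOError is an alias of OSError in Python 3.
    print(type(e.__cause__).__name__)  # -> OSError
```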