Re: IOError on createDataFrame

2015-08-31 Thread Akhil Das
Why not attach a bigger hard disk to the machines and point your
SPARK_LOCAL_DIRS to it?
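
For example (a minimal sketch; /mnt/bigdisk is a hypothetical mount point),
you can export SPARK_LOCAL_DIRS in conf/spark-env.sh on each machine, or set
the equivalent spark.local.dir property when building the context:

    # minimal sketch -- /mnt/bigdisk/spark-tmp is a hypothetical path;
    # this has the same effect as exporting SPARK_LOCAL_DIRS before launch
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("example")
            .set("spark.local.dir", "/mnt/bigdisk/spark-tmp"))
    sc = SparkContext(conf=conf)

Note that on a cluster, the SPARK_LOCAL_DIRS environment variable set on
each worker takes precedence over spark.local.dir.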

Thanks
Best Regards

On Sat, Aug 29, 2015 at 1:13 AM, fsacerdoti <fsacerd...@jumptrading.com>
wrote:

> Hello,
>
> Similar to the thread below [1], when I tried to create an RDD from a 4GB
> pandas dataframe, I encountered the error
>
> TypeError: cannot create an RDD from type: <type 'list'>
>
> However, looking into the code shows this is raised from a generic "except
> Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A debugging
> session reveals that the true error is SPARK_LOCAL_DIRS running out of
> space:
>
> -> rdd = self._sc.parallelize(data)
> (Pdb)
> *IOError: (28, 'No space left on device')*
>
> In this case, creating an RDD from a large matrix (~50 million rows) is
> required for us. I'm a bit concerned about Spark's process here:
>
>    a. turning the dataframe into records (data.to_records)
>    b. writing the records to temporary storage
>    c. reading them back again in Scala.
>
> Is there a better way? The intention would be to operate on slices of this
> large dataframe using numpy operations via Spark's transformations and
> actions.
>
> Thanks,
> FDS
>
> 1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
>


Re: IOError on createDataFrame

2015-08-31 Thread fsacerdoti
There are two issues here:

1. Suppression of the true reason for failure. The Spark runtime reports
"TypeError", but that is not why the operation failed.

2. The low performance of loading a pandas dataframe.


DISCUSSION

Number (1) is easily fixed, and was the primary purpose of my post.
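
For reference, the shape such a fix might take (an illustrative sketch, not
the actual pyspark source): catch narrowly, so that an unrelated failure
such as this IOError keeps its true message instead of being re-labeled:

    # illustrative sketch only -- not the actual pyspark code
    try:
        rdd = self._sc.parallelize(data)
    except (TypeError, ValueError):
        raise TypeError("cannot create an RDD from type: %s" % type(data))
    # IOError(28, 'No space left on device') now propagates unchanged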

Number (2) is harder, and may lead us to abandon Spark. To answer Akhil: the
process is too slow. Yes, it will work, but with large, dense datasets, the
line

data = [r.tolist() for r in data.to_records(index=False)]

is basically a brick wall. It will take longer to load the RDD than to do
all operations on it, by a large margin.

Any help or guidance (should we write some custom loader?) would be
appreciated.
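
For concreteness, here is the rough shape a custom loader might take (a
sketch only: the path is hypothetical, df is the pandas dataframe, sc the
SparkContext, and it assumes storage visible to both the driver and the
executors). The idea is one bulk write on the driver, with the parsing done
in parallel on the executors, instead of materializing ~50 million Python
lists up front:

    # sketch of a possible custom loader -- path is hypothetical
    import numpy as np

    # one bulk write on the driver; no per-row Python objects
    df.to_csv("/shared/tmp/data.csv", index=False, header=False)

    # executors read and parse rows in parallel, one numpy vector per record
    rdd = (sc.textFile("/shared/tmp/data.csv", minPartitions=64)
             .map(lambda line: np.fromstring(line, sep=",")))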

FDS





IOError on createDataFrame

2015-08-28 Thread fsacerdoti
Hello,

Similar to the thread below [1], when I tried to create an RDD from a 4GB
pandas dataframe, I encountered the error

TypeError: cannot create an RDD from type: <type 'list'>

However, looking into the code shows this is raised from a generic "except
Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A debugging
session reveals that the true error is SPARK_LOCAL_DIRS running out of
space:

-> rdd = self._sc.parallelize(data)
(Pdb) 
*IOError: (28, 'No space left on device')*

In this case, creating an RDD from a large matrix (~50 million rows) is
required for us. I'm a bit concerned about Spark's process here:

   a. turning the dataframe into records (data.to_records)
   b. writing the records to temporary storage
   c. reading them back again in Scala.

Is there a better way? The intention would be to operate on slices of this
large dataframe using numpy operations via Spark's transformations and
actions.
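
For example, the kind of usage intended would look like the following sketch
(column_sums is an illustrative operation, not our actual workload):

    import numpy as np

    def column_sums(rows):
        # materialize this partition's slice as one 2-D array, then apply
        # a numpy operation to the whole block at once
        block = np.array(list(rows))
        yield block.sum(axis=0)

    # per-partition partial sums, combined on the driver
    partials = rdd.mapPartitions(column_sums).collect()
    totals = np.sum(partials, axis=0)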

Thanks,
FDS
 
1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html




