[pyspark] Drop __getattr__ on DataFrame

2015-04-21 Thread Karlson
I think the __getattr__ method should be removed from the DataFrame API 
in pyspark.


May I draw the Python folks' attention to the issue 
https://issues.apache.org/jira/browse/SPARK-7035 and invite comments?
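
For context, a minimal sketch of the ambiguity that __getattr__ introduces on DataFrame; the entry point and column names below are illustrative and not taken from the JIRA:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10)], ["id", "count"])

    # Attribute access works for harmless column names ...
    df.select(df.id).show()

    # ... but a column named like an existing DataFrame member never reaches
    # __getattr__: this resolves to the DataFrame.count method, not the column.
    print(df.count)

    # Item access is unambiguous and always returns the column.
    df.select(df["count"]).show()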


Thank you!




Re: functools.partial as UserDefinedFunction

2015-03-26 Thread Karlson

Hi,

I've filed a JIRA (https://issues.apache.org/jira/browse/SPARK-6553) and 
suggested a fix (https://github.com/apache/spark/pull/5206).



On 2015-03-25 19:49, Davies Liu wrote:

It's good to support functools.partial; could you file a JIRA for it?


On Wednesday, March 25, 2015 at 5:42 AM, Karlson wrote:



Hi all,

passing a functools.partial function as a UserDefinedFunction to
DataFrame.select raises an AttributeError, because functools.partial
does not have the attribute __name__. Is there any alternative to
relying on __name__ in pyspark/sql/functions.py:126?










functools.partial as UserDefinedFunction

2015-03-25 Thread Karlson


Hi all,

passing a functools.partial function as a UserDefinedFunction to 
DataFrame.select raises an AttributeError, because functools.partial 
does not have the attribute __name__. Is there any alternative to 
relying on __name__ in pyspark/sql/functions.py:126?
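
For reference, a minimal reproduction and a plain-wrapper workaround that sidesteps the missing __name__; the function names, data, and return type are illustrative, and releases that include the fix accept the partial directly:

    import functools

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    def add(a, b):
        return a + b

    add_ten = functools.partial(add, 10)

    # On affected releases this line raises, because functools.partial
    # objects carry no __name__ attribute:
    #   add_ten_udf = udf(add_ten, IntegerType())

    # Wrapping the partial in an ordinary function restores __name__:
    def add_ten_fn(x):
        return add_ten(x)

    add_ten_udf = udf(add_ten_fn, IntegerType())
    df.select(add_ten_udf(df["x"])).show()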






Storage of RDDs created via sc.parallelize

2015-03-20 Thread Karlson


Hi all,

Where is the data stored that is passed to sc.parallelize? Or, put 
differently, where is the data for the base RDD fetched from when the 
DAG is executed, if the base RDD is constructed via sc.parallelize?


I am reading a CSV file via the Python csv module and am feeding the 
parsed data chunkwise to sc.parallelize, because the whole file would 
not fit into memory on the driver. Reading the file with sc.textFile 
first is not an option, as there might be line breaks inside the CSV 
fields, preventing me from parsing the file line by line.
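
Roughly, the pattern being described is something like the following sketch; the chunk size, file name, and the final union are illustrative assumptions, not code from the original mail:

    import csv

    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-parallelize")

    def chunks(rows, size=10000):
        """Yield lists of parsed CSV rows, at most `size` rows per list."""
        chunk = []
        for row in rows:
            chunk.append(row)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    chunk_rdds = []
    with open("data.csv", newline="") as f:
        for chunk in chunks(csv.reader(f)):
            # Each chunk is shipped through the driver to become its own RDD.
            chunk_rdds.append(sc.parallelize(chunk))

    # Combine the per-chunk RDDs into one logical dataset.
    data = sc.union(chunk_rdds)
    print(data.count())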


The problem I am facing right now is that even though I am feeding only 
one chunk at a time to Spark, I will eventually run out of memory on the 
driver.


Thanks in advance!
