[pyspark] Drop __getattr__ on DataFrame

2015-04-21 Thread Karlson
I think the __getattr__ method should be removed from the DataFrame API in pyspark. May I draw the Python folk's attention to the issue https://issues.apache.org/jira/browse/SPARK-7035 and invite comments? Thank you!
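For readers unfamiliar with the mechanism, here is a minimal pure-Python sketch of one way attribute-style column access can surprise users. It is an illustration only: ToyFrame and its columns are hypothetical and are not the pyspark DataFrame, and the message above does not say whether this is the concern raised in the linked issue. It relies on one fact about Python: __getattr__ is consulted only after normal attribute lookup fails, so a column that shares its name with a method is not reachable as an attribute.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT the pyspark class."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def count(self):
        # A regular method: number of rows in the toy frame.
        return len(next(iter(self._columns.values())))

    def __getattr__(self, name):
        # Called only when normal attribute lookup has already failed.
        columns = self.__dict__.get("_columns", {})
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, name):
        # Explicit lookup: always returns the column, never a method.
        return self._columns[name]


frame = ToyFrame({"age": [31, 42], "count": [1, 2]})

age_column = frame.age           # resolved through __getattr__
count_attr = frame.count         # the bound method, not the "count" column
count_column = frame["count"]    # explicit item access reaches the column
```

In this sketch, item access stays unambiguous regardless of which method names the class defines, while attribute access changes meaning whenever a column name collides with a method.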

Re: functools.partial as UserDefinedFunction

2015-03-26 Thread Karlson
Karlson wrote: Hi all, passing a functools.partial object as a UserDefinedFunction to DataFrame.select raises an AttributeError, because functools.partial objects do not have the attribute __name__. Is there any alternative to relying on __name__ in pyspark/sql/functions.py:126

functools.partial as UserDefinedFunction

2015-03-25 Thread Karlson
Hi all, passing a functools.partial object as a UserDefinedFunction to DataFrame.select raises an AttributeError, because functools.partial objects do not have the attribute __name__. Is there any alternative to relying on __name__ in pyspark/sql/functions.py:126 ?
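The missing attribute can be reproduced without Spark, and two caller-side workarounds can be sketched in plain Python. This is a sketch under assumptions, not a fix to pyspark itself: add, add_one, named_partial and add_one_wrapped are hypothetical names, and whether the resulting callables are accepted by the UDF code at the cited line is not verified here. Only standard-library functools calls are used.

```python
import functools


def add(x, y):
    """Hypothetical function to be partially applied."""
    return x + y


# A bare partial object carries no __name__ attribute.
add_one = functools.partial(add, 1)
partial_has_name = hasattr(add_one, "__name__")

# Workaround 1: copy __name__, __doc__, etc. from the wrapped function
# onto the partial object. update_wrapper returns the object it updated.
named_partial = functools.update_wrapper(functools.partial(add, 1), add)


# Workaround 2: wrap the partial in an ordinary function, which always
# has a __name__ of its own.
def add_one_wrapped(y):
    return add_one(y)
```

Either callable then exposes a __name__ (here "add" and "add_one_wrapped" respectively) while behaving like the original partial.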

Storage of RDDs created via sc.parallelize

2015-03-20 Thread Karlson
Hi all, where is the data stored that is passed to sc.parallelize? Or put differently, where is the data for the base RDD fetched from when the DAG is executed, if the base RDD is constructed via sc.parallelize? I am reading a CSV file via the Python csv module and am feeding the parsed