[pyspark] Drop __getattr__ on DataFrame
I think the __getattr__ method should be removed from the DataFrame API in PySpark. May I draw the Python folks' attention to the issue https://issues.apache.org/jira/browse/SPARK-7035 and invite comments? Thank you!
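The concern with __getattr__ can be illustrated with a toy class (a stand-in, not the real pyspark.sql.DataFrame): attribute access and column access share one namespace, so a column whose name matches an existing attribute or method is silently shadowed.

```python
class FakeDataFrame:
    """Toy stand-in for pyspark.sql.DataFrame, for illustration only."""

    def __init__(self, columns):
        self.columns = columns

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so real attributes and methods always win.
        if name in self.__dict__.get("columns", []):
            return "Column<%s>" % name
        raise AttributeError(name)

df = FakeDataFrame(["id", "count"])
df.id  # resolves to the column, since no attribute "id" exists
# On a real DataFrame, df.count resolves to the *method* DataFrame.count,
# shadowing a column literally named "count" -- the ambiguity at issue.
```

Explicit access via df["count"]-style indexing avoids the collision, which is one argument for dropping the attribute shortcut.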
Re: functools.partial as UserDefinedFunction
Hi, I've filed a JIRA (https://issues.apache.org/jira/browse/SPARK-6553) and suggested a fix (https://github.com/apache/spark/pull/5206).

On 2015-03-25 19:49, Davies Liu wrote:
> It's good to support functools.partial, could you file a JIRA for it?

On Wednesday, March 25, 2015 at 5:42 AM, Karlson wrote:
> Hi all, passing a functools.partial function as a UserDefinedFunction to DataFrame.select raises an AttributeError, because functools.partial objects do not have the attribute __name__. Is there any alternative to relying on __name__ in pyspark/sql/functions.py:126?
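For reference, a minimal reproduction of the missing attribute, plus a client-side workaround: functools.partial objects accept attribute assignment, so functools.update_wrapper can copy __name__ over from the wrapped function. (This is a sketch of a workaround, not necessarily the approach taken in the linked PR.)

```python
import functools

def add(a, b):
    return a + b

add_one = functools.partial(add, 1)

# partial objects carry no __name__, which is what trips up the UDF wrapper:
assert not hasattr(add_one, "__name__")

# Workaround: copy metadata (__name__, __doc__, ...) from the wrapped function.
functools.update_wrapper(add_one, add)
assert add_one.__name__ == "add"
assert add_one(5) == 6  # partial still behaves the same
```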
functools.partial as UserDefinedFunction
Hi all, passing a functools.partial function as a UserDefinedFunction to DataFrame.select raises an AttributeError, because functools.partial objects do not have the attribute __name__. Is there any alternative to relying on __name__ in pyspark/sql/functions.py:126?
Storage of RDDs created via sc.parallelize
Hi all, where is the data stored that is passed to sc.parallelize? Or put differently: when the DAG is executed, where is the data for the base RDD fetched from, if the base RDD was constructed via sc.parallelize?

I am reading a CSV file via the Python csv module and feeding the parsed data chunkwise to sc.parallelize, because the whole file would not fit into memory on the driver. Reading the file with sc.textFile first is not an option, as there might be line breaks inside the CSV fields, preventing me from parsing the file line by line. The problem I am facing right now is that even though I am feeding only one chunk at a time to Spark, I eventually run out of memory on the driver. Thanks in advance!
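For context, the chunked-loading pattern described above might look like this (file name and chunk size are hypothetical). Note that each sc.parallelize() call still serializes its chunk on the driver before shipping it to executors, and the resulting RDDs keep referencing that data from the driver, which is a plausible source of the memory growth described:

```python
import csv
from itertools import islice

def chunks(rows, size=10000):
    """Yield successive lists of at most `size` rows from an iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with an active SparkContext `sc`:
# with open("data.csv", newline="") as f:
#     rdds = [sc.parallelize(chunk) for chunk in chunks(csv.reader(f))]
#     rdd = sc.union(rdds)  # one logical RDD over all chunks
```

Writing the parsed chunks to a distributed store (e.g. as intermediate files) and loading them from there instead of parallelizing from the driver is the usual way to sidestep this.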