Kalle Jepsen created SPARK-7035:
-----------------------------------

             Summary: Drop __getattr__ on pyspark.sql.DataFrame
                 Key: SPARK-7035
                 URL: https://issues.apache.org/jira/browse/SPARK-7035
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 1.4.0
            Reporter: Kalle Jepsen


I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.

There is no point in being able to address a DataFrame's columns as 
{{df.column}}, other than the questionable goal of pleasing R developers. And 
it seems R users will be able to use Spark from their native API in the future.

I see the following problems with {{\_\_getattr\_\_}} for column selection:

* It's un-Pythonic: there should be only one obvious way to solve a problem, 
and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} 
method, which in my opinion is far superior and a lot more intuitive.

* It leads to confusing exceptions. When we mistype a method name, the 
{{AttributeError}} will say 'No such column ... '.

* And most importantly: we cannot load DataFrames that have columns with the 
same name as any attribute on the DataFrame object. Imagine having a DataFrame 
with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} will be 
ambiguous and lead to broken code.
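
The last two problems can be reproduced without Spark. The following is only a minimal sketch: {{ToyFrame}} is a hypothetical stand-in written for this illustration, not the actual pyspark.sql.DataFrame code, and the error text is merely modelled on the 'No such column' message described above.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT the PySpark implementation."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, name):
        # Explicit column access: df["name"]
        if name not in self.columns:
            raise KeyError("No such column: %s" % name)
        return "Column<%s>" % name

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so real methods and attributes always take precedence.
        if name not in self.columns:
            raise AttributeError("No such column: %s" % name)
        return self[name]

    def cache(self):
        # Placeholder for a real DataFrame method.
        return self


df = ToyFrame(["age", "cache"])

# Two ways to reach the same column.
age_by_attr = df.age
age_by_item = df["age"]

# The column named "cache" is shadowed by the method of the same name;
# attribute access returns the bound method, item access the column.
cache_attr = df.cache
cache_col = df["cache"]

# A mistyped method name ("fliter" instead of "filter") is reported
# as a missing column rather than a missing method.
try:
    df.fliter
except AttributeError as exc:
    typo_message = str(exc)
```

In this sketch the method silently wins over the column, so only {{df["cache"]}} reaches the data, while the typo is reported as a column problem.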



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
