[ https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522165#comment-14522165 ]

Wes McKinney commented on SPARK-7035:
-------------------------------------

[~rxin] asked me to comment on this issue. 

I actually dragged my feet on the __getattr__ issue for several months back in 
the day, then finally added it (along with tab completion in IPython via 
__dir__), and immediately noticed a huge quality-of-life improvement when using 
pandas for real (especially interactive) work. You have to accept the 5% 
pathology (columns whose names overlap with DataFrame methods, or that are not 
valid Python identifiers, won't be accessible as attributes) in exchange for 
the 95% use case where table.<TAB> in IPython proves incredibly useful. Letting 
that 5% veto the feature is an anti-pattern I like to call "edge-case driven 
development". 

Aside: pandas is _far_ from a perfect project, but it has become a beloved 
daily-use tool for hundreds of thousands of people; usability really does 
matter. People are sacrificing significant performance in a lot of cases to use 
Python on Spark simply because they want to program in Python (for productivity 
and usability reasons), so this needs to be an important factor in your 
objective function.

The Zen of Python states "there should be one-- and preferably only one 
--obvious way to do it" but also "practicality beats purity", and I can say 
from experience that the latter wins big here. So even if you kill the feature 
now, users will eventually clamor for it so loudly that you'll be forced to 
add it back. 

You should obviously recommend that people building software with PySpark 
prefer __getitem__ when possible. That's the pandas best practice, and over a 
period of years it hasn't been shown to cause many problems. 
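
As a concrete illustration (this sketch assumes an existing {{SQLContext}} 
named {{sqlContext}}, with a column deliberately named after a DataFrame 
method):

{code:python}
df = sqlContext.createDataFrame([(1, True)], ["id", "cache"])

df["cache"]   # unambiguous: always the column named "cache"
df.cache      # resolves to the DataFrame.cache method, which
              # shadows the column of the same name
df.cache()    # caches the DataFrame; the column is unreachable
              # through attribute access
{code}

Code that sticks to {{df["cache"]}} keeps working regardless of the input 
schema, while {{df.column}} remains a convenience for interactive sessions.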

> Drop __getattr__ on pyspark.sql.DataFrame
> -----------------------------------------
>
>                 Key: SPARK-7035
>                 URL: https://issues.apache.org/jira/browse/SPARK-7035
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Kalle Jepsen
>
> I think the {{__getattr__}} method on the DataFrame should be removed.
> There is no point in being able to address a DataFrame's columns as 
> {{df.column}}, other than the questionable goal of pleasing R developers, 
> and it seems R users will be able to use Spark from their native API in the 
> future. I see the following problems with {{__getattr__}} for column 
> selection:
> * It's un-Pythonic: there should be only one obvious way to solve a problem, 
> and we can already address columns on a DataFrame via the {{__getitem__}} 
> method, which in my opinion is by far superior and a lot more intuitive.
> * It leads to confusing exceptions (see the sketch after this description). 
> When we mistype a method name, the {{AttributeError}} will say 
> 'No such column ... '.
> * And most importantly: we cannot load DataFrames that have columns with the 
> same name as any attribute on the DataFrame object. Imagine having a 
> DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} 
> will be ambiguous and lead to broken code.
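
To make the "confusing exceptions" bullet concrete, here is what a mistyped 
method name looks like under the hypothetical Frame sketch earlier in this 
comment (the message text is illustrative, not PySpark's actual output):

{code:python}
f = Frame({"id": [1, 2, 3]})
f.fliter    # typo for a filter() method
# AttributeError: No such column: 'fliter'
# The user mistyped a method name, but the error talks about
# columns, which is the confusion described above.
{code}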



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
