[jira] [Commented] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame

Nicholas Chammas (JIRA) Sat, 09 May 2015 09:44:05 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536751#comment-14536751
 ]


Nicholas Chammas commented on SPARK-7035:
-----------------------------------------

{quote}
Zen of Python states "there should be one-- and preferably only one --obvious 
way to do it" and also "practicality beats purity", and I can say from 
experience that the latter wins big here. So even if you kill the feature now, 
users will eventually clamor for it so much that you'll be forced to add it.
{quote}

I agree with this. The issue of Pythonic vs. not Pythonic comes up quite often 
in the Python community, and practicality -- especially practicality learned 
from experience, as in Pandas's case -- is part of being Pythonic.

The risk that Kalle points out is real though, and I think we've taken the best 
route by supporting both {{\_\_getattr\_\_}} access and {{\_\_getitem\_\_}} 
access but [discouraging use of 
{{\_\_getattr\_\_}}|https://github.com/apache/spark/pull/5971]. As long as our 
docs and examples always favor the {{\_\_getitem\_\_}} style, I think we are 
fine.

We are fortunate enough to know with confidence from Pandas's extensive 
experience that offering {{\_\_getattr\_\_}} access is a net win.
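
To make the trade-off concrete, here is a minimal sketch of how the two access styles interact. {{ToyFrame}} is a hypothetical stand-in written only for illustration; it is not the PySpark implementation, and the error message text is an assumption based on the wording in the issue description. It relies only on standard Python behavior: {{\_\_getattr\_\_}} is called only when normal attribute lookup fails.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the PySpark implementation."""

    def __init__(self, columns):
        # Map of column name -> list of values.
        self._columns = dict(columns)

    def cache(self):
        # Placeholder for a real method whose name may collide with a column.
        return self

    def __getitem__(self, name):
        # Item access only ever looks at columns.
        try:
            return self._columns[name]
        except KeyError:
            raise KeyError("No such column: %s" % name)

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails, so real
        # attributes and methods always win over columns of the same name.
        if name.startswith("_"):
            # Avoid recursing on private names such as _columns.
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError("No such column: %s" % name)


df = ToyFrame({"age": [31, 45], "cache": [True, False]})

print(df["age"])    # item access to the "age" column
print(df.age)       # attribute access reaches the same column
print(df["cache"])  # item access still reaches the column named "cache"
print(df.cache)     # attribute access finds the method, not the column
```

In this sketch a column named {{cache}} stays reachable through {{df\["cache"\]}} while {{df.cache}} keeps resolving to the method, which is why examples that favor the {{\_\_getitem\_\_}} style stay correct regardless of column names. A mistyped method name such as {{df.cahce}} falls through to {{\_\_getattr\_\_}} and reports a missing column, which is the confusing-error case raised in the issue.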

> Drop __getattr__ on pyspark.sql.DataFrame
> -----------------------------------------
>
>                 Key: SPARK-7035
>                 URL: https://issues.apache.org/jira/browse/SPARK-7035
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Kalle Jepsen
>
> I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.
> There is no point in being able to address the DataFrame's columns 
> as {{df.column}}, other than the questionable goal of pleasing R developers. 
> And it seems R people can use Spark from their native API in the future.
> I see the following problems with {{\_\_getattr\_\_}} for column selection:
> * It's un-pythonic: There should only be one obvious way to solve a problem, 
> and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} 
> method, which in my opinion is by far superior and a lot more intuitive.
> * It leads to confusing exceptions. When we mistype a method name, the 
> {{AttributeError}} will say 'No such column ... '.
> * And most importantly: we cannot load DataFrames that have columns with the 
> same name as any attribute on the DataFrame-object. Imagine having a 
> DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} 
> will be ambiguous and lead to broken code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

