[jira] [Commented] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame

Kalle Jepsen (JIRA) Wed, 22 Apr 2015 01:53:52 -0700

    [ https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506671#comment-14506671 ]


Kalle Jepsen commented on SPARK-7035:
-------------------------------------

1. Well, the interface being un-pythonic is the weakest of my three points. 
Still, I believe that Pandas is not exactly an authority on good Python style, 
so the argument that it's pythonic because Pandas supports it does not hold.

2. I agree on the Exception part.

3. The problem with collisions between column names and attribute names 
remains. It's inconsistent and will break code. Why should some columns be 
accessible by `df.columnname` but others not?

Imagine code that relies on a column named `something` and calls 
`some_func(df.something)`, and it works perfectly fine. Now, at some point in 
the future, an attribute `something` is added to the DataFrame API. The column 
will no longer be accessible that way, and your application will break. Using 
`some_func(df['something'])` instead is robust.
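
To make that scenario concrete, here is a minimal sketch in plain Python. `ToyFrame` and `ToyFrameV2` are hypothetical stand-ins written for this comment, not the PySpark implementation; they only assume that column access via attributes goes through `__getattr__`, which Python calls solely when normal attribute lookup fails.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT the PySpark implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # Item access only ever looks at the columns.
        return self._columns[name]

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError("No such column: " + name)


class ToyFrameV2(ToyFrame):
    """A later 'release' of the toy API that adds an attribute `something`."""

    def something(self):
        return "new API method"


data = {"something": [1, 2, 3]}
old = ToyFrame(data)
new = ToyFrameV2(data)

attr_old = old.something     # falls through to __getattr__: the column
attr_new = new.something     # found by normal lookup: a bound method
item_old = old["something"]  # the column
item_new = new["something"]  # still the column

print(attr_old == item_old)  # attribute access worked on the old version
print(callable(attr_new))    # on the new version it silently returns a method
print(item_new == item_old)  # item access is unaffected by the API change
```

In this toy, the attribute-style call site changes meaning without any error at the point of access, while the item-style call site keeps returning the column.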

One might argue that it's up to the user to choose between the two, but that's 
exactly what I meant when I said it's un-pythonic: there is more than one obvious 
way to do it, and one of them is dangerous while the other is perfectly fine.

There's an easy fix: it's still early enough to make such API changes, and we 
won't lose anything. In fact, I think we gain a lot by reducing the size of 
the code and the risk of errors.

> Drop __getattr__ on pyspark.sql.DataFrame
> -----------------------------------------
>
>                 Key: SPARK-7035
>                 URL: https://issues.apache.org/jira/browse/SPARK-7035
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Kalle Jepsen
>
> I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.
> There is no point in being able to address the DataFrame's columns 
> as {{df.column}}, other than the questionable goal of pleasing R developers. 
> And it seems R people can use Spark from their native API in the future.
> I see the following problems with {{\_\_getattr\_\_}} for column selection:
> * It's un-pythonic: There should only be one obvious way to solve a problem, 
> and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} 
> method, which in my opinion is by far superior and a lot more intuitive.
> * It leads to confusing Exceptions. When we mistype a method name, the 
> {{AttributeError}} will say 'No such column ... '.
> * And most importantly: we cannot load DataFrames that have columns with the 
> same name as any attribute on the DataFrame object. Imagine having a 
> DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} 
> will be ambiguous and lead to broken code.
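
The last two points of the description can be reproduced with a small plain-Python sketch. The class `ToyFrame`, its `cache` method, and the exact error wording are assumptions made for illustration; this is not PySpark's code, only a toy that routes unknown attributes to column lookup the way the issue describes.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT the PySpark implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def cache(self):
        # Placeholder for a real API method that shares its name with a column.
        return self

    def __getitem__(self, name):
        return self._columns[name]

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        # Assumed wording, modelled on the 'No such column' message in the issue.
        raise AttributeError("No such column " + name)


df = ToyFrame({"cache": [True, False], "age": [30, 41]})

# A mistyped method name is reported as a missing column.
try:
    df.cahce()
    message = None
except AttributeError as exc:
    message = str(exc)
print(message)

cache_attr = df.cache     # normal lookup wins: the method, not the column
cache_item = df["cache"]  # only item access reaches the column
print(callable(cache_attr), cache_item)
```

Here `df.age` works as an attribute but `df.cache` never reaches the column, so which columns are reachable by attribute depends on the API's own names.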



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
