Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Shivaram Venkataraman
I don't know much about Python style, but I think the point Wes made about
usability on the JIRA is pretty powerful. IMHO the number of methods on a
Spark DataFrame might not be much larger than in Pandas. Given that
users seem to be okay with the possibility of collisions in Pandas, I
think sticking with (1) is not a bad idea.

Also, is it possible to detect such collisions in Python? A (4)th option
might be to detect that `df` contains a column named `name` and print a
warning in `df.name` which tells the user that the method is overriding the
column.
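
One way the detection idea could look, as a rough sketch only. `WarningFrame`
and its `count` method are hypothetical stand-ins and not PySpark code; the
sketch assumes columns live in a plain dict and that the warning is raised at
attribute-access time through `__getattribute__`.

```python
import warnings


class WarningFrame:
    """Hypothetical toy class; not the PySpark DataFrame."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # Bracket access always returns the column.
        return self._columns[name]

    def __getattribute__(self, name):
        # Normal lookup first. If it raises AttributeError, Python falls
        # back to __getattr__ below, so unshadowed columns never warn.
        value = object.__getattribute__(self, name)
        if not name.startswith("_"):
            columns = object.__getattribute__(self, "__dict__").get("_columns", {})
            if name in columns:
                warnings.warn(
                    "attribute %r hides the column of the same name; "
                    "use df[%r] to get the column" % (name, name),
                    stacklevel=2,
                )
        return value

    def __getattr__(self, name):
        # Reached only when normal lookup failed.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def count(self):
        # Stands in for any method whose name matches a column.
        return len(self._columns)


df = WarningFrame({"count": "column:count", "age": "column:age"})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    method = df.count   # shadowed: warns and returns the bound method
    column = df.age     # not shadowed: no warning, returns the column
```

Hooking `__getattribute__` adds a check to every attribute access, so a real
implementation might prefer to check for collisions once, when the frame is
created.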

Thanks
Shivaram


On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi all,

 In PySpark, a DataFrame column can be referenced using df["abcd"]
 (__getitem__) and df.abcd (__getattr__). There is a discussion on
 SPARK-7035 on compatibility issues with the __getattr__ approach, and
 I want to collect more inputs on this.

 Basically, if in the future we introduce a new method to DataFrame, it
 may break user code that uses the same attr to reference a column, or
 silently change its behavior. For example, if we add name() to
 DataFrame in the next release, all existing code using `df.name` to
 reference a column called name will break. If we add `name` as a
 property instead of a method, all existing code using `df.name` may
 still work but with a different meaning: `df.select(df.name)` no
 longer selects the column called name but the column whose name is
 the value of `df.name`.

 There are several proposed solutions:

 1. Keep both df.abcd and df["abcd"], and encourage users to use the
 latter, which is future proof. This is the current solution in master
 (https://github.com/apache/spark/pull/5971). But I think users may
 still be unaware of the compatibility issue and prefer `df.abcd` to
 `df["abcd"]` because the former can be auto-completed.
 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
 JIRA page: "I actually dragged my feet on the __getattr__ issue for
 several months back in the day, then finally added it (and tab
 completion in IPython with __dir__), and immediately noticed a huge
 quality-of-life improvement when using pandas for actual (esp.
 interactive) work."
 3. Replace df.abcd by df.abcd_ (with a suffix _). Both df.abcd_ and
 df["abcd"] would be future proof, and df.abcd_ could be
 auto-completed. The tradeoff is the extra _ appearing in
 the code.

 My preference is 3 > 1 > 2. Your inputs would be greatly appreciated.
 Thanks!

 Best,
 Xiangrui

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 I don't know much about Python style, but I think the point Wes made about
 usability on the JIRA is pretty powerful. IMHO the number of methods on a
 Spark DataFrame might not be much larger than in Pandas. Given that
 users seem to be okay with the possibility of collisions in Pandas, I
 think sticking with (1) is not a bad idea.


This is true for interactive work. Spark's DataFrames can handle
really large datasets, which might be used in production workflows. So
I think it is reasonable for us to care more about compatibility
issues than Pandas does.

 Also, is it possible to detect such collisions in Python? A (4)th option
 might be to detect that `df` contains a column named `name` and print a
 warning in `df.name` which tells the user that the method is overriding the
 column.

Maybe we can inspect the frame in which `df.name` gets called and warn
users in `df.select(df.name)` but not in `name = df.name`. This could be
tricky to implement.

-Xiangrui





Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all,

In PySpark, a DataFrame column can be referenced using df["abcd"]
(__getitem__) and df.abcd (__getattr__). There is a discussion on
SPARK-7035 on compatibility issues with the __getattr__ approach, and
I want to collect more inputs on this.

Basically, if in the future we introduce a new method to DataFrame, it
may break user code that uses the same attr to reference a column, or
silently change its behavior. For example, if we add name() to
DataFrame in the next release, all existing code using `df.name` to
reference a column called name will break. If we add `name` as a
property instead of a method, all existing code using `df.name` may
still work but with a different meaning: `df.select(df.name)` no
longer selects the column called name but the column whose name is
the value of `df.name`.
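
The breakage described above can be reproduced with a toy class. This is only
an illustrative sketch: `ToyFrame` and `ToyFrameNextRelease` are hypothetical
stand-ins, not PySpark classes. It relies on the documented Python rule that
`__getattr__` is consulted only after normal attribute lookup fails.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the PySpark class."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # Bracket access always consults the columns.
        return self._columns[name]

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails,
        # so any real method or property on the class takes precedence.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


class ToyFrameNextRelease(ToyFrame):
    """The same class after a hypothetical release that adds name()."""

    def name(self):
        return "frame-name"


old = ToyFrame({"name": "column:name"})
new = ToyFrameNextRelease({"name": "column:name"})

attr_before = old.name    # resolved through __getattr__: the column
attr_after = new.name     # now the bound method, not the column
item_after = new["name"]  # bracket access is unaffected
```

User code written against the old class keeps running against the new one, but
`df.name` quietly changes from a column to a method, while `df["name"]` does not.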

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the
latter, which is future proof. This is the current solution in master
(https://github.com/apache/spark/pull/5971). But I think users may
still be unaware of the compatibility issue and prefer `df.abcd` to
`df["abcd"]` because the former can be auto-completed.
2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
JIRA page: "I actually dragged my feet on the __getattr__ issue for
several months back in the day, then finally added it (and tab
completion in IPython with __dir__), and immediately noticed a huge
quality-of-life improvement when using pandas for actual (esp.
interactive) work."
3. Replace df.abcd by df.abcd_ (with a suffix _). Both df.abcd_ and
df["abcd"] would be future proof, and df.abcd_ could be
auto-completed. The tradeoff is the extra _ appearing in
the code.
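
Option 3 could be sketched roughly as follows. `SuffixFrame` is a hypothetical
toy class, not PySpark code, and the sketch assumes that no DataFrame method
name will ever end in an underscore, which is what makes the suffixed form
collision free.

```python
class SuffixFrame:
    """Hypothetical sketch of option 3; not PySpark code."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __getattr__(self, name):
        columns = self.__dict__.get("_columns", {})
        # Only names ending in "_" are treated as column references.
        if name.endswith("_") and name[:-1] in columns:
            return columns[name[:-1]]
        raise AttributeError(name)

    def __dir__(self):
        # Advertise the suffixed names so tab completion can offer them.
        suffixed = {column + "_" for column in self._columns}
        return sorted(set(super().__dir__()) | suffixed)

    def name(self):
        # A method added later no longer collides with the column "name".
        return "frame-name"


df = SuffixFrame({"name": "column:name", "age": "column:age"})

column_value = df.name_   # the column called "name"
method_value = df.name()  # the method called name()
```

Here `dir(df)` includes `name_` and `age_`, which is what an interactive shell
would use for completion.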

My preference is 3 > 1 > 2. Your inputs would be greatly appreciated. Thanks!

Best,
Xiangrui




Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Punyashloka Biswal
Is there a foolproof way to access methods exclusively (instead of picking
between columns and methods at runtime)? Here are two ideas, neither of
which seems particularly Pythonic:

   - pyspark.sql.methods(df).name()
   - df.__methods__.name()
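
The first idea might be sketched like this. Neither `methods` nor
`_MethodsOnly` exists in pyspark.sql; both names, and the toy `ToyFrame`
class, are hypothetical and only illustrate a proxy that resolves names
against the class and never against the columns.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the PySpark class."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def name(self):
        return "frame-name"


class _MethodsOnly:
    """Proxy that exposes only attributes defined on the frame's class."""

    def __init__(self, frame):
        self._frame = frame

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        frame = self._frame
        # Look on the class, so the instance-level column fallback in
        # ToyFrame.__getattr__ is never consulted.
        if not hasattr(type(frame), name):
            raise AttributeError(
                "%s has no method %r" % (type(frame).__name__, name)
            )
        return getattr(frame, name)


def methods(frame):
    return _MethodsOnly(frame)


df = ToyFrame({"name": "column:name", "age": "column:age"})

method_result = methods(df).name()  # always the method, never the column

try:
    methods(df).age                 # "age" is only a column
    age_is_method = True
except AttributeError:
    age_is_method = False
```

With such a proxy, code that means to call a method fails loudly if the method
does not exist, instead of silently receiving a column.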

Punya




Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Nicholas Chammas
And a link to SPARK-7035
https://issues.apache.org/jira/browse/SPARK-7035 (which
Xiangrui mentioned in his initial email) for the lazy.
