Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much larger than on a Pandas one. Given that users seem to be okay with the possibility of collisions in Pandas, I think sticking with (1) is not a bad idea.

Also, is it possible to detect such collisions in Python? A (4)th option might be to detect that `df` contains a column named `name` and print a warning in `df.name` which tells the user that the method is overriding the column.

Thanks
Shivaram

On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng men...@gmail.com wrote:

> [...]
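The (4)th option above, warning when a DataFrame attribute hides a column of the same name, might look roughly like the following. This is a minimal sketch on a toy class, not PySpark code: `WarningFrame`, its `count` method, and the warning text are all made up for illustration. It intercepts every attribute access with `__getattribute__`, so it warns on any access to a colliding name, including legitimate method calls.

```python
import warnings


class WarningFrame:
    """Toy frame (hypothetical) that warns when a class attribute hides a column."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def count(self):
        # Stand-in for a real DataFrame method.
        return len(self._columns)

    def __getitem__(self, name):
        # df["abcd"]: always a column lookup.
        return self._columns[name]

    def __getattribute__(self, name):
        # Runs on every attribute access, before normal lookup.
        if not name.startswith("_"):
            columns = object.__getattribute__(self, "__dict__").get("_columns", {})
            if name in columns and hasattr(type(self), name):
                warnings.warn(
                    f"df.{name} resolves to a DataFrame attribute, not the "
                    f"column {name!r}; use df[{name!r}] for the column",
                    stacklevel=2,
                )
        return object.__getattribute__(self, name)

    def __getattr__(self, name):
        # Called only when normal lookup fails: fall back to columns.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


df = WarningFrame({"count": "column:count", "age": "column:age"})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    method = df.count  # collides with the method: warns
    age = df.age       # plain column access: no warning
```

Because the check cannot tell `df.select(df.count)` apart from `df.count()`, a real implementation would probably need something more selective than this.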
Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

> I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much larger than on a Pandas one. Given that users seem to be okay with the possibility of collisions in Pandas, I think sticking with (1) is not a bad idea.

This is true for interactive work. Spark's DataFrames can handle really large datasets, which might be used in production workflows. So I think it is reasonable for us to care more about compatibility issues than Pandas does.

> Also, is it possible to detect such collisions in Python? A (4)th option might be to detect that `df` contains a column named `name` and print a warning in `df.name` which tells the user that the method is overriding the column.

Maybe we can inspect the frame in which `df.name` gets called and warn users in `df.select(df.name)` but not in `name = df.name`. This could be tricky to implement.

-Xiangrui
Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
Hi all,

In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 about compatibility issues with the __getattr__ approach, and I want to collect more input on this.

Basically, if we introduce a new method to DataFrame in the future, it may break user code that uses the same attribute name to reference a column, or silently change its behavior. For example, if we add name() to DataFrame in the next release, all existing code using `df.name` to reference a column called name will break. If we add `name` as a property instead of a method, all existing code using `df.name` may still work but with a different meaning: `df.select(df.name)` no longer selects the column called name but the column whose name equals the value of `df.name`.

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the latter, which is future proof. This is the current solution in master (https://github.com/apache/spark/pull/5971). But I think users may still be unaware of the compatibility issue and prefer `df.abcd` to `df["abcd"]` because the former can be auto-completed.

2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the JIRA page: "I actually dragged my feet on the __getattr__ issue for several months back in the day, then finally added it (and tab completion in IPython with __dir__), and immediately noticed a huge quality-of-life improvement when using pandas for actual (esp. interactive) work."

3. Replace df.abcd by df.abcd_ (with a suffix _). Both df.abcd_ and df["abcd"] would be future proof, and df.abcd_ could be auto-completed. The tradeoff is apparently the extra _ appearing in the code.

My preference is 3 > 1 > 2. Your inputs would be greatly appreciated. Thanks!

Best,
Xiangrui
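To make the collision concrete, here is a small self-contained sketch. It uses toy classes rather than PySpark itself: `ToyFrame`, `ToyFrameNextRelease`, `SuffixFrame`, and the `name` property are all hypothetical names chosen for illustration. It relies on one fact about Python: `__getattr__` is consulted only after normal attribute lookup fails, so any attribute defined on the class wins over a column of the same name. The last class sketches option 3 under the assumption that the library itself never defines attributes ending in an underscore.

```python
class ToyFrame:
    """Toy stand-in for a DataFrame; not the PySpark implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        # df["abcd"]: always a column lookup.
        return self._columns[name]

    def __getattr__(self, name):
        # df.abcd: Python calls __getattr__ only after normal attribute
        # lookup (instance, class, base classes) has failed.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


class ToyFrameNextRelease(ToyFrame):
    """The same class after a hypothetical release adds a `name` property."""

    @property
    def name(self):
        return "frame-label"


class SuffixFrame(ToyFrameNextRelease):
    """Option 3 sketch: a trailing underscore always means 'column'."""

    def __getattr__(self, name):
        columns = self.__dict__.get("_columns", {})
        if name.endswith("_") and name[:-1] in columns:
            return columns[name[:-1]]
        raise AttributeError(name)


cols = {"name": "column:name", "age": "column:age"}
old, new, suffixed = ToyFrame(cols), ToyFrameNextRelease(cols), SuffixFrame(cols)

print(old.name)        # column:name
print(new.name)        # frame-label (the property now shadows the column)
print(new["name"])     # column:name (item access is unaffected)
print(suffixed.name_)  # column:name
```

In this sketch the code that wrote `df.name` keeps running after the upgrade but gets a different object back, which is the silent change in meaning described above.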
Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
Is there a foolproof way to access methods exclusively (instead of picking between columns and methods at runtime)? Here are two ideas, neither of which seems particularly Pythonic:

- pyspark.sql.methods(df).name()
- df.__methods__.name()

Punya

On Fri, May 8, 2015 at 10:06 AM Nicholas Chammas nicholas.cham...@gmail.com wrote:

> And a link to SPARK-7035 https://issues.apache.org/jira/browse/SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy.
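One way the first idea could work is a proxy that resolves names against the frame's class only, so columns can never be picked by accident. The sketch below is an assumption about how that might look, not an existing PySpark API: `MethodsOnly` and the toy `ToyFrame.name()` method are hypothetical. It uses the standard library's `inspect.getattr_static`, which looks an attribute up without triggering `__getattr__`.

```python
import inspect


class ToyFrame:
    """Toy stand-in for a DataFrame; not the PySpark implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def name(self):
        # Hypothetical method that collides with a column called "name".
        return "frame-label"

    def __getattr__(self, name):
        # Column fallback, reached only when normal lookup fails.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)


class MethodsOnly:
    """Hypothetical proxy in the spirit of pyspark.sql.methods(df)."""

    def __init__(self, frame):
        self._frame = frame

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        frame = self._frame
        # Static lookup on the class: the frame's __getattr__ is never
        # called, so column names are invisible here.
        raw = inspect.getattr_static(type(frame), name)
        if hasattr(raw, "__get__"):
            # Bind functions and evaluate properties against the frame.
            return raw.__get__(frame, type(frame))
        return raw


df = ToyFrame({"name": "column:name", "age": "column:age"})
print(MethodsOnly(df).name())  # frame-label
print(df.age)                  # column:age

try:
    MethodsOnly(df).age
    found_age = True
except AttributeError:
    found_age = False  # "age" is only a column, so the proxy rejects it
```

With this shape, a typo or a missing method fails with `AttributeError` instead of quietly returning a column.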
Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
And a link to SPARK-7035 https://issues.apache.org/jira/browse/SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy.

On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng men...@gmail.com wrote:

> [...]

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org