Is there a foolproof way to access methods exclusively (instead of picking between columns and methods at runtime)? Here are two ideas, neither of which seems particularly Pythonic:

- pyspark.sql.methods(df).name()
- df.__methods__.name()

Punya

On Fri, May 8, 2015 at 10:06 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> And a link to SPARK-7035
> <https://issues.apache.org/jira/browse/SPARK-7035> (which
> Xiangrui mentioned in his initial email) for the lazy.
>
> On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng <men...@gmail.com> wrote:
>
> > On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
> > <shiva...@eecs.berkeley.edu> wrote:
> > > I don't know much about Python style, but I think the point Wes made
> > > about usability on the JIRA is pretty powerful. IMHO the number of
> > > methods on a Spark DataFrame might not be much more compared to Pandas.
> > > Given that it looks like users are okay with the possibility of
> > > collisions in Pandas I think sticking with (1) is not a bad idea.
> > >
> >
> > This is true for interactive work. Spark's DataFrames can handle
> > really large datasets, which might be used in production workflows. So
> > I think it is reasonable for us to care more about compatibility
> > issues than Pandas.
> >
> > > Also is it possible to detect such collisions in Python? A (4)th
> > > option might be to detect that `df` contains a column named `name` and
> > > print a warning in `df.name` which tells the user that the method is
> > > overriding the column.
> >
> > Maybe we can inspect the frame where `df.name` gets called and warn
> > users in `df.select(df.name)` but not in `name = df.name`. This could
> > be tricky to implement.
> >
> > -Xiangrui
> >
> > >
> > > Thanks
> > > Shivaram
> > >
> > >
> > > On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> In PySpark, a DataFrame column can be referenced using df["abcd"]
> > >> (__getitem__) and df.abcd (__getattr__). There is a discussion on
> > >> SPARK-7035 on compatibility issues with the __getattr__ approach, and
> > >> I want to collect more input on this.
> > >>
> > >> Basically, if in the future we introduce a new method to DataFrame, it
> > >> may break user code that uses the same attr to reference a column or
> > >> silently change its behavior. For example, if we add name() to
> > >> DataFrame in the next release, all existing code using `df.name` to
> > >> reference a column called "name" will break. If we add `name` as a
> > >> property instead of a method, all existing code using `df.name` may
> > >> still work but with a different meaning. `df.select(df.name)` no
> > >> longer selects the column called "name" but the column that has the
> > >> same name as `df.name`.
> > >>
> > >> There are several proposed solutions:
> > >>
> > >> 1. Keep both df.abcd and df["abcd"], and encourage users to use the
> > >> latter, which is future proof. This is the current solution in master
> > >> (https://github.com/apache/spark/pull/5971). But I think users may
> > >> still be unaware of the compatibility issue and prefer `df.abcd` to
> > >> `df["abcd"]` because the former could be auto-completed.
> > >> 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the
> > >> JIRA page: "I actually dragged my feet on the __getattr__ issue for
> > >> several months back in the day, then finally added it (and tab
> > >> completion in IPython with __dir__), and immediately noticed a huge
> > >> quality-of-life improvement when using pandas for actual (esp.
> > >> interactive) work."
> > >> 3. Replace df.abcd by df.abcd_ (with a suffix "_"). Both df.abcd_ and
> > >> df["abcd"] would be future proof, and df.abcd_ could be
> > >> auto-completed. The tradeoff is apparently the extra "_" appearing in
> > >> the code.
> > >>
> > >> My preference is 3 > 1 > 2. Your input would be greatly appreciated.
> > >> Thanks!
> > >>
> > >> Best,
> > >> Xiangrui
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > >> For additional commands, e-mail: dev-h...@spark.apache.org
> > >>
> > >
> >
>
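
For anyone skimming the thread, the collision described above can be reproduced without Spark. The sketch below is only a toy: `ToyFrame`, `ToyFrameNextRelease`, and `shadowed_columns` are hypothetical names made up for illustration, and none of this is PySpark's actual implementation. It assumes columns are resolved through `__getattr__`, which Python calls only after normal attribute lookup fails, so a method added in a later release shadows a column of the same name. It also sketches the trailing-underscore access from option 3 and a simple class-level collision check in the spirit of option (4), which is much cruder than the frame inspection Xiangrui mentions.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not PySpark's implementation."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, name):
        # df["abcd"] style access: always resolves to a column.
        if name not in self._columns:
            raise KeyError(name)
        return "col:" + name

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so any real method or property with this name wins over a column.
        if name.startswith("_"):
            # Never treat private/dunder names as columns.
            raise AttributeError(name)
        if name.endswith("_") and name[:-1] in self._columns:
            # Option 3 from the thread: df.abcd_ refers to column "abcd".
            return self[name[:-1]]
        if name in self._columns:
            # df.abcd style access, only reached if nothing shadows it.
            return self[name]
        raise AttributeError(name)


class ToyFrameNextRelease(ToyFrame):
    """Simulates a later release that adds a name() method."""

    def name(self):
        return "frame-name"


def shadowed_columns(frame):
    """Return the columns hidden by an attribute defined on the frame's class."""
    return [c for c in frame._columns if hasattr(type(frame), c)]
```

With these definitions, `ToyFrame(["name"]).name` resolves to the column, while the same expression on `ToyFrameNextRelease` returns the bound method. `df["name"]` and `df.name_` give the column in both cases, and `shadowed_columns` reports `name` only for the newer class.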