This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 10d7acc81c5 [SPARK-43889][PYTHON] add check for column name for
`__dir__()` to filter out illegal column name
10d7acc81c5 is described below
commit 10d7acc81c5cd1d14abccbe9fe8c600213fb6c30
Author: Beishao Cao <[email protected]>
AuthorDate: Wed May 31 15:58:47 2023 +0900
[SPARK-43889][PYTHON] add check for column name for `__dir__()` to filter
out illegal column name
### What changes were proposed in this pull request?
Add a check for `__dir__()` in `pyspark.sql.dataframe.DataFrame` to filter
out illegal column names (e.g. `name?1`, `name 1`, `2name`).
### Why are the changes needed?
1. `df.illegal_column_name` is not runnable (e.g. `df.name?1` raises an error).
2. This way, `df.|` won't suggest those illegal names. This behavior is
consistent with pandas.
3. Supplement for 2: this behavior is not consistent with `getattr`:
`getattr(df, 'column with space')` still works even though `df.column with
space` does not. `dir()` can only be consistent with one of the two. Pandas
keeps `dir()` consistent with dot notation, so we choose to conform with
pandas, even though there is an argument for the other behavior.
Example with this change:
https://github.com/apache/spark/assets/109033553/a3238b5a-53b6-4994-8f11-c804a5aab53b
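The filtering criterion is plain `str.isidentifier()`. A minimal standalone sketch of the idea, using the illegal names from this description as sample data (not a real DataFrame):

```python
# Candidate column names; only valid Python identifiers survive the filter.
columns = ["id", "name?1", "name 1", "2name", "i_like_pancakes"]

# Same predicate the patch uses: str.isidentifier() rejects any name that
# cannot be written with dot notation (df.name?1 is a SyntaxError).
valid = sorted(filter(lambda s: s.isidentifier(), columns))
print(valid)  # ['i_like_pancakes', 'id']
```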
### Does this PR introduce _any_ user-facing change?
This will change the output of `dir(df)`. Users who call the private method
`df.__dir__()` directly will also notice an output and docstring difference
there.
### How was this patch tested?
New doctest with three assertions. Output from running only this test:
<img width="1052" alt="Screenshot 2023-05-30 at 11 12 04 AM"
src="https://github.com/apache/spark/assets/109033553/c727631a-1028-4a24-a341-680e741cec3f">
Also test in databricks notebook with mock code:
```
class DataFrameWithColAttrs(DataFrame):
    def __init__(self, df):
        super().__init__(df._jdf, df._sql_ctx if df._sql_ctx else df._session)

    def __dir__(self):
        attrs = set(super().__dir__())
        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
        return attrs
```
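For context, `dir()` consults `__dir__` under the hood, which is why overriding it changes what completion suggests. A self-contained sketch with a plain class standing in for the DataFrame (no Spark needed; `FakeDF` and its `columns` list are made up for illustration):

```python
class FakeDF:
    """Stand-in for a DataFrame with a mix of legal and illegal column names."""
    columns = ["id", "name 1", "2name"]

    def __dir__(self):
        # Mirror the patch: start from the default attribute set, then add
        # only those columns that are valid Python identifiers.
        attrs = set(super().__dir__())
        attrs.update(filter(lambda s: s.isidentifier(), self.columns))
        return attrs  # dir() sorts the result for us

listing = dir(FakeDF())
print("id" in listing, "name 1" in listing, "2name" in listing)  # True False False
```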
Closes #41393 from BeishaoCao-db/dir-CheckColumnName.
Authored-by: Beishao Cao <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/sql/dataframe.py | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index d98f025c50c..12c445de21d 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -3062,9 +3062,16 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
>>> df = df.withColumn('id2', lit(3))
>>> [attr for attr in dir(df) if attr[0] == 'i'][:7]  # result includes id2 and sorted
['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
+
+ Don't include columns that are not valid python identifiers.
+
+ >>> df = df.withColumn('1', lit(4))
+ >>> df = df.withColumn('name 1', lit(5))
+ >>> [attr for attr in dir(df) if attr[0] == 'i'][:7]  # Doesn't include 1 or name 1
+ ['i_like_pancakes', 'id', 'id2', 'inputFiles', 'intersect', 'intersectAll', 'isEmpty']
"""
attrs = set(super().__dir__())
- attrs.update(self.columns)
+ attrs.update(filter(lambda s: s.isidentifier(), self.columns))
return sorted(attrs)
@overload
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]