grundprinzip commented on code in PR #38879:
URL: https://github.com/apache/spark/pull/38879#discussion_r1040567686
##########
python/pyspark/sql/connect/plan.py:
##########
@@ -249,6 +280,7 @@ class Project(LogicalPlan):
     def __init__(self, child: Optional["LogicalPlan"], *columns: "ColumnOrName") -> None:
         super().__init__(child)
         self._raw_columns = list(columns)
+        _all_of(self._raw_columns, ColumnOrName)
Review Comment:
The problem is that the user-facing surface of Spark Connect is often called
from untyped code, so if we don't do any checking, the user gets horrible error
messages. For example:
```
>>> df.orderBy(df.id.asc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 202, in __repr__
    return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))
TypeError: 'Column' object is not iterable
```
In this case the error is even worse because it's hidden in the `__repr__`. Now
if you add a `collect()`:
```
>>> df.orderBy(df.id.asc).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 1391, in collect
    pdf = self.toPandas()
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 1402, in toPandas
    query = self._plan.to_proto(self._session.client)
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 88, in to_proto
    plan.root.CopyFrom(self.plan(session))
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 513, in plan
    plan.sort.sort_fields.extend([self.col_to_sort_field(x, session) for x in self.columns])
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 513, in <listcomp>
    plan.sort.sort_fields.extend([self.col_to_sort_field(x, session) for x in self.columns])
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 504, in col_to_sort_field
    sort = SortOrder(ColumnReference(name=col))
  File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/column.py", line 255, in __init__
    self._unparsed_identifier = name.name()
AttributeError: 'function' object has no attribute 'name'
```
The error is still not legible: nothing in it tells the user that `df.id.asc`
was passed without being called.
Yes, for dictionaries this is not applicable.
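
To illustrate the kind of eager check being argued for, here is a minimal sketch. It is not the `_all_of` helper from this PR, whose implementation is not shown here; `check_all_of`, its `arg_name` parameter, and the stand-in `Column` class are all hypothetical names made up for the example. The idea is to validate arguments at construction time, so the error names the offending argument instead of surfacing later in `__repr__` or `to_proto`.

```python
from typing import Any, Iterable, Tuple, Type


class Column:
    """Stand-in for the Spark Connect Column class, for illustration only."""

    def __init__(self, name: str) -> None:
        self._name = name

    def asc(self) -> "Column":
        return self


def check_all_of(
    values: Iterable[Any], allowed: Tuple[Type[Any], ...], arg_name: str
) -> None:
    """Hypothetical helper: raise TypeError for the first value not of an allowed type."""
    expected = " or ".join(t.__name__ for t in allowed)
    for pos, value in enumerate(values):
        if not isinstance(value, allowed):
            # A callable here usually means a method was passed without "()".
            hint = " (did you forget to call it?)" if callable(value) else ""
            raise TypeError(
                f"{arg_name}[{pos}] must be {expected}, "
                f"got {type(value).__name__}{hint}"
            )


col = Column("id")
check_all_of([col.asc(), "name"], (Column, str), "columns")  # passes silently

try:
    check_all_of([col.asc], (Column, str), "columns")  # method passed, not called
except TypeError as exc:
    print(exc)
    # columns[0] must be Column or str, got method (did you forget to call it?)
```

The sketch takes a tuple of concrete classes rather than a `typing.Union` alias such as `ColumnOrName`, because `isinstance` does not accept `typing.Union` on every Python version.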