[GitHub] [spark] grundprinzip commented on a diff in pull request #38879: [SPARK-41362][CONNECT][PYTHON] Better error messages for invalid argument types.

GitBox Mon, 05 Dec 2022 22:56:17 -0800


grundprinzip commented on code in PR #38879:
URL: https://github.com/apache/spark/pull/38879#discussion_r1040567686



##########
python/pyspark/sql/connect/plan.py:
##########
@@ -249,6 +280,7 @@ class Project(LogicalPlan):
     def __init__(self, child: Optional["LogicalPlan"], *columns: "ColumnOrName") -> None:
         super().__init__(child)
         self._raw_columns = list(columns)
+        _all_of(self._raw_columns, ColumnOrName)

Review Comment:
   The problem is that the user-facing surface of Spark Connect can contain lots of 
untyped code, so if we don't do any checking, the user gets horrible error 
messages. For example:
   
   ```
   >>> df.orderBy(df.id.asc)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 202, in __repr__
       return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))
   TypeError: 'Column' object is not iterable
   ```
   
   In this case the error is even worse because it's hidden in the `__repr__`. Now, if 
you add a `collect()`:
   
   ```
   >>> df.orderBy(df.id.asc).collect()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 1391, in collect
       pdf = self.toPandas()
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/dataframe.py", line 1402, in toPandas
       query = self._plan.to_proto(self._session.client)
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 88, in to_proto
       plan.root.CopyFrom(self.plan(session))
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 513, in plan
       plan.sort.sort_fields.extend([self.col_to_sort_field(x, session) for x in self.columns])
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 513, in <listcomp>
       plan.sort.sort_fields.extend([self.col_to_sort_field(x, session) for x in self.columns])
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/plan.py", line 504, in col_to_sort_field
       sort = SortOrder(ColumnReference(name=col))
     File "/Users/martin.grund/Development/spark/python/pyspark/sql/connect/column.py", line 255, in __init__
       self._unparsed_identifier = name.name()
   AttributeError: 'function' object has no attribute 'name'
   ```
   
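   The error is still not legible: nothing in it tells the user that `df.id.asc` 
was passed without being called (it should have been `df.id.asc()`).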
   
   Yes, for dictionaries this is not applicable.
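
   To illustrate the kind of eager check being argued for, here is a minimal sketch. It is not the `_all_of` helper from this PR, whose implementation is not shown in this thread. The names `check_all_of` and the stand-in `Column` class are hypothetical, and the allowed types are passed as a plain tuple because `isinstance` against a `typing.Union` alias such as `ColumnOrName` does not work on every Python version.

```python
from typing import Any, Iterable, Tuple


class Column:
    """Hypothetical stand-in for the Spark Connect Column class (illustration only)."""

    def __init__(self, name: str) -> None:
        self._name = name

    def asc(self) -> "Column":
        # A real implementation would return a sort expression.
        return self


def check_all_of(values: Iterable[Any], allowed: Tuple[type, ...], arg_name: str) -> None:
    """Raise a TypeError naming the first value whose type is not in `allowed`.

    Hypothetical helper: it fails at plan-construction time, so the user sees
    the offending argument instead of a failure deep inside plan serialization.
    """
    allowed_names = ", ".join(t.__name__ for t in allowed)
    for index, value in enumerate(values):
        if not isinstance(value, allowed):
            raise TypeError(
                f"Argument '{arg_name}' at position {index} must be one of "
                f"({allowed_names}), but got {type(value).__name__}: {value!r}"
            )


# Valid input passes silently.
check_all_of([Column("id"), "name"], (Column, str), "columns")

# Passing the uncalled method (the mistake from the tracebacks above) fails early.
try:
    check_all_of([Column("id").asc], (Column, str), "columns")
except TypeError as exc:
    print(exc)
```

   With a check like this in the plan constructor, the failure would point at the argument position and its actual type rather than at `__repr__` or `col_to_sort_field`.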
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

