santosh-d3vpl3x commented on code in PR #37335:
URL: https://github.com/apache/spark/pull/37335#discussion_r933033449
##########
python/pyspark/sql/dataframe.py:
##########
@@ -3244,10 +3244,14 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":
# type: ignore[misc]
else:
raise TypeError("col should be a string or a Column")
else:
- for col in cols:
- if not isinstance(col, str):
- raise TypeError("each col in the param list should be a string")
- jdf = self._jdf.drop(self._jseq(cols))
+ if all(isinstance(col, str) for col in cols):
+ jdf = self._jdf.drop(self._jseq(cols))
+ elif all(isinstance(col, Column) for col in cols):
+ jdf = self._jdf
+ for col in cols:
+ jdf = jdf.drop(col._jc) # type: ignore[union-attr]
Review Comment:
> Can we avoid looping here? This is super expensive in Spark SQL optimizer.
Excellent remark. I tried to find a proper way before going with this
approach. Please bear with me as the explanation is long-ish.
> Ideally we should add the signature `def drop(colNames: Column*)` on the
Scala side first, and the PySpark side directly invokes it.
I started out with this exact assumption, but it does not hold up because of
type erasure on the JVM. I need some input from committers before I can make a
sane choice.
1. Adding `def drop(cols: Column*)` would not work. `def drop(colNames:
String*)` already exists, so the new method is a double definition: both have
the same type after erasure, `(cols: Seq)org.apache.spark.sql.Dataset`.
2. We could add `def drop(col: Column, cols: Column*)`. That already helps on
the JVM, but it does not work from PySpark because `def drop(colNames: String*)`
takes precedence. `self._jdf.drop(self._jseq(cols, _to_java_column))` will
always resolve to `def drop(colNames: String*)` at runtime and raise an error.
This is actually worse than where we started.
3. Not recommended: We could change `def drop(colNames: String*)` to `def
drop(colName: String, colNames: String*)`, but that is a breaking change for
JVM users. The other bindings would be fine with it, though.
4. Compromise: We could add `def dropMultiple(cols: Column*)` and `def
dropMultiple(col: String, cols: String*)`. This is a middle ground, but it
comes at the cost of an addition to the public API. I am not a big fan of
adding public API as a compromise.
5. Add a helper function as a workaround. I am not sure what this would look
like.
6. Change Python signature.
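To make the looping concern concrete, here is a minimal sketch of the Python-side dispatch from the diff above that runs without a JVM. `FakeColumn` and `FakeJdf` are hypothetical stand-ins that only record calls; they are not PySpark classes. The final `TypeError` branch is also an assumption, since the diff hunk does not show what happens for mixed arguments.

```python
from typing import List, Tuple, Union


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column (not the real class)."""

    def __init__(self, name: str) -> None:
        self.name = name


class FakeJdf:
    """Hypothetical stand-in for the JVM Dataset handle; it only logs drop calls."""

    def __init__(self, log: List[Tuple[str, ...]]) -> None:
        self.log = log

    def drop(self, *args: Union[str, FakeColumn]) -> "FakeJdf":
        # One log entry per call, so the number of drop invocations is visible.
        self.log.append(tuple(a if isinstance(a, str) else a.name for a in args))
        return FakeJdf(self.log)


def drop(jdf: FakeJdf, *cols: Union[str, FakeColumn]) -> FakeJdf:
    """Mirror the branch structure of the diff for the multi-argument case."""
    if all(isinstance(c, str) for c in cols):
        # All names: a single call with every column name.
        return jdf.drop(*cols)
    if all(isinstance(c, FakeColumn) for c in cols):
        # All columns: one call per column, which is the loop under review.
        for c in cols:
            jdf = jdf.drop(c)
        return jdf
    # Assumed behavior for mixed input; not shown in the diff hunk.
    raise TypeError("each col in the param list should be a string or a Column")


names_log: List[Tuple[str, ...]] = []
drop(FakeJdf(names_log), "a", "b", "c")

columns_log: List[Tuple[str, ...]] = []
drop(FakeJdf(columns_log), FakeColumn("a"), FakeColumn("b"), FakeColumn("c"))

print(len(names_log), len(columns_log))
```

In this sketch the string branch records a single call and the column branch records one call per column. In real Spark each of those calls would add a step to the plan, which is the cost the reviewer is pointing at.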
What would you suggest, @HyukjinKwon?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]