[GitHub] [spark] santosh-d3vpl3x commented on a diff in pull request #37335: [SPARK-39895][PYTHON] Support multiple column drop

GitBox Fri, 29 Jul 2022 02:12:19 -0700


santosh-d3vpl3x commented on code in PR #37335:
URL: https://github.com/apache/spark/pull/37335#discussion_r933033449



##########
python/pyspark/sql/dataframe.py:
##########
@@ -3244,10 +3244,14 @@ def drop(self, *cols: "ColumnOrName") -> "DataFrame":  # type: ignore[misc]
             else:
                 raise TypeError("col should be a string or a Column")
         else:
-            for col in cols:
-                if not isinstance(col, str):
-                    raise TypeError("each col in the param list should be a string")
-            jdf = self._jdf.drop(self._jseq(cols))
+            if all(isinstance(col, str) for col in cols):
+                jdf = self._jdf.drop(self._jseq(cols))
+            elif all(isinstance(col, Column) for col in cols):
+                jdf = self._jdf
+                for col in cols:
+                    jdf = jdf.drop(col._jc)  # type: ignore[union-attr]

Review Comment:
   > Can we avoid looping here? This is super expensive in Spark SQL optimizer. 
   
   Excellent remark. I tried to find a proper way before going with this 
approach. Please bear with me as the explanation is long-ish.
   
   > Ideally we should add the signature of `def drop(colNames: Column*)` in 
Scala side first, and PySpark side directly invokes it.
   
   I started out with this exact assumption, but it doesn't hold up because of 
runtime type erasure. I need some input from committers before I can make a 
sane choice.
   1. Adding `def drop(cols: Column*)` wouldn't work. `def drop(colNames: 
String*)` already exists, so this leads to a double definition: both have the 
same type after erasure, `(cols: Seq)org.apache.spark.sql.Dataset`.
   2. We could add `def drop(col: Column, cols: Column*)`. That already helps 
on the JVM, but it doesn't work from PySpark because `def drop(colNames: 
String*)` takes precedence. `self._jdf.drop(self._jseq(cols, _to_java_column))` 
will always pick up `def drop(colNames: String*)` at runtime and fail with an 
error. This is actually worse than where we started.
   3. Not recommended: we could change `def drop(colNames: String*)` to `def 
drop(colName: String, colNames: String*)`, but that is a breaking change for 
JVM callers. The other bindings would be fine with it, though.
   4. Compromise: We could add `def dropMultiple(cols: Column*)` and `def 
dropMultiple(col: String, cols: String*)`. This is a middle ground but at the 
cost of an addition to public API. I am not a big fan of adding public API as a 
compromise.
   5. Add a helper function as a workaround. I am not sure what this would 
look like.
   6. Change the Python signature.
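
To make points 1 and 2 concrete, here is a toy model of the erasure clash, written in Python only because it is easy to run. It is a sketch, not how scalac or the JVM resolve overloads: the helper names (`erase`, `erased_signature`, `clashes`) are hypothetical, and it assumes a Scala vararg `T*` can be modelled as `Seq[T]`.

```python
from typing import Iterable, Tuple


def erase(param_type: str) -> str:
    """Strip generic type arguments, e.g. 'Seq[String]' -> 'Seq'."""
    return param_type.split("[", 1)[0].strip()


def erased_signature(params: Iterable[str]) -> Tuple[str, ...]:
    """Parameter list as it looks once type arguments are erased."""
    return tuple(erase(p) for p in params)


def clashes(existing: Iterable[str], candidate: Iterable[str]) -> bool:
    """True if the two overloads are indistinguishable after erasure."""
    return erased_signature(existing) == erased_signature(candidate)


# Assumption of this model: Scala varargs `T*` are represented as `Seq[T]`.
EXISTING = ["Seq[String]"]            # def drop(colNames: String*)
OPTION_1 = ["Seq[Column]"]            # def drop(cols: Column*)
OPTION_2 = ["Column", "Seq[Column]"]  # def drop(col: Column, cols: Column*)

print(clashes(EXISTING, OPTION_1))  # True: both erase to (Seq,)
print(clashes(EXISTING, OPTION_2))  # False: (Seq,) vs (Column, Seq)
```

The model only covers the compile-time double definition. It says nothing about which overload gets chosen when PySpark calls into the JVM, which is the separate problem in point 2.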
   
   What would you suggest @HyukjinKwon ?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


