[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

GitBox Wed, 21 Apr 2021 18:38:11 -0700


Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617626255




##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")
+        if not isinstance(col, (Column, list, tuple)):
+            raise TypeError("col must be a column or list/tuple of columns.")
+
+        # Covert the colName and col to list
+        col_names = [colName] if isinstance(colName, str) else colName
+        col = [col] if isinstance(col, Column) else col
+
+        # Covert tuple to list
+        col_names = list(col_names) if isinstance(col_names, tuple) else col_names
+        col = list(col) if isinstance(col, tuple) else col
+
+        return DataFrame(
+            self._jdf.withColumns(_to_seq(self._sc, col_names), self._jcols(col)),

Review comment:
Note that I call `withColumns` here, which is a [private method](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402) in Scala. If we call `withColumn` with these list arguments instead, py4j raises a method-mismatch error, because py4j matches the call against the signature of the Scala method.
   
Additional note: I didn't expose the **private withColumns** API in PySpark; this change only matches the capability of the Scala withColumn API. That is, the Scala [withColumn API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396) can now receive multiple columns, which it supports by calling the [internal withColumns API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396-L2402).
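
For illustration, the argument normalization in the hunk above can be sketched without a Spark session. This is only a rough sketch, not the code from the pull request: `FakeColumn` and `normalize_with_column_args` are hypothetical names introduced here, and `FakeColumn` merely stands in for `pyspark.sql.Column`. The py4j call to `withColumns` is left out.

```python
class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column so the sketch runs without Spark."""

    def __init__(self, expr):
        self.expr = expr


def normalize_with_column_args(col_name, col, column_type=FakeColumn):
    """Return (names, cols) as two lists, following the checks in the hunk above."""
    if not isinstance(col_name, (str, list, tuple)):
        raise TypeError("colName must be string or list/tuple of column names.")
    if not isinstance(col, (column_type, list, tuple)):
        raise TypeError("col must be a column or list/tuple of columns.")

    # Wrap a single name or column in a list; copy a list/tuple into a new list.
    names = [col_name] if isinstance(col_name, str) else list(col_name)
    cols = [col] if isinstance(col, column_type) else list(col)
    return names, cols


age2, age3 = FakeColumn("age + 2"), FakeColumn("age + 3")
single = normalize_with_column_args("age2", age2)
multiple = normalize_with_column_args(("age2", "age3"), (age2, age3))
```

Both call shapes end up as a pair of plain lists, which is the form the hunk then passes on to the JVM side through `_to_seq` and `_jcols`.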



