lucas-nelson-uiuc commented on code in PR #48947:
URL: https://github.com/apache/spark/pull/48947#discussion_r1857165004


##########
python/pyspark/sql/classic/dataframe.py:
##########
@@ -1699,6 +1700,19 @@ def transform(
         ), "Func returned an instance of type [%s], " "should have been 
DataFrame." % type(result)
         return result
 
+    def pipe(
+        self, *funcs: tuple[Callable[..., ParentDataFrame]]
+    ) -> ParentDataFrame:
+        result = functools.reduce(
+            lambda init, func: init.transform(func),

Review Comment:
   @HyukjinKwon I feel like the same could be said about `DataFrame.transform` - the source code for that method simply calls the passed function with optional positional and keyword arguments.
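   
   For reference, a rough sketch of that pattern (this is an illustration, not the exact Spark source; the assertion message mirrors the diff above):
   
   ```python
   from typing import Any, Callable
   
   from pyspark.sql import DataFrame
   
   
   # Illustrative sketch: apply the user-supplied function to the DataFrame,
   # forwarding any extra positional/keyword arguments, then check the result type.
   def transform_sketch(
       df: DataFrame, func: Callable[..., DataFrame], *args: Any, **kwargs: Any
   ) -> DataFrame:
       result = func(df, *args, **kwargs)
       assert isinstance(result, DataFrame), (
           "Func returned an instance of type [%s], should have been DataFrame."
           % type(result)
       )
       return result
   ```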
   
   I agree that this isn't a difficult implementation - however, it does seem like a natural next step for applying custom transformations. Right now, users can express a chain of transformations in one of three ways:
   - Nested functions: `h(g(f(input_df)))`
   - Transformation method: `input_df.transform(f).transform(g).transform(h)`
   - Pipe method: `input_df.pipe(f, g, h)`
   
   In my opinion, using the pipe method is much easier to read than nesting functions and more succinct than chaining transforms; a minimal sketch comparing the three styles is below.
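   
   For illustration, a minimal runnable sketch, assuming a standalone `pipe` helper built on `functools.reduce` and `DataFrame.transform` (the transformations `add_doubled` and `keep_large` are hypothetical, made up just for this example):
   
   ```python
   import functools
   from typing import Callable
   
   from pyspark.sql import DataFrame, SparkSession
   from pyspark.sql import functions as F
   
   
   # Hypothetical helper mirroring the proposed method's shape: fold each
   # function over the DataFrame by chaining DataFrame.transform.
   def pipe(df: DataFrame, *funcs: Callable[[DataFrame], DataFrame]) -> DataFrame:
       return functools.reduce(lambda acc, func: acc.transform(func), funcs, df)
   
   
   # Hypothetical example transformations.
   def add_doubled(df: DataFrame) -> DataFrame:
       return df.withColumn("doubled", F.col("value") * 2)
   
   
   def keep_large(df: DataFrame) -> DataFrame:
       return df.filter(F.col("doubled") > 2)
   
   
   spark = SparkSession.builder.getOrCreate()
   input_df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])
   
   # The three equivalent styles from the list above:
   nested = keep_large(add_doubled(input_df))
   chained = input_df.transform(add_doubled).transform(keep_large)
   piped = pipe(input_df, add_doubled, keep_large)
   ```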



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

