lucas-nelson-uiuc opened a new pull request, #48947:
URL: https://github.com/apache/spark/pull/48947

   Suggested implementation of a `pipe` method for the PySpark DataFrame. Like 
the `transform` method, it can be called directly on a DataFrame to apply 
custom transformation functions. However, unlike the current `transform` 
method, which requires one call per custom transformation, `pipe` accepts an 
arbitrary number of transformations and chains them together on the user's 
behalf.
   
   Using the example from the existing documentation for `DataFrame.transform`, 
the suggested `pipe` method would be used as follows:
   
   ```python
   from pyspark.sql.functions import col
   
   
   df = spark.createDataFrame([(1, 1.0), (2, 2.0)], ["int", "float"])
   
   def cast_all_to_int(input_df):
       return input_df.select([col(col_name).cast("int") for col_name in input_df.columns])
   
   def sort_columns_asc(input_df):
       return input_df.select(*sorted(input_df.columns))
   
   # with transform method
   df.transform(cast_all_to_int).transform(sort_columns_asc).show()
   
   # with pipe method
   df.pipe(cast_all_to_int, sort_columns_asc).show()
   ```
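
The description does not show how `pipe` chains its arguments, so the following is only a minimal sketch of one way to get that behavior: a left fold over the functions with `functools.reduce`. The free function `pipe` and the list-of-dicts stand-in for a DataFrame are assumptions made for illustration, chosen so the snippet runs without a Spark session. They are not the PR's implementation.

```python
import functools
from typing import Any, Callable


def pipe(obj: Any, *functions: Callable[[Any], Any]) -> Any:
    """Apply each function to the result of the previous one, left to right.

    Hypothetical helper: mirrors obj.transform(f1).transform(f2)...
    """
    return functools.reduce(lambda acc, func: func(acc), functions, obj)


# Plain-Python stand-ins for the DataFrame transformations above.
def cast_all_to_int(rows):
    return [{name: int(value) for name, value in row.items()} for row in rows]


def sort_columns_asc(rows):
    return [{name: row[name] for name in sorted(row)} for row in rows]


rows = [{"int": 1, "float": 1.0}, {"int": 2, "float": 2.0}]

# Nested calls and the piped call should agree.
chained = sort_columns_asc(cast_all_to_int(rows))
piped = pipe(rows, cast_all_to_int, sort_columns_asc)
```

With no functions, this sketch returns its input unchanged, because `reduce` falls back to the initial value.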
   
   For functions that take parameters, users can pass closures or partial 
functions (for example, built with `functools.partial`).
   
   ```python
   from typing import Callable
   import functools
   
   from pyspark.sql import DataFrame
   from pyspark.sql.functions import col
   
   def add_n(input_df, n):
       return input_df.select([(col(col_name) + n).alias(col_name)
                               for col_name in input_df.columns])
   
   # define a partial function
   add_one = functools.partial(add_n, n=1)
   
   # or, define a function that returns a closure
   # (named differently so it does not shadow add_n above)
   def make_add_n(n: int) -> Callable[[DataFrame], DataFrame]:
       def closure(input_df: DataFrame) -> DataFrame:
           return input_df.select([(col(col_name) + n).alias(col_name)
                                   for col_name in input_df.columns])
       return closure
   
   # with transform method
   df.transform(add_n, 1).transform(add_n, n=10).show()
   
   # with pipe method
   df.pipe(add_one, make_add_n(n=10)).show()
   ```
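
As a quick check that the two approaches above yield interchangeable single-argument callables, here is a small sketch on plain Python lists instead of DataFrames. The list-based `add_n` and the factory name `make_add_n` are hypothetical stand-ins used only so the snippet runs without Spark.

```python
import functools
from typing import Callable, List


def add_n(values: List[int], n: int) -> List[int]:
    return [value + n for value in values]


# Hypothetical factory: returns a closure that remembers n.
def make_add_n(n: int) -> Callable[[List[int]], List[int]]:
    def closure(values: List[int]) -> List[int]:
        return [value + n for value in values]
    return closure


add_one = functools.partial(add_n, n=1)  # partial application
add_ten = make_add_n(10)                 # closure

# Both callables take only the data, so they can be chained freely.
result = add_ten(add_one([1, 2]))
```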



