Armand BERGES created SPARK-36858:
-------------------------------------

             Summary: Spark API to apply same function to multiple columns
                 Key: SPARK-36858
                 URL: https://issues.apache.org/jira/browse/SPARK-36858
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.1.2, 2.4.8
            Reporter: Armand BERGES


Hi

My team and I regularly need to apply the same function to multiple 
columns at once.

For example, we want to remove all non-alphanumeric characters from every 
column of our DataFrames. 

When we first hit this use case, some people on my team were using this kind 
of code: 


{code:java}
val colListToClean = ... // Generate some list, could be very long.
val dfToClean: DataFrame = ... // This is the DataFrame we want to clean.
// Write some function to manipulate a column based on its name.
def cleanFunction(colName: String): Column = ...
val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) =>
  df.withColumn(colName, cleanFunction(colName)))
{code}

When applied to a large set of columns, this kind of code overloaded our 
driver, because each withColumn call generates a new DataFrame (one per column 
to clean).
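
One way to avoid creating a DataFrame per column is to build every column expression first and pass them all to a single select. The sketch below is only an illustration of that idea, not the code we run in production: the object name SingleProjectionClean, the helper cleanAll and the regular expression are assumptions made for the example, and it assumes column names without dots or other special characters.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, regexp_replace}

object SingleProjectionClean {
  // Hypothetical cleaning rule: drop every character that is not an ASCII letter or digit.
  def cleanFunction(colName: String): Column =
    regexp_replace(col(colName), "[^a-zA-Z0-9]", "")

  // Build the whole projection up front, then call select exactly once.
  // Columns listed in colListToClean are replaced in place; all others pass through unchanged.
  def cleanAll(df: DataFrame, colListToClean: Seq[String]): DataFrame = {
    val toClean = colListToClean.toSet
    val projection = df.columns.map { name =>
      if (toClean.contains(name)) cleanFunction(name).as(name) else col(name)
    }
    df.select(projection: _*)
  }
}
```

With this shape, the number of intermediate DataFrames no longer grows with the length of colListToClean, since only one projection is added to the plan.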

Because of this issue, we developed some code that adds two functions: 


 * One to apply the same function to multiple columns
 * One to rename multiple columns based on a Map. 
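
To make the request more concrete, here is a minimal sketch of what two such functions could look like as an extension of DataFrame. The names MultiColumnOps, applyToColumns and renameColumns are hypothetical and chosen only for this example; they are not existing Spark methods, and an actual API in Spark could look quite different. As above, it assumes simple column names.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object MultiColumnOps {
  // Hypothetical extension methods, for illustration only.
  implicit class MultiColumnDataFrame(df: DataFrame) {

    // Apply f to every column named in colNames, keeping column order and names.
    def applyToColumns(colNames: Seq[String])(f: Column => Column): DataFrame = {
      val targets = colNames.toSet
      val projection = df.columns.map { name =>
        if (targets.contains(name)) f(col(name)).as(name) else col(name)
      }
      df.select(projection: _*)
    }

    // Rename columns according to a Map(oldName -> newName).
    // Columns that are not keys of the map keep their current name.
    def renameColumns(renames: Map[String, String]): DataFrame = {
      val projection = df.columns.map(name => col(name).as(renames.getOrElse(name, name)))
      df.select(projection: _*)
    }
  }
}
```

After `import MultiColumnOps._`, a call could look like `df.applyToColumns(Seq("x", "y"))(upper).renameColumns(Map("x" -> "X"))`, where `upper` comes from org.apache.spark.sql.functions. Both methods go through a single select, so each call adds one projection regardless of how many columns are involved.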

 

I wonder whether anyone has already asked your team to add this kind of API. 
If so, did you run into any issues with the implementation? If not, is this 
an idea you could add to Spark? 

Best regards,

LvffY

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
