[ https://issues.apache.org/jira/browse/SPARK-36858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428596#comment-17428596 ]

Armand BERGES commented on SPARK-36858:
---------------------------------------

Honestly, I feel a little dumb not to have thought of this earlier ...

I fixed our implementation and it is much better! :)

To be clear, our method looks like this:


{code:java}
// df is assumed to be a mutable reference (a var) in the enclosing scope.
def withColumns(cols: Seq[String],
                columnTransform: String => Column,
                nameTransform: String => String = identity): DataFrame = {
  // See https://issues.apache.org/jira/browse/SPARK-36858
  cols.foreach { colName =>
    df = df.withColumn(nameTransform(colName), columnTransform(colName))
  }
  df
}
{code}
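
For reference, a call site could look like this (the column names, the regexp, and the {{clean_}} prefix are purely illustrative, just to show the signature in use):

{code:java}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical usage: strip non-alphanumeric characters from two columns
// and give the cleaned columns a "clean_" prefix.
val cleaned = withColumns(
  Seq("firstName", "lastName"),
  colName => regexp_replace(col(colName), "[^a-zA-Z0-9]", ""),
  colName => s"clean_$colName")
{code}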

I think the method signature could easily be improved, and we could discuss 
it.

Based on your comment, this ticket could probably be changed into "add a 
question to some tutorial" to keep newcomers from falling into the trap I 
mentioned.

Of course, if Spark implemented this method with a nice API, it would be even 
easier to avoid this trap :) 
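
For what it's worth, one way to avoid chaining {{withColumn}} calls entirely is to build a single projection with {{select}}, so only one new DataFrame is produced no matter how many columns are transformed. A sketch (the {{transformAll}} name is mine, not an existing Spark API):

{code:java}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Sketch: apply one transformation to many columns in a single select,
// replacing the listed columns in place and passing the others through.
def transformAll(df: DataFrame,
                 cols: Seq[String],
                 columnTransform: String => Column): DataFrame = {
  val projected = df.columns.map { c =>
    if (cols.contains(c)) columnTransform(c).as(c) else col(c)
  }
  df.select(projected: _*)
}
{code}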

> Spark API to apply same function to multiple columns
> ----------------------------------------------------
>
>                 Key: SPARK-36858
>                 URL: https://issues.apache.org/jira/browse/SPARK-36858
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.8, 3.1.2
>            Reporter: Armand BERGES
>            Priority: Minor
>
> Hi
> My team and I regularly need to apply the same function to multiple 
> columns at once.
> For example, we want to remove all non-alphanumeric characters from each 
> column of our dataframes. 
> When we first hit this use case, some people on my team were using this kind 
> of code: 
> {code:java}
> val colListToClean = .... // Generate some list; could be very long.
> val dfToClean: DataFrame = ... // This is the dataframe we want to clean.
> def cleanFunction(colName: String): Column = ... // Some function that
> // manipulates a column based on its name.
> val dfCleaned = colListToClean.foldLeft(dfToClean)((df, colName) =>
>   df.withColumn(colName, cleanFunction(colName)))
> {code}
> This kind of code, when applied to a large set of columns, overloaded our 
> driver (because a new DataFrame is generated for each column to clean).
> Based on this issue, we developed some code to add two functions:
>  * One to apply the same function to multiple columns
>  * One to rename multiple columns based on a Map
>  
> I wonder if your team has ever been asked to add this kind of API? If you 
> have, did you run into any issues with the implementation? If you haven't, 
> is this an idea you could add to Spark? 
> Best regards, 
>  
> LvffY
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
