drernie commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644


   My experience (and that of others) suggests that repeatedly calling withColumn is 
highly inefficient:
   
   
https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588
   
   The suggested alternative is using select in a very non-obvious way:
   ```python
   from pyspark.sql import functions as F

   # df is an existing DataFrame; windowval is a previously defined Window spec
   df.select(
       "*",  # selects all existing columns
       *[
           F.sum(col).over(windowval).alias(col_name)
           for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
       ]
   )
   ```
   This use of select doesn't even seem to be documented for Python:
   
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html
   
   I would greatly appreciate this API being made available, as it would 
improve both the performance and the reliability of my notebooks.
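   For context, a minimal self-contained sketch contrasting the two patterns 
(assuming pyspark is installed; the sample data, column names, and the 
`windowval` definition here are illustrative, not from the PR):

   ```python
   from pyspark.sql import SparkSession, Window
   from pyspark.sql import functions as F

   spark = (
       SparkSession.builder.master("local[1]").appName("cumsum-demo").getOrCreate()
   )

   df = spark.createDataFrame(
       [(1, 10, 100, 1000), (2, 20, 200, 2000), (3, 30, 300, 3000)],
       ["id", "A", "B", "C"],
   )
   # Running-total window: all rows from the start up to the current row.
   windowval = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)

   # Antipattern: each withColumn call creates a new DataFrame with a new
   # analyzed plan, so long chains make query planning increasingly slow.
   slow = df
   for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"]):
       slow = slow.withColumn(col_name, F.sum(col).over(windowval))

   # Single projection: one select adds all the columns in one plan step.
   fast = df.select(
       "*",
       *[
           F.sum(col).over(windowval).alias(col_name)
           for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
       ],
   )
   ```

   Both DataFrames produce the same rows; the difference is in how many 
plan transformations the driver performs before execution.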


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


