drernie commented on pull request #32431: URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644
My experience (and others') suggests that repeatedly calling `withColumn` is highly inefficient: https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588

The suggested alternative is to use `select` in a very non-obvious way:

```python
df.select(
    "*",  # selects all existing columns
    *[
        F.sum(col).over(windowval).alias(col_name)
        for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
    ],
)
```

This usage doesn't even seem to be documented for Python: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html

I would greatly appreciate this API being made available, as it would greatly improve the performance and reliability of my notebooks.
