mgaido91 commented on issue #23285: [SPARK-26224][SQL] Avoid creating many project on subsequent calls to withColumn
URL: https://github.com/apache/spark/pull/23285#issuecomment-447268395

@HeartSaVioR I am just telling you what my experience is: I remember that in one of my very first projects with Spark I also used `withColumn` in a for loop, because it was easier and more convenient for me to work with one expression at a time. Once I realized that it was a bad idea for this reason, I came to think that having a `withColumns` or using `select` doesn't make a big difference, since in either case you have to build your columns in advance and then pass them to the method. In this sense, I don't see the `withColumns` method being useful.

As an alternative, I'd propose to check, when calling `withColumn`, whether there are several Project nodes on top of the plan (we can define a threshold, e.g. 50), and in that case emit a warning saying something like: "Your plan contains many Project nodes on top of each other. This usually happens if you are using withColumn in a for loop and you are adding many columns. Doing this is highly discouraged and can cause serious issues. Please use a single select and add all your new columns to it instead."

What do you think? cc @cloud-fan @viirya too.
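
To make the proposed check concrete, here is a minimal sketch in plain Python. It is a toy model only, not Spark code: `PlanNode`, `count_top_projects`, `with_column` and `PROJECT_WARNING_THRESHOLD` are hypothetical names invented for this illustration, and the choice to inspect the incoming plan before the new Project is stacked is an assumption, since the comment does not specify it.

```python
import warnings
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; the comment only suggests "e.g. 50".
PROJECT_WARNING_THRESHOLD = 50

WARNING_TEXT = (
    "Your plan contains many Project nodes on top of each other. "
    "This usually happens if you are using withColumn in a for loop "
    "and you are adding many columns. Please use a single select and "
    "add all your new columns to it instead."
)


@dataclass
class PlanNode:
    """Toy stand-in for a logical plan node (not Spark's LogicalPlan)."""
    name: str
    child: Optional["PlanNode"] = None


def count_top_projects(plan: Optional[PlanNode]) -> int:
    """Count consecutive Project nodes, starting at the top of the plan."""
    count = 0
    node = plan
    while node is not None and node.name == "Project":
        count += 1
        node = node.child
    return count


def with_column(plan: PlanNode,
                threshold: int = PROJECT_WARNING_THRESHOLD) -> PlanNode:
    """Mimic withColumn: stack one more Project on top of the plan.

    Warns when the incoming plan already has `threshold` or more
    Project nodes stacked at the top (an assumed reading of the proposal).
    """
    if count_top_projects(plan) >= threshold:
        warnings.warn(WARNING_TEXT, UserWarning, stacklevel=2)
    return PlanNode("Project", child=plan)
```

In this model, calling `with_column` in a loop on `PlanNode("Relation")` stays silent until the stack of Project nodes reaches the threshold, after which each further call issues the warning. A real implementation would have to walk Spark's actual plan classes and use Spark's own logging rather than Python's `warnings` module.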
