mgaido91 commented on issue #23285: [SPARK-26224][SQL] Avoid creating many 
project on subsequent calls to withColumn
URL: https://github.com/apache/spark/pull/23285#issuecomment-447268395
 
 
   @HeartSaVioR I am just sharing my experience: in one of my very first 
projects with Spark I also used `withColumn` in a for loop, because it was 
easier and more convenient for me to work with one expression at a time. Once 
I realized that it was a bad idea for this reason, having a `withColumns` 
method or using `select` made little difference, since in either case you have 
to build your columns in advance and then pass them to the method. In this 
sense, I don't see the `withColumns` method being useful.
   
   As an alternative, I'd propose checking, when `withColumn` is called, 
whether there are several Project nodes stacked on top of the plan (we can 
define a threshold, e.g. 50), and in that case emitting a warning saying 
something like: "Your plan contains many Project nodes on top of each other. 
This usually happens if you are using withColumn in a for loop and you are 
adding many columns. Doing this is highly discouraged and can cause serious 
issues. Please use a single select and add all your new columns to it 
instead." What do you think? cc @cloud-fan @viirya too.
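
   To make the proposal concrete, here is a rough sketch in plain Python. It 
is a toy model and not Spark code: `Plan`, `stacked_projects`, `with_column`, 
`select` and `PROJECT_WARN_THRESHOLD` are all hypothetical names, and the toy 
`with_column` ignores the case of replacing an existing column. It only shows 
the idea of counting the Project nodes at the top of the plan and warning 
once the count reaches the threshold.

```python
import warnings
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical threshold, taken from the "e.g. 50" suggested above.
PROJECT_WARN_THRESHOLD = 50

WARNING_TEXT = (
    "Your plan contains many Project nodes on top of each other. "
    "This usually happens if you are using withColumn in a for loop and you "
    "are adding many columns. Please use a single select and add all your "
    "new columns to it instead."
)


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node: a leaf relation or a Project over a child."""
    kind: str                       # "Relation" or "Project"
    columns: Tuple[str, ...]
    child: Optional["Plan"] = None


def stacked_projects(plan: Plan) -> int:
    """Count consecutive Project nodes, starting from the top of the plan."""
    depth = 0
    node: Optional[Plan] = plan
    while node is not None and node.kind == "Project":
        depth += 1
        node = node.child
    return depth


def with_column(plan: Plan, name: str) -> Plan:
    """Toy withColumn: adds one Project, warning if the stack is already deep."""
    if stacked_projects(plan) >= PROJECT_WARN_THRESHOLD:
        warnings.warn(WARNING_TEXT, stacklevel=2)
    return Plan("Project", plan.columns + (name,), plan)


def select(plan: Plan, names: Iterable[str]) -> Plan:
    """Toy select: adds all the new columns with a single Project."""
    return Plan("Project", plan.columns + tuple(names), plan)
```

   In this model, calling `with_column` in a loop over 60 names stacks 60 
Project nodes and starts warning once 50 are already present, while one 
`select` call with the same 60 names produces the same columns under a single 
Project node and never warns.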
