hvanhovell commented on PR #38793: URL: https://github.com/apache/spark/pull/38793#issuecomment-1329420695
@grundprinzip IMO we should go for a server-side implementation of `withColumns`, for the following reasons:

- Connect in its current form is lazy. Going the project route would violate that, because we would need to execute an RPC to get the schema. This in itself is not a problem; however, when the schema of the table changes you might end up with missing columns or errors.
- Code reuse. As @amaliujia mentioned, we are likely to have `withColumns` in multiple clients (Scala will definitely have it as well). I want to drive home the point that `withColumns` has some non-trivial name resolution logic (name parsing and a bunch of other things); I think it is better to do it once and have a consistent user experience.
- Customers love to stack `withColumn` calls on top of each other (e.g. to do analysis of all columns in a dataframe in a loop; see the sketch after this list). I would love to move this into the analyzer at some point, which should make it quite a bit faster.

Taking a step back, I think we need some decision framework for where to implement what. In this case the logic is non-trivial, and I think it belongs on the server.
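For illustration, here is a minimal sketch of the stacking pattern described above, written against the PySpark DataFrame API (assuming Spark 3.3+, where `withColumns` accepts a dict; the column names are hypothetical):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
cols = ["a", "b", "c"]

# Stacking withColumn in a loop adds one Project node per call,
# each of which the analyzer must resolve layer by layer.
df = spark.range(10)
for c in cols:
    df = df.withColumn(c, F.lit(0))

# A single withColumns call expresses the same transformation as one
# operation, which a server-side implementation can resolve once.
df2 = spark.range(10).withColumns({c: F.lit(0) for c in cols})
```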
