hvanhovell commented on PR #38793: URL: https://github.com/apache/spark/pull/38793#issuecomment-1329420695
@grundprinzip IMO we should go for a server-side implementation of `withColumns`, for the following reasons:

- Connect in its current form is lazy. Going the project route would violate that, because we would need to execute an RPC to get the schema. This in itself is not a problem; however, when the schema of the table changes you might end up with missing columns or errors.
- Code reuse. As @amaliujia mentioned, we are likely to have `withColumns` in multiple clients (Scala will definitely have it as well). I want to drive home the point that `withColumns` has some non-trivial name resolution logic (name parsing and a bunch of other things); I think it is better to do it once and have a consistent user experience.
- Customers love to stack `withColumn` calls on top of each other (e.g. to do analysis of all columns in a dataframe in a loop; see the sketch after this list). I would love to move this into the analyzer at some point, which should make it quite a bit faster.

Taking a step back, I think we need some decision framework for where to implement what. In this case the logic is non-trivial, and I think it belongs on the server.
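For illustration, here is a minimal sketch of the stacking pattern described above, written against the PySpark DataFrame API (assuming Spark 3.3+, where `withColumns` accepts a dict; the column names are hypothetical):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
cols = ["a", "b", "c"]

# Stacking withColumn in a loop adds one Project node per call,
# each of which the analyzer must resolve layer by layer.
df = spark.range(10)
for c in cols:
    df = df.withColumn(c, F.lit(0))

# A single withColumns call expresses the same transformation as one
# operation, which a server-side implementation can resolve once.
df2 = spark.range(10).withColumns({c: F.lit(0) for c in cols})
```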
