amaliujia commented on PR #37710:
URL: https://github.com/apache/spark/pull/37710#issuecomment-1245771985
My high-level concern about this init PR is that it introduces a lot of code
beyond what is required for a minimum viable product, and that code is not
well tested.
IMO, an MVP is just one that demonstrates the client-server architecture:
the client can do a single select from a table, and the server side executes
the plan and returns the data. This PR adds a lot more to the API surface,
like join, aggregate, etc., which is beyond the MVP requirement. And because
those surfaces are not well tested, I would expect many follow-up
improvements to them.
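To make the MVP scope concrete, here is a toy sketch of what I mean (this is not the actual Spark Connect API; `SelectPlan`, `Server`, and `Client` are all hypothetical names for illustration): the client can express exactly one operation, a select from a named table, and the server executes that plan and returns rows. Everything else is explicitly unsupported.

```python
# Hypothetical MVP sketch, NOT the real Spark Connect protocol:
# the only plan the client can send is a single select from a table.
from dataclasses import dataclass


@dataclass
class SelectPlan:
    """The one and only relation an MVP client can express."""
    table: str
    columns: list


class Server:
    """Toy 'server': executes a SelectPlan against in-memory tables."""

    def __init__(self, tables):
        self.tables = tables  # table name -> list of row dicts

    def execute(self, plan):
        # Anything beyond the MVP surface is rejected explicitly.
        if not isinstance(plan, SelectPlan):
            raise NotImplementedError("MVP server only supports SelectPlan")
        rows = self.tables[plan.table]
        return [{c: row[c] for c in plan.columns} for row in rows]


class Client:
    """Toy client: builds a plan and ships it to the server."""

    def __init__(self, server):
        self.server = server

    def select(self, table, *columns):
        return self.server.execute(SelectPlan(table, list(columns)))
```

With a surface this small, every behavior the PR ships can be covered by tests from day one; anything not yet supported fails loudly instead of failing subtly.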
The direct consequence is that we will need to file commits afterwards to
update the dumped code. Many will fall into cleanup/bug fix/missing tests,
etc., but with an MVP this could be minimized. The following is an example of
what I was thinking of:
With this init PR, we could see commits like these in the future:
`[SPARK-xxxxx][CONNECT] fix outer join implementation due to xxx`
`[SPARK-xxxxx][CONNECT] add test for non-equality join condition`
`[SPARK-xxxxx][CONNECT] fix issues for nested join due to xxx`
`[SPARK-xxxxx][CONNECT] fix issues when join condition contains a function`
With an MVP that does not include join on the surface at the beginning, the
commits we bring in could be:
`[SPARK-xxxxx][CONNECT] init inner equality join support with tests. Throw
exception for non-inner joins`
`[SPARK-xxxxx][CONNECT][FOLLOWUP] improve inner equality join and more tests`
`[SPARK-xxxxx][CONNECT][FOLLOWUP] bug fix for inner equality join on xxx`
`[SPARK-xxxxx][CONNECT] extend join types to outer joins, with tests for
situations a, b, c, d...`
`[SPARK-xxxxx][CONNECT][FOLLOWUP] more tests for outer joins types on
uncovered situations f, k, h..`
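The first commit in that sequence could look something like the sketch below (again purely illustrative, not the real implementation; `join` and its row-dict representation are hypothetical): support exactly inner equality joins, and throw a clear exception for every other join type until it lands with its own tests.

```python
# Hypothetical sketch of the incremental approach: the initial commit
# implements ONLY inner equality joins; all other join types raise,
# rather than shipping untested code paths.
def join(left, right, key, how="inner"):
    """Inner equality join of two lists of row dicts on a single key.

    Non-inner join types are rejected explicitly; outer joins would be
    added in a later commit, together with their tests.
    """
    if how != "inner":
        raise NotImplementedError(
            f"join type {how!r} not supported yet; only 'inner'"
        )
    # Build a hash index on the right side, then probe with the left.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    out = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            merged = dict(lrow)
            merged.update(rrow)
            out.append(merged)
    return out
```

The point is not the join algorithm itself but the shape of the commit: a small, fully tested surface plus an explicit error for everything outside it, so later commits extend behavior rather than patch bugs.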
For the purpose of maintaining the quality of the commit history, the MVP
path might be better.
The longer-term impact is on the first release we want for the Connect
project. When we plan to release its first version, generally speaking we
won't want to ship APIs that are not well tested or are even functionally
incorrect. Eventually we will either make sure each API is well implemented
and tested, or block those APIs from being used, which already means we need
a version that is implemented well first and then incrementally append
high-quality changes. So why don't we target that from the very beginning?
My proposal is really to keep only the minimal code required to demonstrate
the client-server architecture, where the client can issue a simple operation.
This is not blocking this PR, and I am open on this concern. As long as the
Spark community is OK with the big dump, it's good to go. I just want to make
sure these perspectives are covered so we can make good decisions based on
them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]