amaliujia commented on PR #37710:
URL: https://github.com/apache/spark/pull/37710#issuecomment-1245771985

   My high-level concern about this init PR is that it introduces a lot of code 
beyond what is required for a minimum viable product, and that code is not well 
tested.
   
   IMO, an MVP is one that just demonstrates the client-server architecture: the 
client can do a single select from a table, and the server side executes the plan 
and returns the data. This PR adds a lot more to the API surface, like join, 
aggregate, etc., which is beyond the MVP requirement. And because those surfaces 
are not well tested, I would expect many improvements to them to follow.
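   To make the scope I have in mind concrete, here is a toy sketch of the MVP roundtrip: the client serializes a tiny "select from table" plan, and the server executes it and returns rows, rejecting everything else. All names here are illustrative only, not the actual Spark Connect API:

```python
# Hypothetical sketch of an MVP client-server roundtrip.
# None of these names come from the real Spark Connect code.
import json

# "Server" side: a toy catalog plus an executor that only understands
# a single-table select -- the entire MVP surface.
CATALOG = {"people": [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]}

def server_execute(serialized_plan: str) -> list:
    plan = json.loads(serialized_plan)
    if plan["op"] != "select":
        # MVP: anything beyond a simple select fails loudly instead of
        # shipping an untested code path.
        raise NotImplementedError(f"unsupported operation: {plan['op']}")
    rows = CATALOG[plan["table"]]
    return [{col: row[col] for col in plan["columns"]} for row in rows]

# "Client" side: builds the plan and ships it over the wire
# (here the wire is just a JSON string).
def client_select(table: str, columns: list) -> list:
    plan = json.dumps({"op": "select", "table": table, "columns": columns})
    return server_execute(plan)

print(client_select("people", ["name"]))  # [{'name': 'alice'}, {'name': 'bob'}]
```

   The point of the sketch is the `NotImplementedError` branch: in an MVP, join, aggregate, etc. are explicitly rejected until they arrive with tests, rather than being present but untested.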
   
   The direct consequence is that we will need to file commits afterwards to 
update the dumped code. Many will fall into cleanup/bug fix/missing tests, etc. 
With an MVP, this could be minimized. The following is an example of what I was 
thinking of:
   
   With this init PR, we could see commits like these in the future:
   `[SPARK-xxxxx][CONNECT] fix outer join implementation due to xxx`
   `[SPARK-xxxxx][CONNECT] add test for non-equality join condition`
   `[SPARK-xxxxx][CONNECT] fix issues for nested join due to xxx`
   `[SPARK-xxxxx][CONNECT] fix issues when join condition contains a function`
   
   With an MVP that does not have join on the surface at the beginning, the 
commits we bring in could instead be:
   `[SPARK-xxxxx][CONNECT] init inner equality join support with tests. Throw 
exception for non-inner joins`
   `[SPARK-xxxxx][CONNECT][FOLLOWUP] improve inner equality join and more tests`
   `[SPARK-xxxxx][CONNECT][FOLLOWUP] bug fix for inner equality join on xxx`
   `[SPARK-xxxxx][CONNECT] extend join types to outer joins, with tests for 
situations a, b, c, d...`
   `[SPARK-xxxxx][CONNECT][FOLLOWUP] more tests for outer joins types on 
uncovered situations f, k, h..`
   
   
   For the purpose of maintaining the quality of the commit history, the MVP path 
might be better.
   
   
   The longer-term impact is on the first release we want for the Connect 
project. When we plan to release its first version, generally speaking we won't 
want to ship APIs that are not well tested or are even functionally incorrect. 
Eventually we either make sure each API is well implemented and tested, or we 
choose to block those APIs from being used. Either way, that leads to needing a 
version that is implemented well first, onto which we incrementally append 
high-quality changes. So why don't we target that from the very beginning?
   
   
   My proposal is really to keep only the minimal code required to demonstrate 
the client-server architecture, where the client can issue a simple operation.
   
   
   This is not blocking this PR, and I am open to discussion on this concern. As 
long as the Spark community is OK with the big code dump, it's good to go. I just 
want to make sure these perspectives are covered so that we can make good 
decisions based on them.
