amaliujia commented on code in PR #38276:
URL: https://github.com/apache/spark/pull/38276#discussion_r1002848070


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -91,6 +93,37 @@ class SparkConnectPlanner(plan: proto.Relation, session: 
SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDeduplicate(rel: proto.Deduplicate): LogicalPlan = {
+    if (!rel.hasInput) {
+      throw InvalidPlanInput("Deduplicate needs a plan input")
+    }
+    if (rel.getAllColumnsAsKeys && rel.getColumnNamesCount > 0) {
+      throw InvalidPlanInput("Cannot deduplicate on both all columns and a 
subset of columns")
+    }
+    if (!rel.getAllColumnsAsKeys && rel.getColumnNamesCount == 0) {

Review Comment:
   > I mean, we don't need `rel.getAllColumnsAsKeys` condition because we can 
know that's the case when `rel.getColumnNamesCount == 0`.
   
   I want to clarify this specifically:
   
   This is one of the Connect proto API design principle: we need to 
differentiate if a field is set or not set explicitly, or put it in another 
way, every intention should be expressed explicitly.  Ultimately, this is to 
avoid ambiguity on the API surface.
   
   One example is Project. If we see a Project without anything in the project 
list, then how do we interpret that? Does the user want to indicate a `SELECT 
*`? Does the user actually generate an invalid plan. The problem now is there 
are two possibilities for a plan, and the worse part is, one possibility is a 
valid plan, another is not. This led us explicitly encode `SELECT *` into the 
proto https://github.com/apache/spark/pull/38023.
   
   So one of the reasons that we have a bool flag here is to not use 
`rel.getColumnNamesCount == 0` to infer distinct on all columns which is to 
ambiguity.
   
   This might not be great because a few more fields could bring another 
problem: what if the user set them all. In terms of ambiguity, this is not an 
issue: we know that is an invalid plan without second choice.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to