[GitHub] [spark] HeartSaVioR commented on a change in pull request #29256: [SPARK-32456][SS] Check the Distinct by assuming it as Aggregate for Structured Streaming

GitBox Wed, 12 Aug 2020 19:48:15 -0700


HeartSaVioR commented on a change in pull request #29256:
URL: https://github.com/apache/spark/pull/29256#discussion_r469665058




##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
##########
@@ -1106,6 +1107,54 @@ class StreamingQuerySuite extends StreamTest with 
BeforeAndAfter with Logging wi
     }
   }
 
+  test("union in streaming query of append mode without watermark") {
+    val inputData1 = MemoryStream[Int]
+    val inputData2 = MemoryStream[Int]
+    withTempView("s1", "s2") {
+      inputData1.toDF().createOrReplaceTempView("s1")
+      inputData2.toDF().createOrReplaceTempView("s2")
+      val unioned = spark.sql(
+        "select s1.value from s1 union select s2.value from s2")
+      checkExceptionMessage(unioned)
+    }
+  }
+
+  test("distinct in streaming query of append mode without watermark") {
+    val inputData = MemoryStream[Int]
+    withTempView("deduptest") {
+      inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+      val distinct = spark.sql("select distinct value from deduptest")
+      checkExceptionMessage(distinct)
+    }
+  }
+
+  test("distinct in streaming query of complete mode") {
+    val inputData = MemoryStream[Int]
+    withTempView("deduptest") {
+      inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+      val distinct = spark.sql("select distinct value from deduptest")
+
+      testStream(distinct, Complete)(
+        AddData(inputData, 1, 2, 3, 3, 4),
+        CheckAnswer(Row(1), Row(2), Row(3), Row(4))

Review comment:
       I'd be more puzzled we differentiate the behavior of streaming query via 
DF vs SQL. (I never use SQL for SS query though.) Do we think end users can 
make correct assumption on differentiating twos? Otherwise do we have any guide 
or note on the doc?
   
   After the fix, end users have to infer union & distinct triggers grouping 
aggregation and hence result table, even without explicit group by. I agree 
this patch fixes the thing which didn't work, but brings confusion on reasoning 
about "implicit behavior" which depends on the implementation details for 
Spark. Probably we should make it clear "grouped aggregation" includes implicit 
aggregation in the guide doc.
   
   And I'm wondering whether we track a list of "known issues" and any plan to 
fix them. The problems on update/complete mode are already pointed out during 
couple of Spark 3.0.0 related issues, and I don't see any plan on addressing 
it. I even proposed to replace complete mode with viable alternatives but 
brought the debates and no productive discussion happened. I don't think being 
conservative means we live with known problems. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR commented on a change in pull request #29256: [SPARK-32456][SS] Check the Distinct by assuming it as Aggregate for Structured Streaming

Reply via email to