HeartSaVioR commented on a change in pull request #29256:
URL: https://github.com/apache/spark/pull/29256#discussion_r469665058
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
##########
@@ -1106,6 +1107,54 @@ class StreamingQuerySuite extends StreamTest with
BeforeAndAfter with Logging wi
}
}
+ test("union in streaming query of append mode without watermark") {
+ val inputData1 = MemoryStream[Int]
+ val inputData2 = MemoryStream[Int]
+ withTempView("s1", "s2") {
+ inputData1.toDF().createOrReplaceTempView("s1")
+ inputData2.toDF().createOrReplaceTempView("s2")
+ val unioned = spark.sql(
+ "select s1.value from s1 union select s2.value from s2")
+ checkExceptionMessage(unioned)
+ }
+ }
+
+ test("distinct in streaming query of append mode without watermark") {
+ val inputData = MemoryStream[Int]
+ withTempView("deduptest") {
+ inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+ val distinct = spark.sql("select distinct value from deduptest")
+ checkExceptionMessage(distinct)
+ }
+ }
+
+ test("distinct in streaming query of complete mode") {
+ val inputData = MemoryStream[Int]
+ withTempView("deduptest") {
+ inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+ val distinct = spark.sql("select distinct value from deduptest")
+
+ testStream(distinct, Complete)(
+ AddData(inputData, 1, 2, 3, 3, 4),
+ CheckAnswer(Row(1), Row(2), Row(3), Row(4))
Review comment:
I'd be more puzzled we differentiate the behavior of streaming query via
DF vs SQL. (I never use SQL for SS query though.) Do we think end users can
make correct assumption on differentiating twos? Otherwise do we have any guide
or note on the doc?
After the fix, end users have to infer union & distinct triggers grouping
aggregation and hence result table, even without explicit group by. I agree
this patch fixes the thing which didn't work, but brings confusion on reasoning
about "implicit behavior" which depends on the implementation details for
Spark. Probably we should make it clear "grouped aggregation" includes implicit
aggregation in the guide doc.
And I'm wondering whether we track a list of "known issues" and any plan to
fix them. The problems on update/complete mode are already pointed out during
couple of Spark 3.0.0 related issues, and I don't see any plan on addressing
it. I even proposed to replace complete mode with viable alternatives but
brought the debates and no productive discussion happened. I don't think being
conservative means we live with known problems.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]