xuanyuanking commented on a change in pull request #29256:
URL: https://github.com/apache/spark/pull/29256#discussion_r471988853
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
##########
@@ -1106,6 +1107,54 @@ class StreamingQuerySuite extends StreamTest with
BeforeAndAfter with Logging wi
     }
   }
+  test("union in streaming query of append mode without watermark") {
+    val inputData1 = MemoryStream[Int]
+    val inputData2 = MemoryStream[Int]
+    withTempView("s1", "s2") {
+      inputData1.toDF().createOrReplaceTempView("s1")
+      inputData2.toDF().createOrReplaceTempView("s2")
+      val unioned = spark.sql(
+        "select s1.value from s1 union select s2.value from s2")
+      checkExceptionMessage(unioned)
+    }
+  }
+
+  test("distinct in streaming query of append mode without watermark") {
+    val inputData = MemoryStream[Int]
+    withTempView("deduptest") {
+      inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+      val distinct = spark.sql("select distinct value from deduptest")
+      checkExceptionMessage(distinct)
+    }
+  }
+
+  test("distinct in streaming query of complete mode") {
+    val inputData = MemoryStream[Int]
+    withTempView("deduptest") {
+      inputData.toDF().toDF("value").createOrReplaceTempView("deduptest")
+      val distinct = spark.sql("select distinct value from deduptest")
+
+      testStream(distinct, Complete)(
+        AddData(inputData, 1, 2, 3, 3, 4),
+        CheckAnswer(Row(1), Row(2), Row(3), Row(4))
Review comment:
```
Are you sure that can be interpreted as same for end users? Is there any
mention of Dataset API vs SQL clause in the statement?
```
Let me clarify. I'm not sure either, so I agree we should add more documentation.
What I'm suggesting is that this documentation is not actually streaming-specific;
it concerns the inconsistency between SQL and the Dataset API.
In the Structured Streaming guide, I think `distinct` refers only to
Dataset.distinct, not the SQL DISTINCT clause.
```
That's clearly described in the doc. The thing is that SQL UNION cannot be
done in streaming, as according to the doc, SQL UNION is Dataset.union +
distinct but distinct is not supported.
```
Yep, so I think we are on the same page. That's not clear enough, right? The
best way is to tell the end user directly that SQL UNION is not supported in
streaming because it relies on distinct. The root cause is still the
inconsistency between SQL and the Dataset API.
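To make the mismatch concrete, here is a minimal batch-mode sketch (not part of this PR; the object name `UnionVsSqlUnion` and the sample data are made up for illustration, and it assumes a local Spark session is available). It only shows that Dataset.union keeps duplicates while SQL UNION removes them, which is why SQL UNION depends on distinct:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical illustration: the object name and sample data are assumptions.
object UnionVsSqlUnion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-vs-sql-union")
      .getOrCreate()
    import spark.implicits._

    val s1 = Seq(1, 2, 2).toDF("value")
    val s2 = Seq(2, 3).toDF("value")
    s1.createOrReplaceTempView("s1")
    s2.createOrReplaceTempView("s2")

    // Dataset.union behaves like SQL UNION ALL: duplicates are kept.
    val datasetUnion = s1.union(s2)
    // SQL UNION removes duplicates, i.e. Dataset.union followed by distinct.
    val sqlUnion = spark.sql("select value from s1 union select value from s2")

    assert(datasetUnion.count() == 5)
    assert(sqlUnion.count() == 3)
    assert(datasetUnion.distinct().count() == sqlUnion.count())

    spark.stop()
  }
}
```

This is batch only; it says nothing about which streaming output modes accept the deduplication step.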
```
So this is another one being enabled as side effects. You can't do that
with Dataset API.
```
That's why my suggestion is to make distinct and drop duplicates behave the
same way.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]