tdas commented on a change in pull request #33093:
URL: https://github.com/apache/spark/pull/33093#discussion_r662348054



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/FlatMapGroupsWithStateSuite.scala
##########
@@ -1268,12 +1269,298 @@ class FlatMapGroupsWithStateSuite extends StateStoreMetricsTest {
     assert(e.getMessage === "The output mode of function should be append or update")
   }
 
+  import testImplicits._
+
+  /**
+   * FlatMapGroupsWithState function that returns the key, value as passed to it
+   * along with the updated state. The state is incremented for every value.
+   */
+  val flatMapGroupsWithStateFunc =
+    (key: String, values: Iterator[String], state: GroupState[RunningCount]) => {
+      val valList = values.toSeq
+      val count = state.getOption.map(_.count).getOrElse(0L) + valList.size
+      state.update(new RunningCount(count))
+      Iterator((key, valList, state.get.count.toString))
+    }
+
+  Seq("1", "2", "6").foreach { shufflePartitions =>
+    testWithAllStateVersions(s"flatMapGroupsWithState - initial " +
+        s"state - all cases - shuffle partitions ${shufflePartitions}") {
+      withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> shufflePartitions) {
+        // We will test them on different shuffle partition configuration to make sure the
+        // grouping by key will still work. On higher number of shuffle partitions its possible
+        // that all keys end up on different partitions.
+        val initialState: Dataset[(String, RunningCount)] = Seq(
+          ("keyInStateAndData-1", new RunningCount(1)),
+          ("keyInStateAndData-2", new RunningCount(1)),
+          ("keyOnlyInState-1", new RunningCount(2)),
+          ("keyOnlyInState-2", new RunningCount(1))
+        ).toDS()
+
+        val it = initialState.groupByKey(x => x._1).mapValues(_._2)
+        val inputData = MemoryStream[String]
+        val result =
+          inputData.toDS()
+            .groupByKey(x => x)
+            .flatMapGroupsWithState(
+              Update, GroupStateTimeout.NoTimeout, it)(flatMapGroupsWithStateFunc)
+
+        testStream(result, Update)(
+          AddData(inputData, "keyOnlyInData", "keyInStateAndData-1"),
+          CheckNewAnswer(
+            ("keyOnlyInState-1", Seq[String](), "2"),
+            ("keyOnlyInState-2", Seq[String](), "1"),
+            ("keyInStateAndData-1", Seq[String]("keyInStateAndData-1"), "2"), // inc by 1
+            ("keyInStateAndData-2", Seq[String](), "1"),
+            ("keyOnlyInData", Seq[String]("keyOnlyInData"), "1") // inc by 1
+          ),
+          assertNumStateRows(total = 5, updated = 5),

Review comment:
       You are not testing whether the initial group state is actually being saved. You could have just created the input GroupState object with the initial state without saving it to the state store, and this test would still pass. So you need to run another batch to retrieve and test the saved state.
   
   Furthermore, you need to explicitly test whether the initial state is saved to the state store even if you don't call `GroupState.update()`. Right now your test function always calls update, so even if you incorrectly did not save the initial state to the state store, the update will always make sure the state store is updated. So you need to test more cases, with more keys.
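
   To illustrate both points, here is a rough, self-contained sketch in plain Scala. It is not Spark code: `ToyEngine`, `ToyGroupState`, `readOnlyFunc`, `updatingFunc`, and `twoBatches` are hypothetical names invented for this illustration, and the `persistInitial` flag merely stands in for "the implementation saves the initial state to the state store". Under those assumptions, it shows that a single batch cannot tell a correct implementation from a buggy one, and that a function which always calls update hides the bug even across two batches.

```scala
import scala.collection.mutable

// Toy model only; none of these types are Spark APIs.
object InitialStateSketch {
  final case class RunningCount(count: Long)

  // Minimal handle in the spirit of GroupState: read the state, optionally update it.
  final class ToyGroupState(initial: Option[RunningCount]) {
    private var current: Option[RunningCount] = initial
    private var updated = false
    def getOption: Option[RunningCount] = current
    def update(s: RunningCount): Unit = { current = Some(s); updated = true }
    def wasUpdated: Boolean = updated
  }

  type Func = (String, Seq[String], ToyGroupState) => (String, String)

  // Reads the state but never calls update().
  val readOnlyFunc: Func = (key, values, state) =>
    (key, (state.getOption.map(_.count).getOrElse(0L) + values.size).toString)

  // Always calls update(), like the function in the test under review.
  val updatingFunc: Func = (key, values, state) => {
    val count = state.getOption.map(_.count).getOrElse(0L) + values.size
    state.update(RunningCount(count))
    (key, count.toString)
  }

  // persistInitial = false models the bug: the initial state is handed to the
  // function but only reaches the store if the function calls update().
  final class ToyEngine(persistInitial: Boolean) {
    val store = mutable.Map.empty[String, RunningCount]

    def runBatch(
        data: Seq[String],
        initial: Map[String, RunningCount],
        func: Func): Set[(String, String)] = {
      if (persistInitial) store ++= initial
      val grouped = data.groupBy(identity)
      (grouped.keySet ++ initial.keySet).map { key =>
        val state = new ToyGroupState(store.get(key).orElse(initial.get(key)))
        val out = func(key, grouped.getOrElse(key, Seq.empty), state)
        if (state.wasUpdated) state.getOption.foreach(s => store(key) = s)
        out
      }
    }
  }

  // Batch 1 carries the initial state; batch 2 touches a key that existed only there.
  def twoBatches(
      persistInitial: Boolean,
      func: Func): (Set[(String, String)], Set[(String, String)]) = {
    val engine = new ToyEngine(persistInitial)
    val first = engine.runBatch(
      Seq("keyOnlyInData"), Map("keyOnlyInState" -> RunningCount(2)), func)
    val second = engine.runBatch(Seq("keyOnlyInState"), Map.empty, func)
    (first, second)
  }
}
```

   In this toy model, `twoBatches(true, readOnlyFunc)` and `twoBatches(false, readOnlyFunc)` agree on the first batch and differ only on the second, while with `updatingFunc` the two engines are indistinguishable in both batches. That is why the test needs a second batch and a function that does not always call update.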




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


