[GitHub] [spark] rahulsmahadev commented on a change in pull request #33336: [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState

GitBox Tue, 20 Jul 2021 16:27:29 -0700


rahulsmahadev commented on a change in pull request #33336:
URL: https://github.com/apache/spark/pull/33336#discussion_r670821139




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FlatMapGroupsWithStateExec.scala
##########
@@ -404,3 +402,72 @@ case class FlatMapGroupsWithStateExec(
     copy(child = newLeft, initialState = newRight)
 }
 
+object FlatMapGroupsWithStateExec {
+
+  def foundDuplicateInitialKeyException(): Exception = {
+    throw new IllegalArgumentException("The initial state provided contained " 
+
+      "multiple rows(state) with the same key. Make sure to de-duplicate the " 
+
+      "initial state before passing it.")
+  }
+
+  /**
+   * Special handling for when the child relation is a batch relation.
+   * If the initial state is provided, we create an instance of the 
CoGroupExec, if the initial
+   * state is not provided we create an instance of the MapGroupsExec
+   */
+  // scalastyle:off argcount
+  def forBatch(
+      userFunc: (Any, Iterator[Any], LogicalGroupState[Any]) => Iterator[Any],
+      keyDeserializer: Expression,
+      valueDeserializer: Expression,
+      initialStateDeserializer: Expression,
+      groupingAttributes: Seq[Attribute],
+      initialStateGroupAttrs: Seq[Attribute],
+      dataAttributes: Seq[Attribute],
+      initialStateDataAttrs: Seq[Attribute],
+      outputObjAttr: Attribute,
+      timeoutConf: GroupStateTimeout,
+      hasInitialState: Boolean,
+      initialState: SparkPlan,
+      child: SparkPlan): SparkPlan = {
+    if (hasInitialState) {
+      val watermarkPresent = child.output.exists {
+        case a: Attribute if a.metadata.contains(EventTimeWatermark.delayKey) 
=> true
+        case _ => false
+      }
+      val func = (keyRow: Any, values: Iterator[Any], states: Iterator[Any]) 
=> {
+        // Check if there is only one state for every key.
+        var foundInitialStateForKey = false
+        val optionalState = states.map { initialState =>
+          if (foundInitialStateForKey) {
+            foundDuplicateInitialKeyException()
+          }
+          foundInitialStateForKey = true
+          initialState
+        }.toSeq
+
+        // Create group state object
+        val groupState = GroupStateImpl.createForStreaming(

Review comment:
       createForStreaming is a misnomer actually. it is just create a 
groupStateImpl object using the parameters 
https://livegrep.dev.databricks.com/view/databricks/runtime/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/GroupStateImpl.scala#L202
 
   
   createForbatch is infact creating groupStateImpl without the state
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] rahulsmahadev commented on a change in pull request #33336: [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState

Reply via email to