Re: [PR] [SPARK-48726] Create the StateSchemaV3 file format, and write this out for the TransformWithStateExec operator [spark]

via GitHub Wed, 03 Jul 2024 08:51:25 -0700


ericm-db commented on code in PR #47104:
URL: https://github.com/apache/spark/pull/47104#discussion_r1664418795



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala:
##########
@@ -187,23 +187,33 @@ class IncrementalExecution(
     }
   }
 
-  object WriteStatefulOperatorMetadataRule extends SparkPlanPartialRule {
+  // Planning rule used to record the state schema for the first run and validate state schema
+  // changes across query runs.
+  object StateSchemaAndOperatorMetadataRule extends SparkPlanPartialRule {
     override val rule: PartialFunction[SparkPlan, SparkPlan] = {
+      // In the case of TransformWithStateExec, we want to collect this StateSchema
+      // filepath, and write this path out in the OperatorStateMetadata file
       case stateStoreWriter: StateStoreWriter if isFirstBatch =>
+        val stateSchemaVersion = stateStoreWriter match {
+          case _: TransformWithStateExec => sparkSession.sessionState.conf.
+            getConf(SQLConf.STREAMING_TRANSFORM_WITH_STATE_OP_STATE_SCHEMA_VERSION)
+          case _ => 2
+        }
+        val stateSchemaPaths =
+          stateStoreWriter.validateAndMaybeEvolveStateSchema(
+            hadoopConf,
+            currentBatchId,
+            stateSchemaVersion)
+        // write out the state schema paths to the metadata file
         val metadata = stateStoreWriter.operatorStateMetadata()
+        // TODO: Populate metadata with stateSchemaPaths if metadata version is v2
         val metadataWriter = new OperatorStateMetadataWriter(new Path(
           checkpointLocation, stateStoreWriter.getStateInfo.operatorId.toString), hadoopConf)
         metadataWriter.write(metadata)
         stateStoreWriter
-    }
-  }
-
-  // Planning rule used to record the state schema for the first run and validate state schema
-  // changes across query runs.
-  object StateSchemaValidationRule extends SparkPlanPartialRule {
-    override val rule: PartialFunction[SparkPlan, SparkPlan] = {
       case statefulOp: StatefulOperator if isFirstBatch =>
-        statefulOp.validateAndMaybeEvolveStateSchema(hadoopConf)
+        statefulOp.
+          validateAndMaybeEvolveStateSchema(hadoopConf, currentBatchId, stateSchemaVersion = 2)

Review Comment:
   Sorry, maybe I just didn't get it, but that's what we're doing [here](https://github.com/apache/spark/pull/47104/files#diff-daf798339ed682fd96e86a81b7b89ed220f1abc3bac270ddf919335b6bd2583dR206), right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


