anishshri-db commented on code in PR #47625:
URL: https://github.com/apache/spark/pull/47625#discussion_r1707446767
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -264,13 +264,14 @@ The output schema will also be different from the normal
output.
</tr>
</table>
-## State metadata source
+## State Metadata Source
Before querying the state from an existing checkpoint via the state data source,
users may want to understand the information for the checkpoint, especially
about the state operators. This includes which operators and state store
instances are available in the checkpoint, the available range of batch IDs, etc.
Structured Streaming provides a data source named "State metadata source" to
provide the state-related metadata information from the checkpoint.
Note: The metadata is constructed when the streaming query runs with
Spark 4.0+. An existing checkpoint that has only been run with a lower Spark
version does not have the metadata and cannot be queried with this
metadata source. It is required to run the streaming query pointing to the
existing checkpoint on Spark 4.0+ to construct the metadata before querying.
+For operators using operator metadata format version 1, the metadata is
written once and does not change. For operators using operator metadata format
version 2, the metadata can change across batches and can be queried for a
specific batch by providing the relevant batchId.
Review Comment:
Done
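For context, a minimal sketch of querying this metadata source from an existing checkpoint; the checkpoint path is a placeholder and the source is assumed to be registered under the `state-metadata` format name:

```scala
import org.apache.spark.sql.SparkSession

// Standalone sketch: local master only so the example runs outside spark-submit.
val spark = SparkSession.builder()
  .appName("state-metadata-example")
  .master("local[*]")
  .getOrCreate()

// Read the operator metadata written alongside the checkpoint (placeholder path).
val metadata = spark.read
  .format("state-metadata")
  .load("/tmp/streaming-checkpoint")

// Shows which operators and state store instances exist in the checkpoint
// and the range of batch IDs available to the state data source.
metadata.show(truncate = false)
```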
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -344,6 +368,11 @@ Each row in the source has the following schema:
<td>int</td>
  <td>The maximum batch ID available for querying state. The value could be
invalid if the streaming query that writes to the checkpoint is still running,
as the query will commit further batches.</td>
</tr>
+<tr>
+ <td>operatorProperties</td>
+ <td>string</td>
+  <td>The list of properties used by the operator, encoded as JSON. Available only
for operators using operator metadata format version 2.</td>
Review Comment:
Done
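As a sketch of how the new column could be inspected; `operatorProperties` and `maxBatchId` come from this schema table, while `operatorName` and the `state-metadata` format name are assumptions here:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

val metadata = spark.read
  .format("state-metadata")
  .load("/tmp/streaming-checkpoint")   // placeholder checkpoint path

// operatorProperties is a JSON-encoded string describing the operator's properties;
// it is populated only for operators on operator metadata format version 2.
metadata
  .select("operatorName", "operatorProperties", "maxBatchId")
  .show(truncate = false)
```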
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -310,6 +311,29 @@ Dataset<Row> df = spark
</div>
+The following options must be set for the source:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Meaning</th></tr></thead>
+<tr>
+ <td>path</td>
+ <td>string</td>
+ <td>Specify the root directory of the checkpoint location. You can either
specify the path via option("path", `path`) or load(`path`).</td>
+</tr>
+</table>
+
+The following configurations are optional:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Default</th><th>Meaning</th></tr></thead>
+<tr>
+ <td>batchId</td>
+ <td>numeric value</td>
+ <td>Last committed batch if available, else 0</td>
+  <td>Optional batch ID used to retrieve operator metadata at that batch. Only
applicable for operators using operator metadata format version 2.</td>
Review Comment:
Done
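A minimal sketch combining the required path with the optional batchId; the batch ID, checkpoint path, and `state-metadata` format name are placeholders/assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

// Retrieve operator metadata as of batch 5 (placeholder batch ID); only meaningful
// for operators using operator metadata format version 2.
val metadataAtBatch = spark.read
  .format("state-metadata")
  .option("batchId", 5L)
  .load("/tmp/streaming-checkpoint")   // equivalent to option("path", "/tmp/streaming-checkpoint")

metadataAtBatch.show(truncate = false)
```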
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]