anishshri-db commented on code in PR #47625:
URL: https://github.com/apache/spark/pull/47625#discussion_r1707446767
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -264,13 +264,14 @@ The output schema will also be different from the normal
output.
</tr>
</table>
-## State metadata source
+## State Metadata Source
Before querying the state from an existing checkpoint via the state data source,
users may want to understand the information for the checkpoint, especially
about the state operators. This includes which operators and state store
instances are available in the checkpoint, the available range of batch IDs, etc.
Structured Streaming provides a data source named "State metadata source" to
provide the state-related metadata information from the checkpoint.
Note: The metadata is constructed when the streaming query runs with
Spark 4.0+. An existing checkpoint that has only been run with a lower Spark
version does not have the metadata and cannot be queried with this
metadata source. It is required to run the streaming query pointing to the
existing checkpoint on Spark 4.0+ to construct the metadata before querying.
+For operators using operator metadata format version 1, the metadata is
written once and does not change. For operators using operator metadata format
version 2, the metadata can change across batches and can be queried for a
specific batch by providing the relevant batchId.
Review Comment:
Done
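For context, a minimal sketch of querying this metadata source from an existing checkpoint; the checkpoint path is a placeholder and the source is assumed to be registered under the `state-metadata` format name:

```scala
import org.apache.spark.sql.SparkSession

// Standalone sketch: local master only so the example runs outside spark-submit.
val spark = SparkSession.builder()
  .appName("state-metadata-example")
  .master("local[*]")
  .getOrCreate()

// Read the operator metadata written alongside the checkpoint (placeholder path).
val metadata = spark.read
  .format("state-metadata")
  .load("/tmp/streaming-checkpoint")

// Shows which operators and state store instances exist in the checkpoint
// and the range of batch IDs available to the state data source.
metadata.show(truncate = false)
```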
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -344,6 +368,11 @@ Each row in the source has the following schema:
<td>int</td>
  <td>The maximum batch ID available for querying state. The value could be
invalid if the streaming query that writes to the checkpoint is still running,
as the query will commit further batches.</td>
</tr>
+<tr>
+ <td>operatorProperties</td>
+ <td>string</td>
+  <td>The list of properties used by the operator, encoded as JSON. Available only
for operators using operator metadata format version 2.</td>
Review Comment:
Done
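As a sketch of how the new column could be inspected; `operatorProperties` and `maxBatchId` come from this schema table, while `operatorName` and the `state-metadata` format name are assumptions here:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

val metadata = spark.read
  .format("state-metadata")
  .load("/tmp/streaming-checkpoint")   // placeholder checkpoint path

// operatorProperties is a JSON-encoded string describing the operator's properties;
// it is populated only for operators on operator metadata format version 2.
metadata
  .select("operatorName", "operatorProperties", "maxBatchId")
  .show(truncate = false)
```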
##########
docs/structured-streaming-state-data-source.md:
##########
@@ -310,6 +311,29 @@ Dataset<Row> df = spark
</div>
+The following options must be set for the source:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Meaning</th></tr></thead>
+<tr>
+ <td>path</td>
+ <td>string</td>
+ <td>Specify the root directory of the checkpoint location. You can either
specify the path via option("path", `path`) or load(`path`).</td>
+</tr>
+</table>
+
+The following configurations are optional:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Default</th><th>Meaning</th></tr></thead>
+<tr>
+ <td>batchId</td>
+ <td>numeric value</td>
+ <td>Last committed batch if available, else 0</td>
+  <td>Optional batch ID used to retrieve operator metadata at that batch. Only
applicable for operators using operator metadata format version 2.</td>
Review Comment:
Done
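A minimal sketch combining the required path with the optional batchId; the batch ID, checkpoint path, and `state-metadata` format name are placeholders/assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .getOrCreate()

// Retrieve operator metadata as of batch 5 (placeholder batch ID); only meaningful
// for operators using operator metadata format version 2.
val metadataAtBatch = spark.read
  .format("state-metadata")
  .option("batchId", 5L)
  .load("/tmp/streaming-checkpoint")   // equivalent to option("path", "/tmp/streaming-checkpoint")

metadataAtBatch.show(truncate = false)
```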
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]