HeartSaVioR commented on code in PR #47625:
URL: https://github.com/apache/spark/pull/47625#discussion_r1706512846


##########
docs/structured-streaming-state-data-source.md:
##########
@@ -310,6 +311,29 @@ Dataset<Row> df = spark
 
 </div>
 
+The following options must be set for the source:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Meaning</th></tr></thead>
+<tr>
+  <td>path</td>
+  <td>string</td>
+  <td>Specify the root directory of the checkpoint location. You can either 
specify the path via option("path", `path`) or load(`path`).</td>
+</tr>
+</table>
+
+The following configurations are optional:
+
+<table>
+<thead><tr><th>Option</th><th>Value</th><th>Default</th><th>Meaning</th></tr></thead>
+<tr>
+  <td>batchId</td>
+  <td>numeric value</td>
+  <td>Last committed batch if available, else 0</td>
+  <td>Optional batchId used to retrieve operator metadata at that batch. Only 
applicable for operators using schema format version 2.</td>

Review Comment:
   Let's just hide the fact that it's effective only for format version 2. Users 
don't need to know about it, and the expected behavior would be the same.



##########
docs/structured-streaming-state-data-source.md:
##########
@@ -264,13 +264,14 @@ The output schema will also be different from the normal 
output.
 </tr>
 </table>
 
-## State metadata source
+## State Metadata Source
 
 Before querying the state from existing checkpoint via state data source, 
users would like to understand the information for the checkpoint, especially 
about state operator. This includes which operators and state store instances 
are available in the checkpoint, available range of batch IDs, etc.
 
 Structured Streaming provides a data source named "State metadata source" to 
provide the state-related metadata information from the checkpoint.
 
 Note: The metadata is constructed when the streaming query is running with 
Spark 4.0+. The existing checkpoint which has been running with lower Spark 
version does not have the metadata and will be unable to query/use with this 
metadata source. It is required to run the streaming query pointing the 
existing checkpoint in Spark 4.0+ to construct the metadata before querying.
+For operators using the operator metadata format version 1, the metadata is 
written once and does not change. For operators using metadata format version 
2, this metadata can change and can be queried by providing the relevant 
batchId.

Review Comment:
   We don't expect users to know about the operator metadata format, or which 
operators use version 1 and which don't.
   
   Looking at it from the other side, is there a practice users can simply 
follow without knowing the details? Shall we just guide users to always specify 
the batch ID if they want the metadata at a "point in time", regardless of 
format version? My understanding is that the outcome should be the same.
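
   A rough sketch of what that guidance could look like in the docs, assuming 
the `batchId` option added in this diff and the existing `state-metadata` 
format name. The helper name `read_state_metadata` and its parameters are 
hypothetical, for illustration only:

```python
from typing import Optional


def read_state_metadata(spark, checkpoint_path: str, batch_id: Optional[int] = None):
    """Load operator metadata from a checkpoint via the state metadata source.

    `spark` is an existing SparkSession. `batch_id` maps to the `batchId`
    option from this diff; when omitted, the source's default applies.
    """
    reader = spark.read.format("state-metadata")
    if batch_id is not None:
        # Pin the read to one batch so the result reflects a point in time,
        # without the caller needing to know the metadata format version.
        reader = reader.option("batchId", batch_id)
    return reader.load(checkpoint_path)
```

   With something like this, the docs would only need to say "pass the batch ID 
when you want the metadata as of that batch", with no mention of format 
versions.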



##########
docs/structured-streaming-state-data-source.md:
##########
@@ -344,6 +368,11 @@ Each row in the source has the following schema:
   <td>int</td>
   <td>The maximum batch ID available for querying state. The value could be 
invalid if the streaming query taking the checkpoint is running, as the query 
will commit further batches.</td>
 </tr>
+<tr>
+  <td>operatorProperties</td>
+  <td>string</td>
+  <td>List of properties used by the operator encoded as JSON. Available only 
for operators using schema format version 2.</td>

Review Comment:
   Let's not expose the format version - we can just say "operator dependent".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

