HeartSaVioR opened a new pull request, #43425:
URL: https://github.com/apache/spark/pull/43425
### What changes were proposed in this pull request?
This PR proposes to introduce a baseline implementation of the state processor - reader.
The state processor is a new data source that enables reading and writing the state in an existing checkpoint with a batch query. Since the feature is implemented as a data source, it leverages the DataFrame API UX that most users are already familiar with.
The functionalities of the baseline implementation are as follows:
* Specify a state store instance via store name (default: DEFAULT)
* Specify a stateful operator via operator ID (default: 0)
* Specify a batch ID (default: last committed)
* Specify the source option joinSide to construct input rows in the state store for stream-stream joins
  * Users can still read a specific state store instance out of the 4 instances used in a stream-stream join, which would mostly be useful for debugging Spark itself
  * When joinSide is specified, the data source hides the internal columns from the output.
* Specify a metadata column (_partition_id) so that users can identify the partition ID for each state row.
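To illustrate how the options above compose through the DataFrame reader API, here is a minimal sketch. The format name (`statestore`), the exact option names, the checkpoint path, and the output column names are assumptions for illustration based on the description in this PR, not a confirmed API.

```python
# Hedged sketch: reading state rows from an existing streaming checkpoint
# with the proposed data source via the familiar DataFrame reader API.
# The format name, option names, and path below are illustrative assumptions.

CHECKPOINT_PATH = "/tmp/streaming-query/checkpoint"  # hypothetical path

# Options mirroring the functionalities listed above.
STATE_READ_OPTIONS = {
    "operatorId": "0",       # stateful operator ID (default: 0)
    "batchId": "5",          # batch ID to read (default: last committed)
    "storeName": "default",  # state store instance name
}

def read_state(spark, checkpoint_path, options):
    """Build a batch DataFrame over the state stored in the checkpoint."""
    reader = spark.read.format("statestore")
    for key, value in options.items():
        reader = reader.option(key, value)
    df = reader.load(checkpoint_path)
    # Select the metadata column _partition_id alongside the state row,
    # so each row can be mapped back to its partition.
    return df.select("key", "value", "_partition_id")
```

For a stream-stream join, one would additionally pass the joinSide option (e.g. `"left"` or `"right"`) instead of a store name, per the bullet above.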
### Why are the changes needed?
Please refer to the SPIP doc for rationale:
https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
### Does this PR introduce _any_ user-facing change?
Yes, we are adding a new data source.
### How was this patch tested?
New test suite.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]