[GitHub] [spark] HeartSaVioR commented on pull request #30841: [SPARK-28191][SS] New data source - state - reader part

GitBox Thu, 31 Dec 2020 20:21:27 -0800


HeartSaVioR commented on pull request #30841:
URL: https://github.com/apache/spark/pull/30841#issuecomment-753252068



   Thanks for showing interest, @xuanyuanking !
   
   > Is it risky to open the internal state store to the end-user for 
writing/changing?
   
   Users will probably want to rewrite the state store for various reasons 
(that is where most of the promised benefits come from). For safety, though, it 
is better to write to a different directory, so that if the data source messes 
something up, they can fall back to the original. The risky case is when end 
users don't know the expected state schema and try to change it, but with a 
backup they can simply go back.
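
   To illustrate the "write to a different directory" idea, here is a minimal 
sketch in plain Python. It does not use any Spark API: the function name 
`rewrite_into_new_dir`, the `rewrite` callback, and the file names are all 
hypothetical, and the actual data source may handle this differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def rewrite_into_new_dir(
    original: Path, target: Path, rewrite: Callable[[Path], None]
) -> Path:
    """Copy `original` to `target` and apply `rewrite` to the copy only.

    The original directory is never modified, so it serves as the backup.
    """
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copytree(original, target)
    try:
        rewrite(target)  # hypothetical user-supplied state rewrite
    except Exception:
        # Discard the half-written copy; the original is still intact.
        shutil.rmtree(target, ignore_errors=True)
        raise
    return target


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    state = root / "state"
    state.mkdir()
    (state / "1.delta").write_text("old")

    def edit(directory: Path) -> None:
        (directory / "1.delta").write_text("new")

    rewritten = rewrite_into_new_dir(state, root / "state-rewritten", edit)
    print((state / "1.delta").read_text(), (rewritten / "1.delta").read_text())
```

   If the rewrite fails (for example, because of a wrong schema), the copy is 
removed and the query can keep using the original directory.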
   
   Beyond the proposal, which simply provides a reader/writer for the state 
(the "state" data source), we could even introduce a high-level API that hides 
the details of the schema and lets end users deal only with their case class / 
Java bean in the case of (flat)MapGroupsWithState. We might be able to do 
something similar for streaming aggregation and stream-stream join, though we 
would at least need to ask for the schema of the input to the stateful 
operator. (In other words, users could delegate to the API the handling of the 
difference between "the input schema of the operator" and "the actual state 
store schema".)
   
   Actually, the writer part (not the high-level API) is already implemented in 
my own repository (https://github.com/HeartSaVioR/spark-state-tools), though 
I'd like to make sure we can implement it with the DSv2 API instead of the DSv1 
API. (It is currently implemented on DSv1 because it requires SPARK-23889, 
which is not yet done and has since been broken down into multiple JIRA issues. 
I expect it to be available in Spark 3.2.0, which matches the timing for 
migrating the writer code to the DSv2 API.)
   
   Some may wonder why I am proposing this project for the Spark codebase. The 
reason is straightforward: I believe this is a missing piece (not an optional 
one) of SS that is essential to provide built-in. Having it in the codebase 
also makes it easier to justify improvements that are critical to the state 
data source, e.g. SPARK-27237.


