I don't see major comments as of now. Given that the thread was initiated more than 10 days ago and I see multiple supporters, I'm going to initiate a VOTE thread.
Please participate in the VOTE thread as well. Thanks!

On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Also, I want to replicate the comment Liang-Chi put into the SPIP doc, as it
> is a rather general and usual question for every new addition of a data
> source. Hence I want to sort it out for everyone.
>
>> As I know, the author implemented a third-party tool for querying the state store
>> as a data source a long time ago. I've suggested the tool to some users
>> before. It is a useful tool for special cases because there is no other
>> tool/feature for the purpose.
>> I think for such an effort to add a new data source, one usual question is why
>> it has to be in the Spark repo instead of being a third-party tool. Especially
>> as this is not a frequently used one. Even for structured streaming users, only
>> in rare cases is it necessary to look into state store content.
>
> I think we do not expect the data source to be used rarely. We see two
> different major use cases: 1) unit tests against a stateful query, 2) looking
> into the state during an incident to get full context. 2) is probably not
> something users encounter that frequently, hence it is valid to say the
> new feature may not be used frequently for that purpose. But 1) is definitely
> something we can say is tied to daily work.
>
> Also, even for 2), it looks to be an essential feature and has to be provided
> out-of-the-box. Let's say this feature does not exist and a user
> encounters an incident in production with a stateful query. During RCA,
> they realize that state is a black box and their only option is deducing
> the value of the state indirectly, most likely requiring them to modify
> the query heavily and feed it artificial inputs. If I were such a user, I would
> consider this lack a fundamental issue of SS. It has been out-of-the-box
> in Flink for years (State Processor), so it also makes sense from a
> competitive standpoint.
>
> We are seeing this effort as a stepping stone. As we see from comments in the SPIP
> doc and also previous replies, people also see the proposal as prior work
> for the writer part, with which we would have a chance to break the strong
> preconception of a fixed number of shuffle partitions. I'd argue that this
> is a rather fundamental limitation of SS and I have seen so many complaints
> about this. I don't feel like it is right to delegate to a 3rd party to
> solve the fundamental issue. This is probably stronger evidence than the
> reader part.
>
> Here's another aspect: during the work, we observed the lacking parts in
> checkpointing, e.g. the information about prefix scan does not exist in the
> checkpoint, which makes a big difference when restoring the state from the
> state file. When we come to state repartitioning, the repartition is
> based on the grouping keys in the operator (not the state key), hence we
> will also need additional information for that. If this feature goes into
> a 3rd party, it will be very painful to make both sides of the changes
> together. It brings up another headache: versioning and a compatibility
> matrix.
>
> I hope this helps persuade people to add this to the Spark repo
> rather than leaving it to live on its own.
>
>
> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks Raghu for your support!
>>
>> Btw, I'd like to replicate the support from the JIRA ticket itself: I see
>> support from Chaoqin and Praveen. Thanks both!
>>
>>
>>
>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <raghu.ang...@databricks.com>
>> wrote:
>>
>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>> primary use case.
>>>
>>> Raghu.
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> bartkoniec...@gmail.com> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements for the state store management. I mean here especially the
>>>> state rebalancing that could rely on this read+write state store API. I
>>>> don't mean here the dynamic state rebalancing that could probably be
>>>> implemented with a lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>
>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump for better reach
>>>>>
>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, please use this link instead for the SPIP doc:
>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>
>>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>>> enables reading the state rows from an existing checkpoint via an offline
>>>>>>> (batch) query. This will enable users to 1) create unit tests against a
>>>>>>> stateful query verifying the state value (especially
>>>>>>> flatMapGroupsWithState), 2) gather more context on the status when an
>>>>>>> incident occurs, especially for incorrect output.
>>>>>>>
>>>>>>> *SPIP*:
>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>
>>>>>>> Looking forward to your feedback!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>>>> since the writer requires us to consider more cases. We are planning on
>>>>>>> it.
>>>>>>>
>>>>
>>>> --
>>>> Bartosz Konieczny
>>>> freelance data engineer
>>>> https://www.waitingforcode.com
>>>> https://github.com/bartosz25/
>>>> https://twitter.com/waitingforcode