Re: [DISCUSS] SPIP: State Data Source - Reader

Jungtaek Lim Sun, 22 Oct 2023 21:08:15 -0700

I don't see major comments as of now. Given that the thread was initiated
more than 10 days ago and I see multiple supporters, I'm going to initiate
a VOTE thread.


Please participate in the VOTE thread as well. Thanks!

On Thu, Oct 19, 2023 at 11:39 AM Jungtaek Lim <[email protected]>
wrote:

> Also, I want to bring over the comment Liang-Chi left in the SPIP doc, as
> it raises a general question that comes up for every new data source.
> Hence I want to sort it out for everyone.
>
>> As I know, the author implemented a third-party tool for querying the
>> state store as a data source a long time ago. I've suggested the tool to
>> some users before. It is a useful tool for special cases because there is
>> no other tool/feature for the purpose.
>> I think for such an effort to add a new data source, one usual question is
>> why it has to be in the Spark repo instead of being a third-party tool,
>> especially since this is not a frequently used one. Even for Structured
>> Streaming users, it is only rarely necessary to look into state store
>> content.
>
>
> I think we do not expect the data source to be used rarely. We see two
> major use cases: 1) unit tests against stateful queries, and 2) looking
> into the state during an incident to get the full context. 2) is probably
> not something users encounter frequently, so it is valid to say that part
> of the feature may not be used often. But 1) is definitely tied to daily
> work.
>
> Also, even for 2), this looks like an essential feature that has to be
> provided out of the box. Let's say this feature does not exist and a user
> encounters an incident in production with a stateful query. During RCA,
> they realize that the state is a black box and their only option is to
> deduce the value of the state indirectly, most likely requiring them to
> modify the query heavily and feed it artificial inputs. If I were such a
> user, I would consider this gap a fundamental issue of SS. Flink has
> provided this out of the box for years (State Processor), so it also
> matters from a competitive standpoint.
>
> We are seeing this effort as a stepping stone. As the comments in the SPIP
> doc and the previous replies show, people also see the proposal as
> groundwork for the writer part, which would give us a chance to break the
> strong preconception of a fixed number of shuffle partitions. I'd argue
> that this is a rather fundamental limitation of SS, and I have seen so many
> complaints about it. I don't feel it is right to delegate solving such a
> fundamental issue to a 3rd party. This is probably a stronger argument
> than the reader part.
>
> Here's another aspect: during the work, we observed gaps in checkpointing,
> e.g. the information about prefix scan does not exist in the checkpoint,
> which makes a big difference when restoring the state from the state file.
> When we come to state repartitioning, the repartition is based on the
> grouping keys in the operator (not the state key), hence we will also need
> additional information for that. If this feature goes into a 3rd party
> project, it will be very painful to make changes on both sides together.
> It also brings another headache: versioning and a compatibility matrix.
>
> I hope this helps persuade people to add this to the Spark repo rather
> than leaving it to live on its own.
>
>
> On Thu, Oct 19, 2023 at 11:08 AM Jungtaek Lim <
> [email protected]> wrote:
>
>> Thanks Raghu for your support!
>>
>> Btw, I'd like to relay the support from the JIRA ticket itself: I see
>> support from Chaoqin and Praveen. Thanks both!
>>
>>
>>
>> On Thu, Oct 19, 2023 at 5:56 AM Raghu Angadi <[email protected]>
>> wrote:
>>
>>> +1 overall and a big +1 to keeping offline state-rebalancing as a
>>> primary use case.
>>>
>>> Raghu.
>>>
>>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>>> [email protected]> wrote:
>>>
>>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>>
>>>> +1 for me. It seems like a prerequisite for further ops-related
>>>> improvements to state store management. I mean especially state
>>>> rebalancing, which could rely on this read+write state store API. I
>>>> don't mean dynamic state rebalancing, which could probably be
>>>> implemented with lower latency directly in the stateful API. Instead I'm
>>>> thinking more of an offline job to rebalance the state and later restart
>>>> the stateful pipeline with the changed number of shuffle partitions.
>>>>
>>>> Best,
>>>> Bartosz.
>>>>
>>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>>> [email protected]> wrote:
>>>>
>>>>> bump for better reach
>>>>>
>>>>> On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Sorry, please use this link instead for SPIP doc:
>>>>>> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi dev,
>>>>>>>
>>>>>>> I'd like to start a discussion on "State Data Source - Reader".
>>>>>>>
>>>>>>> This proposal aims to introduce a new data source "statestore" which
>>>>>>> enables reading the state rows from existing checkpoint via offline
>>>>>>> (batch) query. This will enable users to 1) create unit tests against
>>>>>>> stateful query verifying the state value (especially
>>>>>>> flatMapGroupsWithState), 2) gather more context on the status when an
>>>>>>> incident occurs, especially for incorrect output.
>>>>>>>
>>>>>>> *SPIP*:
>>>>>>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>>>>>>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>>>>>>
>>>>>>> Looking forward to your feedback!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>
>>>>>>> ps. The scope of the project is narrowed to the reader in this SPIP,
>>>>>>> since the writer requires us to consider more cases. We are planning
>>>>>>> on it.
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Bartosz Konieczny
>>>> freelance data engineer
>>>> https://www.waitingforcode.com
>>>> https://github.com/bartosz25/
>>>> https://twitter.com/waitingforcode
>>>>
>>>>
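
To illustrate the offline rebalancing idea raised in the thread (read the
state rows back, then rewrite them under a different number of shuffle
partitions), here is a minimal Python sketch. It does not use Spark or the
proposed data source. The function names, the sample rows, and the CRC32
hash are all assumptions chosen for illustration; they stand in for however
state rows are really assigned to partitions from the grouping key.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a grouping key to a partition id.

    CRC32 is only an illustrative, deterministic stand-in for the hash
    partitioning a real engine would apply to the grouping key.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_state(rows, num_partitions):
    """Place (key, value) state rows into partitions by hashing the key."""
    partitions = defaultdict(dict)
    for key, value in rows:
        partitions[partition_for(key, num_partitions)][key] = value
    return dict(partitions)


def rebalance(state, new_num_partitions):
    """Offline rebalance: read every state row, then rewrite all rows
    under the new partition count."""
    rows = [(k, v) for part in state.values() for k, v in part.items()]
    return build_state(rows, new_num_partitions)


# Hypothetical state: one running value per user key.
rows = [(f"user-{i}", i * 10) for i in range(20)]

old_state = build_state(rows, 4)      # state written with 4 partitions
new_state = rebalance(old_state, 8)   # same rows, laid out for 8 partitions

# Keys whose partition id changes cannot be found by a query that simply
# restarts with a different partition count, which is why a rewrite is needed.
moved = [k for k, _ in rows if partition_for(k, 4) != partition_for(k, 8)]
print(f"{len(moved)} of {len(rows)} keys change partition")
```

The sketch only shows why the partition count is baked into the stored
layout: the partition id is derived from the key and the count together, so
changing the count means every row has to be read and placed again.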
