Re: queryable state & streaming

Stavros Kontopoulos Mon, 18 Mar 2019 12:22:39 -0700

Not really. If we agree that we want this, I can put together a design
document and take it from there. There was also a discussion in another
thread about adding RocksDB as memory storage, which is related to this task.


Best,
Stavros

On Sun, Mar 17, 2019 at 4:42 AM kant kodali <kanth...@gmail.com> wrote:

> Any update on this?
>
> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <ar...@apache.org> wrote:
>
>> I don't think a separate API or RPCs are necessary for queryable
>> state if the state can be exposed as just another data source. SQL
>> queries can then be issued against it just like against any other
>> data source.
>>
>> For now I think the "memory" sink could be used, with queries run
>> against it, but I agree it does not scale for large states.
>>
>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> It doesn't seem Spark has any workaround other than storing output in
>>> external storage, so +1 on having this.
>>>
>>> My major concern with implementing queryable state in Structured Streaming
>>> is: "Are all states available on executors at any time while the query is
>>> running?" Querying state shouldn't affect the running query. Given that
>>> state is huge and the default state provider loads state into memory, we may
>>> not want to load one more redundant snapshot of state: we always want to
>>> read the "current state" that the query is also using. (For sure, queryable
>>> state should be read-only.)
>>>
>>> Regarding improvement of local state, I guess it is ideal to leverage
>>> an embedded DB, as Kafka and Flink do. The difference will be not
>>> only reading state from non-heap memory, but also how to take a snapshot
>>> and store the delta. We may want to check that snapshotting works well
>>> with a small batch interval, and find an alternative approach when it
>>> doesn't. This sounds like a huge item that can be handled separately.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos
>>> <st.kontopou...@gmail.com> wrote:
>>>
>>>> Nice, I was looking for a JIRA. I agree we should justify why we are
>>>> building something. In that direction, here is what I have seen in my
>>>> experience.
>>>> People quite often use state within their streaming app and may have
>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>> Cassandra, for example, for serving) is an advantage, at least in terms
>>>> of latency and complexity.
>>>> This can be true if we take advantage of state checkpointing (locally it
>>>> could be RocksDB, or in general HDFS; the latter is currently supported)
>>>> along with an API to efficiently query data.
>>>> Some use cases I see:
>>>>
>>>> - real-time dashboards and real-time reporting, the faster the better
>>>> - monitoring of state for operational reasons, app health, etc.
>>>> - integrating with external services via an API, e.g. making
>>>> aggregations over time windows accessible to some third-party service
>>>> within your system
>>>>
>>>> Regarding requirements, here are some of them:
>>>> - support for an API to expose state (could be done at the Spark
>>>> driver), such as REST.
>>>> - supporting dynamic allocation (not sure how it affects state
>>>> management)
>>>> - an efficient way to talk to executors to get the state (RPC?)
>>>> - making local state more efficient and more easily accessible with an
>>>> embedded DB (I don't think this is supported from what I see, but I may
>>>> be wrong).
>>>> Some people are already working with such tech, and some of it could be
>>>> re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>>
>>>> Best,
>>>> Stavros
>>>>
>>>>
>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>>
>>>>> I don't believe anyone is working on it yet.  I think the most useful
>>>>> thing is to start enumerating requirements and use cases and then we can
>>>>> talk about how to build it.
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>>>> st.kontopou...@gmail.com> wrote:
>>>>>
>>>>>> Cool. Burak, do you have a pointer? Should I take the initiative on a
>>>>>> first design document, or is Databricks working on it?
>>>>>>
>>>>>> Best,
>>>>>> Stavros
>>>>>>
>>>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Stavros,
>>>>>>>
>>>>>>> Queryable state is definitely on the roadmap! We will revamp the
>>>>>>> StateStore API a bit, and a queryable StateStore is definitely one of
>>>>>>> the things we are thinking about during that revamp.
>>>>>>>
>>>>>>> Best,
>>>>>>> Burak
>>>>>>>
>>>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
>>>>>>> st.kontopou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to re-phrase my question: Would query-able state make a
>>>>>>>> viable SPIP?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Stavros
>>>>>>>>
>>>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>>>>>> st.kontopou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Maybe this has been discussed before. Given that many
>>>>>>>>> streaming apps out there use state extensively, it could be a good
>>>>>>>>> idea to make Spark expose streaming state through an external API,
>>>>>>>>> as other systems do (Kafka Streams, Flink, etc.), in order to
>>>>>>>>> facilitate interactive queries.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Stavros
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
