Not really. If we agree that we want this, I can put together a design document and take it from there. There was also a discussion in another thread about adding RocksDB as a memory storage, which is related to this task.
Best,
Stavros

On Sun, Mar 17, 2019 at 4:42 AM kant kodali <kanth...@gmail.com> wrote:

> Any update on this?
>
> On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan <ar...@apache.org> wrote:
>
>> I don't think a separate API, RPCs, etc. are necessary for queryable
>> state if the state can be exposed as just another data source. SQL
>> queries can then be issued against it just like against
>> any other data source.
>>
>> For now I think the "memory" sink could be used as a sink, with
>> queries run against it, but I agree it does not scale for large states.
>>
>> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> It doesn't seem Spark has workarounds other than storing output in
>>> external storage, so +1 on having this.
>>>
>>> My major concern about implementing queryable state in Structured Streaming
>>> is: "Are all states available on executors at any time while the query is
>>> running?" Querying state shouldn't affect the running query. Given that
>>> state is huge and the default state provider loads state in memory, we may
>>> not want to load one more redundant snapshot of state: we want to always
>>> load the "current state" which the query is also using. (For sure, queryable state
>>> should be read-only.)
>>>
>>> Regarding improvement of local state, I guess it is ideal to leverage
>>> an embedded DB, as Kafka and Flink do. The difference will be not
>>> only reading state from off-heap, but also how to take a snapshot and store
>>> deltas. We may want to check that snapshotting works well with a small batch
>>> interval, and find an alternative approach when it doesn't. It sounds like
>>> a huge item that can be handled individually.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Sat, Dec 9, 2017 at 10:51 PM, Stavros Kontopoulos <st.kontopou...@gmail.com>
>>> wrote:
>>>
>>>> Nice, I was looking for a JIRA. So I agree we should justify why we are
>>>> building something.
>>>> In that direction, here is what I have seen from my
>>>> experience.
>>>> People quite often use state within their streaming app and may have
>>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>>> Cassandra, for example, for serving) is an advantage, at least in terms of
>>>> latency and complexity.
>>>> This can be true if we take advantage of state checkpointing (locally this could
>>>> be RocksDB, or in general HDFS; the latter is currently supported) along
>>>> with an API to efficiently query data.
>>>> Some use cases I see:
>>>>
>>>> - real-time dashboards and real-time reporting, the faster the better
>>>> - monitoring of state for operational reasons, app health, etc.
>>>> - integrating with external services via an API, e.g. making
>>>> aggregations over time windows accessible to some third-party service within your
>>>> system
>>>>
>>>> Regarding requirements, here are some of them:
>>>> - support for an API to expose state (could be done at the Spark
>>>> driver), such as REST.
>>>> - supporting dynamic allocation (not sure how it affects state
>>>> management)
>>>> - an efficient way to talk to executors to get the state (RPC?)
>>>> - making local state more efficient and more easily accessible with an
>>>> embedded DB (I don't think this is supported from what I see, but I may be wrong)
>>>> Some people are already working with such technologies and some of that work could be
>>>> re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>>
>>>> Best,
>>>> Stavros
>>>>
>>>>
>>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-16738
>>>>>
>>>>> I don't believe anyone is working on it yet. I think the most useful
>>>>> thing is to start enumerating requirements and use cases and then we can
>>>>> talk about how to build it.
>>>>>
>>>>> On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
>>>>> st.kontopou...@gmail.com> wrote:
>>>>>
>>>>>> Cool. Burak, do you have a pointer? Should I take the initiative for a
>>>>>> first design document, or is Databricks working on it?
>>>>>>
>>>>>> Best,
>>>>>> Stavros
>>>>>>
>>>>>> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz <brk...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Stavros,
>>>>>>>
>>>>>>> Queryable state is definitely on the roadmap! We will revamp the
>>>>>>> StateStore API a bit, and a queryable StateStore is definitely one of
>>>>>>> the things we are thinking about during that revamp.
>>>>>>>
>>>>>>> Best,
>>>>>>> Burak
>>>>>>>
>>>>>>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
>>>>>>> st.kontopou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just to rephrase my question: would queryable state make a
>>>>>>>> viable SPIP?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Stavros
>>>>>>>>
>>>>>>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>>>>>>> st.kontopou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Maybe this has been discussed before. Given that many
>>>>>>>>> streaming apps out there use state extensively, could it be a good idea
>>>>>>>>> to make Spark expose streaming state through an external API, as other
>>>>>>>>> systems do (Kafka Streams, Flink, etc.), in order to facilitate
>>>>>>>>> interactive queries?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Stavros
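
To make two points from the thread concrete (queryable state should be read-only and should reuse the "current state" rather than load a redundant copy, and local state can be stored as a snapshot plus deltas), here is a minimal toy sketch in plain Python. It is not Spark's StateStore API: the class `ToyStateStore` and its methods `commit`, `read_view`, and `rebuild` are hypothetical names invented for this illustration, and the snapshot interval is an arbitrary assumption.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Optional


class ToyStateStore:
    """Hypothetical toy store: one full snapshot plus per-batch deltas."""

    def __init__(self, snapshot_every: int = 3) -> None:
        self._snapshot: Dict[str, int] = {}      # last full snapshot
        self._snapshot_version = 0               # batch version of that snapshot
        self._deltas: List[Dict[str, Optional[int]]] = []  # deltas since snapshot
        self._current: Dict[str, int] = {}       # state the running query uses
        self._snapshot_every = snapshot_every    # assumed snapshot interval

    @property
    def version(self) -> int:
        """Number of batches committed so far."""
        return self._snapshot_version + len(self._deltas)

    def commit(self, delta: Dict[str, Optional[int]]) -> None:
        """Apply one micro-batch delta; a value of None deletes the key."""
        for key, value in delta.items():
            if value is None:
                self._current.pop(key, None)
            else:
                self._current[key] = value
        self._deltas.append(dict(delta))
        # Periodically fold the accumulated deltas into a new full snapshot.
        if len(self._deltas) >= self._snapshot_every:
            self._snapshot = dict(self._current)
            self._snapshot_version += len(self._deltas)
            self._deltas.clear()

    def read_view(self) -> Mapping[str, int]:
        """Read-only live view of the current state; no copy is made."""
        return MappingProxyType(self._current)

    def rebuild(self) -> Dict[str, int]:
        """Recover the state by replaying deltas on top of the last snapshot."""
        state = dict(self._snapshot)
        for delta in self._deltas:
            for key, value in delta.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state
```

A reader obtained through `read_view()` sees later commits without a second copy of the state being loaded, and any attempt to write through it raises `TypeError`. A real implementation would also have to deal with concurrency between the running query and readers, which this sketch ignores.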