+1 for an external module. On Mon, Feb 8, 2021 at 11:51 AM Cheng Su <chen...@fb.com.invalid> wrote:
> +1 for (2) adding to external module. > > I think this feature is useful and popular in practice, and option 2 is > not conflict with previous concern for dependency. > > > > Thanks, > > Cheng Su > > > > *From: *Dongjoon Hyun <dongjoon.h...@gmail.com> > *Date: *Monday, February 8, 2021 at 10:39 AM > *To: *Jacek Laskowski <ja...@japila.pl> > *Cc: *Liang-Chi Hsieh <vii...@gmail.com>, dev <dev@spark.apache.org> > *Subject: *Re: [DISCUSS] Add RocksDB StateStore > > > > Thank you, Liang-chi and all. > > > > +1 for (2) external module design because it can deliver the new feature > in a safe way. > > > > Bests, > > Dongjoon > > > > On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski <ja...@japila.pl> wrote: > > Hi, > > > > I'm "okay to add RocksDB StateStore as external module". See no reason not > to. > > > Pozdrawiam, > > Jacek Laskowski > > ---- > > https://about.me/JacekLaskowski > > "The Internals Of" Online Books <https://books.japila.pl/> > > Follow me on https://twitter.com/jaceklaskowski > > > <https://twitter.com/jaceklaskowski> > > > > > > On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh <vii...@gmail.com> wrote: > > Hi devs, > > In Spark structured streaming, we need state store for state management for > stateful operators such streaming aggregates, joins, etc. We have one and > only one state store implementation now. It is in-memory hashmap which was > backed up in HDFS complaint file system at the end of every micro-batch. > > As it basically uses in-memory map to store states, memory consumption is a > serious issue and state store size is limited by the size of the executor > memory. Moreover, state store using more memory means it may impact the > performance of task execution that requires memory too. > > Internally we see more streaming applications that requires large state in > stateful operations. For such requirements, we need a StateStore not rely > on > memory to store states. > > This seems to be also true externally as several other major streaming > frameworks already use RocksDB for state management. RocksDB is an embedded > DB and streaming engines can use it to store state instead of memory > storage. > > So seems to me, it is proven to be good choice for large state usage. But > Spark SS still lacks of a built-in state store for the requirement. > > Previously there was one attempt SPARK-28120 to add RocksDB StateStore into > Spark SS. IIUC, it was pushed back due to two concerns: extra code > maintenance cost and it introduces RocksDB dependency. > > For the first concern, as more users require to use the feature, it should > be highly used code in SS and more developers will look at it. For second > one, we propose (SPARK-34198) to add it as an external module to relieve > the > dependency concern. > > Because it was pushed back previously, I'm going to raise this discussion > to > know what people think about it now, in advance of submitting any code. > > I think there might be some possible opinions: > > 1. okay to add RocksDB StateStore into sql core module > 2. not okay for 1, but okay to add RocksDB StateStore as external module > 3. either 1 or 2 is okay > 4. not okay to add RocksDB StateStore, no matter into sql core or as > external module > > Please let us know if you have some thoughts. > > Thank you. > > Liang-Chi Hsieh > > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> YouTube Live Streams: https://www.youtube.com/user/holdenkarau