Re: [DISCUSS] Add RocksDB StateStore

2021-04-28 Thread Liang-Chi Hsieh
I am fine with having the RocksDB state store as a built-in state store.
Actually, the proposal to make it an external module was meant to avoid the
concerns raised in the previous effort.

The need to mark it as experimental doesn't necessarily mean it has to be an
external module, I think. They are two separate things. So I don't think the
risk depends much on whether it is an external module or a built-in one,
unless we make it the default state store from the beginning. If it is not
the default, and we explicitly mention that it is an experimental feature,
the risk is not very different between an external module and a built-in
one. Being built-in just makes it easier for users to try it.

That said, even though the coming RocksDB state store has been supported for
years, I think it is safer to mark it as an experimental feature at first,
when it lands in OSS Spark.

Anyway, I think it is okay to add the RocksDB state store as a built-in state
store alongside HDFSBackedStateStore.

I also feel that we could keep only RocksDB and replace LevelDB with it.
But that is another story.


Liang-Chi







--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Jungtaek Lim
I think adding the RocksDB state store to sql/core directly would be OK.
Personally, I also voted "either way is fine with me" on the RocksDB state
store implementation in the Spark ecosystem. My overall stance hasn't
changed, but I'd like to point out that the risk is now much lower than
before, given that we can leverage the Databricks RocksDB state store
implementation.

I feel there were two major reasons to add the RocksDB state store as an
external module:

1. stability

The Databricks RocksDB state store implementation has been supported for
years, so it won't require more time to incubate. We may want to review it
thoroughly to ensure the open-sourced proposal fits Apache Spark and still
retains its stability, but this is much better than the previous candidates
for adoption, which may not have been tested in production for years.

That makes me think we don't have to put it into an external module or
consider it experimental.

2. dependency

From Yuanjian's mail, the JNI library is the only dependency, which seems
fine to add by default. We already have LevelDB as one of the core
dependencies and aren't too concerned about its JNI library dependency.
Someone might even find outstanding benefits in replacing LevelDB with
RocksDB, and then RocksDB could become one of the core dependencies.



Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Yuanjian Li
Hi all,

Following the latest comments in SPARK-34198
(https://issues.apache.org/jira/browse/SPARK-34198), Databricks decided to
donate the commercial implementation of the RocksDBStateStore. Compared
with the original decision, there's only one topic we want to raise again
for discussion: can we directly add the RocksDBStateStoreProvider to the
sql/core module? This suggestion is based on the following reasons:

   1. The RocksDBStateStore aims to solve the problems of the original
      HDFSBackedStateStore, which is built-in.
   2. End users can conveniently set the config to use the new
      implementation.
   3. We can make the RocksDB one the default in the future.
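
To illustrate point 2, here is a minimal sketch of what switching providers
through configuration could look like. It assumes the existing
spark.sql.streaming.stateStore.providerClass setting is the switch and that
the new provider lands in the same package as the built-in HDFS-backed one.
The RocksDB class name is an assumption (nothing has been merged yet), and
the helper names state_store_conf and to_submit_args are hypothetical.

```python
# Config key that Structured Streaming reads to pick a state store provider.
PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"

# Package of the built-in provider; assumed to also host the new one.
STATE_PKG = "org.apache.spark.sql.execution.streaming.state"

PROVIDERS = {
    "hdfs": STATE_PKG + ".HDFSBackedStateStoreProvider",
    # Assumption: final class name and package of the donated provider.
    "rocksdb": STATE_PKG + ".RocksDBStateStoreProvider",
}


def state_store_conf(kind):
    """Return the single config entry that selects the given provider."""
    if kind not in PROVIDERS:
        raise ValueError("unknown state store kind: " + kind)
    return {PROVIDER_KEY: PROVIDERS[kind]}


def to_submit_args(conf):
    """Render a config dict as spark-submit style --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", key + "=" + conf[key]]
    return args


if __name__ == "__main__":
    # Print the flags a user would pass to opt in to the RocksDB provider.
    print(" ".join(to_submit_args(state_store_conf("rocksdb"))))
```

The point of the sketch is that opting in would be a one-line config change
per query or per session, with the HDFS-backed provider remaining the default
when nothing is set.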


As for the dependency, I also checked the rocksdbjni package we might
introduce. As a JNI package
(https://repo1.maven.org/maven2/org/rocksdb/rocksdbjni/6.2.2/rocksdbjni-6.2.2.pom),
it should not have any dependency conflicts with Apache Spark.

Any suggestions are welcome!

Best,

Yuanjian



Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Reynold Xin
Late +1



Re: [DISCUSS] Add RocksDB StateStore

2021-02-13 Thread Liang-Chi Hsieh
Hi devs,

Thanks for all the input. Overall, I think the Spark community is positive
about having the RocksDB state store as an external module. So let's go
forward in this direction and improve Structured Streaming. I will keep the
JIRA SPARK-34198 updated.

Thanks all again for the inputs and discussion.

Liang-Chi Hsieh








Re: [DISCUSS] Add RocksDB StateStore

2021-02-09 Thread Hyukjin Kwon
I mean I am okay with adding it as an external module for the extra
clarification :-)


Re: [DISCUSS] Add RocksDB StateStore

2021-02-09 Thread Hyukjin Kwon
I'm good with this too.



Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread DB Tsai
+1 to add it as an external module so people can test it out and give
feedback more easily.




-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1




Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Jungtaek Lim
+1 to adding it, whether under sql/core or as an external module.

My rationale:

* The discussion thread and voices here show strong demand for adding the
RocksDB state store out of the box.
* There is no out-of-the-box workaround for the huge state store problem.
Direct competitors among streaming frameworks have provided one for years.
* Maintenance cost is the major concern when evaluating whether to add
something, but it doesn't apply here, as contributors/committers from
various companies are willing to contribute.
* The Apache Bahir project is no longer actively maintained - the last
release was in September 2019, based on Spark 2.4.0. We can no longer
easily say "let's add it to Bahir instead".





Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Gabor Somogyi
+1 adding it any way.



Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Holden Karau
+1 for an external module.

On Mon, Feb 8, 2021 at 11:51 AM Cheng Su wrote:

> +1 for (2) adding to external module.
>
> I think this feature is useful and popular in practice, and option 2 does
> not conflict with the previous concern about the dependency.
>
> Thanks,
>
> Cheng Su
>
> *From: *Dongjoon Hyun
> *Date: *Monday, February 8, 2021 at 10:39 AM
> *To: *Jacek Laskowski
> *Cc: *Liang-Chi Hsieh, dev
> *Subject: *Re: [DISCUSS] Add RocksDB StateStore
>
> Thank you, Liang-chi and all.
>
> +1 for (2) external module design because it can deliver the new feature
> in a safe way.
>
> Bests,
>
> Dongjoon
>
> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski wrote:
>
> Hi,
>
> I'm "okay to add RocksDB StateStore as external module". See no reason not
> to.
>
> Regards,
>
> Jacek Laskowski
>
> https://about.me/JacekLaskowski
>
> "The Internals Of" Online Books <https://books.japila.pl/>
>
> Follow me on https://twitter.com/jaceklaskowski
>
> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh wrote:
>
> Hi devs,
>
> In Spark Structured Streaming, we need a state store for state management
> for stateful operators such as streaming aggregates, joins, etc. We have
> one and only one state store implementation now. It is an in-memory hashmap
> which is backed up to an HDFS-compatible file system at the end of every
> micro-batch.
>
> As it basically uses an in-memory map to store states, memory consumption
> is a serious issue and state store size is limited by the size of the
> executor memory. Moreover, a state store using more memory may impact the
> performance of task execution, which requires memory too.
>
> Internally we see more streaming applications that require large state in
> stateful operations. For such requirements, we need a StateStore that does
> not rely on memory to store states.
>
> This seems to be true externally as well, as several other major streaming
> frameworks already use RocksDB for state management. RocksDB is an embedded
> DB and streaming engines can use it to store state instead of memory
> storage.
>
> So it seems to me that it is proven to be a good choice for large state
> usage. But Spark SS still lacks a built-in state store for this
> requirement.
>
> Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore
> into Spark SS. IIUC, it was pushed back due to two concerns: extra code
> maintenance cost and the RocksDB dependency it introduces.
>
> For the first concern, as more users require the feature, it should become
> highly used code in SS and more developers will look at it. For the second
> one, we propose (SPARK-34198) to add it as an external module to relieve
> the dependency concern.
>
> Because it was pushed back previously, I'm raising this discussion to
> learn what people think about it now, in advance of submitting any code.
>
> I think there might be some possible opinions:
>
> 1. okay to add RocksDB StateStore into sql core module
> 2. not okay for 1, but okay to add RocksDB StateStore as external module
> 3. either 1 or 2 is okay
> 4. not okay to add RocksDB StateStore, no matter into sql core or as
> external module
>
> Please let us know if you have some thoughts.
>
> Thank you.
>
> Liang-Chi Hsieh
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Cheng Su
+1 for (2) adding to external module.
I think this feature is useful and popular in practice, and option 2 does not
conflict with the previous concern about the dependency.

Thanks,
Cheng Su


Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Dongjoon Hyun
Thank you, Liang-chi and all.

+1 for (2) external module design because it can deliver the new feature in
a safe way.

Bests,
Dongjoon


Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Jacek Laskowski
Hi,

I'm "okay to add RocksDB StateStore as external module". See no reason not
to.

Regards,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski





Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Liang-Chi Hsieh
Thank you for the input, Yikun! Let's take it into account when we are ready
to add the RocksDB state store to Spark SS.










Re: [DISCUSS] Add RocksDB StateStore

2021-02-07 Thread Yikun Jiang
I have worked on RocksDB multi-arch support and version upgrades in
Kafka/Storm/Flink [1][2][3]. To avoid the same issues happening again in
Spark, I want to give some input here on RocksDB version selection from a
multi-arch support point of view. Hope it helps.

RocksDB has supported Arm64 [4] since version 6.4.6. All Arm64-related
commits were also backported to 5.18.4, which was released as a version that
supports all platforms.

So, from a multi-arch support point of view, the better RocksDB version is
v6.4.6 or later, or v5.18.4 within the 5.x line.
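
As an illustration only, the version guidance above could be encoded as a
small check. This is a sketch in Python; the names parse_version and
has_multi_arch_support are hypothetical, and it assumes the literal reading
that v5.18.4 is the only 5.x release that qualifies.

```python
import re

# Thresholds taken from the guidance above.
MIN_6X = (6, 4, 6)        # first release with Arm64 support
BACKPORT_5X = (5, 18, 4)  # 5.x release carrying the Arm64 backports


def parse_version(text):
    """Parse 'v6.4.6' or '6.4.6' into a tuple of ints (hypothetical helper)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_multi_arch_support(text):
    """Return True if the version is >= 6.4.6 or is exactly 5.18.4."""
    version = parse_version(text)
    # Assumption: within 5.x only 5.18.4 qualifies, as stated above.
    return version >= MIN_6X or version == BACKPORT_5X
```

A build could call has_multi_arch_support on the pinned RocksDB version and
fail early when it returns False.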

[1] https://issues.apache.org/jira/browse/STORM-3599
[2] https://github.com/apache/kafka/pull/8284
[3] https://issues.apache.org/jira/browse/FLINK-13598
[4] https://github.com/facebook/rocksdb/pull/6250

Regards,
Yikun



Re: [DISCUSS] Add RocksDB StateStore

2021-02-03 Thread redsk
Hi, 

FYI, I have been using the project at
https://github.com/chermenin/spark-states
for a few months and it has been working well for me.

-Nico






[DISCUSS] Add RocksDB StateStore

2021-02-02 Thread Liang-Chi Hsieh
Hi devs,

In Spark Structured Streaming, we need a state store for state management for
stateful operators such as streaming aggregates, joins, etc. We have one and
only one state store implementation now. It is an in-memory hashmap that is
backed up to an HDFS-compliant file system at the end of every micro-batch.

As it basically uses an in-memory map to store state, memory consumption is a
serious issue and the state store size is limited by the size of the executor
memory. Moreover, the more memory the state store uses, the more it may impact
the performance of task execution, which requires memory too.
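
To make the current design concrete, here is a toy sketch in Python of an
in-memory map that is persisted at the end of every micro-batch. It is not
Spark's implementation: the class ToyInMemoryStateStore and its methods are
hypothetical, a local directory stands in for the HDFS-compliant file system,
and a full JSON snapshot per commit is assumed purely for simplicity.

```python
import json
import os
import tempfile


class ToyInMemoryStateStore:
    """Hypothetical stand-in for an in-memory, checkpointed state store."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.state = {}    # all state lives in process memory
        self.version = 0   # one version per committed micro-batch

    def put(self, key, value):
        self.state[key] = value

    def get(self, key, default=None):
        return self.state.get(key, default)

    def commit(self):
        """Persist the whole map at the end of a micro-batch."""
        self.version += 1
        path = os.path.join(self.checkpoint_dir, f"{self.version}.json")
        with open(path, "w") as f:
            json.dump(self.state, f)
        return self.version

    @classmethod
    def restore(cls, checkpoint_dir, version):
        """Reload the snapshot written by the given committed version."""
        store = cls(checkpoint_dir)
        with open(os.path.join(checkpoint_dir, f"{version}.json")) as f:
            store.state = json.load(f)
        store.version = version
        return store


def run_word_count(batches, checkpoint_dir):
    """Stateful word count: update state per batch, then commit."""
    store = ToyInMemoryStateStore(checkpoint_dir)
    for batch in batches:
        for word in batch:
            store.put(word, store.get(word, 0) + 1)
        store.commit()
    return store


if __name__ == "__main__":
    store = run_word_count([["a", "b", "a"], ["b", "c"]], tempfile.mkdtemp())
    print(store.version, store.state)
```

Because every key lives in the state dict, the size of the store is bounded by
process memory, which is the limitation described above.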

Internally we see more streaming applications that require large state in
stateful operations. For such requirements, we need a StateStore that does not
rely on memory to store state.

This also seems to be true externally, as several other major streaming
frameworks already use RocksDB for state management. RocksDB is an embedded
DB, and streaming engines can use it to store state instead of keeping it in
memory.

So it seems to me that it is proven to be a good choice for large state usage.
But Spark SS still lacks a built-in state store that meets this requirement.

Previously there was one attempt, SPARK-28120, to add a RocksDB StateStore to
Spark SS. IIUC, it was pushed back due to two concerns: the extra code
maintenance cost and the RocksDB dependency it introduces.

For the first concern, as more users need the feature, it should become
heavily used code in SS and more developers will look at it. For the second
one, we propose (SPARK-34198) to add it as an external module to relieve the
dependency concern.

Because it was pushed back previously, I'm raising this discussion to learn
what people think about it now, in advance of submitting any code.

I think there might be some possible opinions:

1. okay to add RocksDB StateStore into sql core module
2. not okay for 1, but okay to add RocksDB StateStore as external module
3. either 1 or 2 is okay
4. not okay to add RocksDB StateStore, no matter into sql core or as
external module

Please let us know if you have some thoughts.

Thank you.

Liang-Chi Hsieh



