[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284336#comment-17284336
 ] 

Jungtaek Lim commented on SPARK-34198:
--------------------------------------

Now we have consensus on having this, which is great. Thanks for raising the 
discussion and gathering consensus.

Which implementation we should take as the baseline remains an open question. 
The two implementations I'm aware of are:

1. https://github.com/chermenin/spark-states
2. https://github.com/apache/spark/pull/24922

Ideally we'd take the first one, as it has been the known way to use a RocksDB 
state store with SS for years. We would probably need to persuade the 
maintainer of the project, but in most cases maintainers tend to be happy to 
donate the code (with a simple PR, as it doesn't seem to be tied to a specific 
employer), so it's worth trying.

The second one is probably easier to continue, as we can build on the existing 
PR without explicit approval from the original author. The one thing to keep in 
mind is "retaining" the main authorship; apart from that, taking it over 
wouldn't be a problem.

Another crazy idea (but probably the most stable of all the implementations) is 
asking Databricks to donate its commercial implementation of the RocksDB state 
store. I guess it has been used by their customers for years, so it should be 
proven to be relatively stable.

[~rxin] Given that you'd also give +1 on adding a RocksDB state store to the 
Spark codebase, is there any chance Databricks would donate the existing 
implementation to Spark? Or is it just me, and would asking this be a crazy idea?

> Add RocksDB StateStore as external module
> -----------------------------------------
>
>                 Key: SPARK-34198
>                 URL: https://issues.apache.org/jira/browse/SPARK-34198
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>
> Currently Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which actually uses an in-memory map to store state 
> rows. As there are more and more streaming applications, some of them require 
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is proven to be a good choice for large state. But 
> Spark SS still lacks a built-in state store that meets this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. To address the concern about adding RocksDB as a direct 
> dependency, our plan is to add this StateStore as an external module first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

