[ https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284336#comment-17284336 ]
Jungtaek Lim commented on SPARK-34198:
--------------------------------------
Now we have consensus on having this, which is great. Thanks for raising the
discussion and gathering consensus.
Which implementation we should take as the baseline remains an open question.
The two implementations I'm aware of are:
1. https://github.com/chermenin/spark-states
2. https://github.com/apache/spark/pull/24922
Ideally it'd be nice to take the first one, as it has been the known way to
use a RocksDB state store with SS for years. We would probably need to reach
out to the maintainer of the project, but in most cases maintainers tend to be
happy to donate the code (with a simple PR, as it doesn't seem to be tied to a
specific employer), so it's worth trying.
The second one is probably easier to continue, as we can build on the existing
PR without explicit approval from the original author. The one thing we should
keep in mind is "retaining" the main authorship; apart from that, taking it
over wouldn't be a problem.
Another crazy idea (but probably the most stable among all implementations) is
asking Databricks to donate its commercial implementation of the RocksDB state
store. I guess it has been used by their customers for years, so it should be
proven to be relatively stable.
[~rxin] Given that you'd also give +1 on adding a RocksDB state store to the
Spark codebase, is there any chance Databricks could donate the existing
implementation to Spark? Or is it just me, and would it be a crazy idea to ask
this?
> Add RocksDB StateStore as external module
> -----------------------------------------
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Assignee: L. C. Hsieh
> Priority: Major
>
> Currently Spark SS has only one built-in StateStore implementation,
> HDFSBackedStateStore, which actually uses an in-memory map to store state
> rows. As there are more and more streaming applications, some of them require
> large state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state
> management, so it is proven to be a good choice for large state usage. But
> Spark SS still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore
> to Spark SS. To address the concern about adding RocksDB as a direct
> dependency, our plan is to add this StateStore as an external module first.