HeartSaVioR commented on a change in pull request #33683:
URL: https://github.com/apache/spark/pull/33683#discussion_r684946042
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1814,6 +1814,23 @@ Specifically for built-in HDFS state store provider, users can check the state s
it is best if cache missing count is minimized that means Spark won't waste too much time on loading checkpointed state.
User can increase Spark locality waiting configurations to avoid loading state store providers in different executors across batches.
+### RocksDB state store implementation
+
+As of Spark 3.2, we add a new built-in state store implementation, RocksDB state store provider.
+
+The current built-in HDFS state store provider has two major drawbacks:
+
+* The amount of state that can be maintained is limited by the heap size of the executors
+* State expiration by watermark and/or timeouts requires full scans over all the data
+
+The RocksDB-based State Store implementation can address these drawbacks:
+
+* RocksDB can serve data from the disk with a configurable amount of non-JVM memory.
+* Sorting keys using the appropriate column should avoid full scans to find the to-be-dropped keys.
Review comment:
Please correct me if I'm missing something; while this could be something we can evaluate and address, it is not true at least for now. We don't distinguish the event-time field in the state store. Prefix scan is the only thing we leverage sorted keys for at the moment.
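
For readers following the quoted doc text: switching to the RocksDB provider is configuration-only. Below is a minimal sketch, assuming Spark 3.2+; the config key `spark.sql.streaming.stateStore.providerClass` and the `RocksDBStateStoreProvider` class are the ones this section refers to, while the app name, rate source, output sink, and checkpoint path are purely illustrative placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the RocksDB state store provider for all stateful operators
// of this session (Spark 3.2+). Local master is used only to keep the example
// self-contained.
val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  .master("local[2]")
  .config(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()

// Any stateful operation (aggregation, dropDuplicates, mapGroupsWithState, ...)
// now keeps its state in RocksDB on the executors' local disks rather than on
// the JVM heap, which is what addresses the first drawback listed above.
val counts = spark.readStream
  .format("rate")          // built-in test source, used here only for illustration
  .load()
  .groupBy("value")
  .count()

val query = counts.writeStream
  .format("console")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/rocksdb-sketch-checkpoint") // hypothetical path
  .start()

query.awaitTermination()
```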
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]