HeartSaVioR commented on a change in pull request #33683:
URL: https://github.com/apache/spark/pull/33683#discussion_r684990617
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1814,6 +1814,23 @@ Specifically for built-in HDFS state store provider,
users can check the state s
it is best if cache missing count is minimized that means Spark won't waste
too much time on loading checkpointed state.
User can increase Spark locality waiting configurations to avoid loading state
store providers in different executors across batches.
+### RocksDB state store implementation
+
+As of Spark 3.2, we add a new build-in state store implementation, RocksDB
state store provider.
+
+The current build-in HDFS state store provider has two major drawbacks:
+
+* The amount of state that can be maintained is limited by the heap size of
the executors
+* State expiration by watermark and/or timeouts require full scans over all
the data
+
+The RocksDB-based State Store implementation can address these drawbacks:
+
+* RocksDB can serve data from the disk with a configurable amount of non-JVM
memory.
+* Sorting keys using the appropriate column should avoid full scans to find
the to-be-dropped keys.
+
+To enable the new build-in state store implementation, set
`spark.sql.streaming.stateStore.providerClass`
+to `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider`.
Review comment:
Additionally, it would be nice if we explain the configs regarding
RocksDB state store provider, as these configs are not a part of SQLConf and
are not documented.
```
// Configuration that specifies whether to compact the RocksDB data every
time data is committed
private val COMPACT_ON_COMMIT_CONF = ConfEntry("compactOnCommit", "false")
private val PAUSE_BG_WORK_FOR_COMMIT_CONF =
ConfEntry("pauseBackgroundWorkForCommit", "true")
private val BLOCK_SIZE_KB_CONF = ConfEntry("blockSizeKB", "4")
private val BLOCK_CACHE_SIZE_MB_CONF = ConfEntry("blockCacheSizeMB", "8")
private val LOCK_ACQUIRE_TIMEOUT_MS_CONF =
ConfEntry("lockAcquireTimeoutMs", "60000")
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]