[ https://issues.apache.org/jira/browse/SPARK-43244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717407#comment-17717407 ]

Anish Shrigondekar commented on SPARK-43244:
--------------------------------------------

Yes, correct. All of the usage (block cache, memtables, and filter/index 
blocks) will be accounted for through the block cache, and we are getting 
rid of writeBatchWithIndex. So, with the proposed changes, we can also cap 
memory usage on a per-node basis.
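
As a rough sketch, that setup looks something like the following with the 
RocksJava API (the sizes and path here are illustrative, not the actual 
proposed defaults):

    import org.rocksdb._

    RocksDB.loadLibrary()

    // One cache whose capacity becomes the cap for data blocks, memtables,
    // and filter/index blocks combined. Sharing a single cache across all
    // instances on a node is what allows a per-node cap.
    val memoryBudget = 512L * 1024 * 1024
    val cache = new LRUCache(memoryBudget)

    // Charge memtable allocations against the same cache.
    val writeBufferManager = new WriteBufferManager(memoryBudget / 2, cache)

    val tableConfig = new BlockBasedTableConfig()
      .setBlockCache(cache)
      // Charge filter/index blocks to the cache as well, instead of
      // letting them grow outside of it.
      .setCacheIndexAndFilterBlocks(true)
      .setPinL0FilterAndIndexBlocksInCache(true)

    val options = new Options()
      .setCreateIfMissing(true)
      .setTableFormatConfig(tableConfig)
      .setWriteBufferManager(writeBufferManager)

    val db = RocksDB.open(options, "/tmp/state-store-example")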

> RocksDB State Store can accumulate unbounded native memory
> ----------------------------------------------------------
>
>                 Key: SPARK-43244
>                 URL: https://issues.apache.org/jira/browse/SPARK-43244
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.3.2
>            Reporter: Adam Binford
>            Priority: Major
>
> We noticed in one of our production stateful streaming jobs using RocksDB 
> that an executor with 20g of heap was using around 40g of resident memory. I 
> noticed a single RocksDB instance was using around 150 MiB of memory, and 
> only 5 MiB or so of this was from the write batch (which is now cleared after 
> committing).
> After reading about RocksDB memory usage (this link was helpful: 
> https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md)
> I realized a lot of this was likely the "Index and Filters" memory usage. 
> This job is doing a streaming deduplication with a lot of unique keys, so 
> it makes sense that these block usages would be high. The problem is that, 
> as it stands now, the underlying RocksDB instance stays open on an executor 
> as long as that executor is assigned that stateful partition (so it can be 
> reused across batches). As a result, a single executor can accumulate a 
> large number of open RocksDB instances at once, each using a certain amount 
> of native memory. In the worst case, a single executor could need to keep 
> every single partition's RocksDB instance open at once.
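> You can see where a single instance's native memory goes with RocksDB's 
> built-in memory properties; a small illustrative helper using the 
> RocksJava API:
>
>     import org.rocksdb.RocksDB
>
>     def printNativeMemory(db: RocksDB): Unit = {
>       Seq(
>         "rocksdb.estimate-table-readers-mem", // index/filter blocks held by table readers
>         "rocksdb.cur-size-all-mem-tables",    // memtables
>         "rocksdb.block-cache-usage"           // block cache contents
>       ).foreach(p => println(s"$p = ${db.getProperty(p)} bytes"))
>     }
>
> None of that memory is freed until the instance is closed, so those totals 
> get multiplied by however many instances the executor keeps open.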
> There are a couple of ways you can control the amount of memory used, such 
> as limiting the max open files, or adding the option to use the block cache 
> for the indices and filters, but neither of these solves the underlying 
> problem of accumulating native memory from multiple partitions on an 
> executor.
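> For reference, those two knobs look roughly like this in the RocksJava API 
> (values are illustrative):
>
>     import org.rocksdb.{BlockBasedTableConfig, Options}
>
>     val options = new Options()
>       // Bounds table-reader (index/filter) memory by limiting how many
>       // SST files can be open, at the cost of re-opening files on reads.
>       .setMaxOpenFiles(100)
>       // Moves index/filter blocks into the block cache, so they compete
>       // for (and are bounded by) the cache's capacity.
>       .setTableFormatConfig(
>         new BlockBasedTableConfig().setCacheIndexAndFilterBlocks(true))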
> The real fix needs to be a mechanism, and an option to enable it, that 
> closes the underlying RocksDB instance at the end of each task. That way 
> you can choose to only ever have one RocksDB instance open at a time, 
> giving predictable memory usage no matter the size of your data or number 
> of shuffle partitions.
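> A sketch of what that could look like, assuming a hypothetical openRocksDb 
> helper standing in for the state store's load path:
>
>     import org.apache.spark.TaskContext
>     import org.rocksdb.RocksDB
>
>     // Inside a task: open (or load) the instance for this partition...
>     val db: RocksDB = openRocksDb(TaskContext.getPartitionId())
>
>     // ...and close it when the task completes, releasing memtables,
>     // table readers, and other native memory, so that at most one
>     // instance per task slot is open at a time.
>     TaskContext.get().addTaskCompletionListener[Unit] { _ =>
>       db.close()
>     }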
> We are running this on Spark 3.3, but we have just kicked off a test to 
> see if things are any different in Spark 3.4.


