darion yaphet created SPARK-56734:
-------------------------------------

             Summary: Optimize RocksDBPersistenceEngine by segregating data 
into distinct Column Families
                 Key: SPARK-56734
                 URL: https://issues.apache.org/jira/browse/SPARK-56734
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 4.3.0
            Reporter: darion yaphet


*Motivation*
Currently, {{RocksDBPersistenceEngine}} in the Spark Master stores all metadata 
(Applications, Workers, Drivers) in a single default Column Family, using key 
prefixes to distinguish them. This causes significant performance issues during 
recovery:
 * *Inefficient Scanning:* Reading a specific type (e.g., Applications) 
requires scanning the entire database and performing expensive string prefix 
matching, leading to *O(N_total)* complexity.
 * *High Overhead:* The current approach wastes CPU on string operations and 
causes cache contention between different data types.

*Proposed Solution*
Refactor {{RocksDBPersistenceEngine}} to use native *Column Families* for data 
isolation (e.g., separate CFs for Apps, Workers, and Drivers).
 * Eliminate key prefixing logic and route data directly to the corresponding 
{{ColumnFamilyHandle}}.
 * Allow the engine to scan only the relevant Column Family during recovery.
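
The difference between the two layouts can be sketched with a small, 
self-contained toy model. This is not the engine's actual code and does not use 
the RocksDB API: plain Python dictionaries stand in for the keyspaces, and the 
class names ({{PrefixedStore}}, {{FamilyStore}}), the family names and the 
{{keys_examined}} counter are all hypothetical, introduced only to illustrate 
the read cost described in this issue.

```python
# Toy model only: dicts stand in for RocksDB keyspaces. All names below are
# hypothetical and chosen for illustration; none come from Spark or RocksDB.
FAMILIES = ("app", "worker", "driver")


class PrefixedStore:
    """Single keyspace; the record type is encoded as a key prefix."""

    def __init__(self):
        self.data = {}
        self.keys_examined = 0

    def put(self, family, key, value):
        self.data[f"{family}_{key}"] = value

    def read(self, family):
        # Models the scan described in the motivation: every key is visited
        # and string-matched against the prefix.
        prefix = f"{family}_"
        values = []
        for key, value in self.data.items():
            self.keys_examined += 1
            if key.startswith(prefix):
                values.append(value)
        return values


class FamilyStore:
    """One keyspace per record type, mimicking separate column families."""

    def __init__(self):
        self.families = {name: {} for name in FAMILIES}
        self.keys_examined = 0

    def put(self, family, key, value):
        # Route directly to the family; no prefix is added to the key.
        self.families[family][key] = value

    def read(self, family):
        # Only the requested family is visited.
        values = list(self.families[family].values())
        self.keys_examined += len(values)
        return values


def populate(store):
    """Insert 3 apps, 5 workers and 2 drivers (10 records in total)."""
    for i in range(3):
        store.put("app", f"a{i}", f"app-{i}")
    for i in range(5):
        store.put("worker", f"w{i}", f"worker-{i}")
    for i in range(2):
        store.put("driver", f"d{i}", f"driver-{i}")
    return store


prefixed = populate(PrefixedStore())
by_family = populate(FamilyStore())
prefixed.read("app")
by_family.read("app")
print(prefixed.keys_examined, by_family.keys_examined)  # prints "10 3"
```

In this model, reading the applications touches all 10 keys in the prefixed 
layout but only the 3 application keys in the per-family layout, which is the 
O(N_total) versus O(N_type) distinction. Real RocksDB iterators, seeks and 
caching behave differently from a dictionary, so the counts are illustrative 
rather than a benchmark.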

*Benefits*
 * *Faster Recovery:* Optimizes read complexity from *O(N_total)* to 
*O(N_type)*, drastically reducing Master startup time.
 * *Better Performance:* Removes string matching overhead and improves Block 
Cache hit rates.
 * *Granular Control:* Enables independent configuration (e.g., compression, 
TTL) for different metadata types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

