[ 
https://issues.apache.org/jira/browse/FLINK-20496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317697#comment-17317697
 ] 

Yu Li commented on FLINK-20496:
-------------------------------

[~sewen] According to the RocksDB 
[documentation|https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters#cons],
 the constraints of partitioned index/filters are described as follows:

* Additional space for the top-level index: it is quite small, 0.1-1% of the 
index/filter size.
* More disk IO: if the top-level index is not already in cache, one additional 
IO results. To avoid that, the top-level index can either be stored on heap or 
stored in cache with high priority.
* Losing spatial locality: if a workload requires frequent yet random reads 
from the same SST file, it would result in loading a separate index/filter 
partition upon each read, which is less efficient than reading the entire 
index/filter at once. Although they did not observe this pattern in their 
benchmarks, it is only likely to happen for the L0/L1 layers of the LSM, for 
which partitioning can be disabled (TODO work).

Since [by 
default|https://github.com/apache/flink/blob/b17e7b5d3504a4b70a9c9bcf50175a235e9186fe/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBResourceContainer.java#L162]
 we cache index and filter blocks with high priority, I think we are fine on 
the first two points. The third one seems to affect only workloads with 
hotspots.

Personally I'm optimistic about turning this on by default, but would suggest 
being more cautious and doing more testing against our 
[micro-benchmarks|https://github.com/apache/flink-benchmarks] and the stateful 
cases in [nexmark|https://github.com/nexmark/nexmark].
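For reference, if the option proposed in the issue description below lands as 
described, a test setup matching the reported scenario could look like this in 
flink-conf.yaml (the new key's name and default come from the issue; it is not 
part of a released Flink version at the time of this comment):

```yaml
# Proposed by FLINK-20496; default false in the issue description.
state.backend.rocksdb.partitioned-index-filters: true

# Existing options referenced in this thread (values from the reported case
# and the documented default high-prio pool ratio).
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.fixed-per-slot: 256m
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
```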

Please let me know your thoughts [~sewen] [~liuyufei]. Thanks.

> RocksDB partitioned index filter option
> ---------------------------------------
>
>                 Key: FLINK-20496
>                 URL: https://issues.apache.org/jira/browse/FLINK-20496
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>    Affects Versions: 1.10.2, 1.11.2, 1.12.0
>            Reporter: YufeiLiu
>            Assignee: YufeiLiu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0
>
>
>   When using RocksDBStateBackend and enabling 
> {{state.backend.rocksdb.memory.managed}} and 
> {{state.backend.rocksdb.memory.fixed-per-slot}}, Flink strictly limits 
> RocksDB's memory usage, which covers the write buffer and the block cache. 
> With these options RocksDB stores index and filter blocks in the block 
> cache, because with the default options the index/filters can grow without 
> bound.
>   But this leads to another issue: if the high-priority cache (configured by 
> {{state.backend.rocksdb.memory.high-prio-pool-ratio}}) cannot fit all 
> index/filter blocks, RocksDB loads all the metadata from disk on a cache 
> miss, and the program becomes extremely slow. According to [Partitioned Index 
> Filters|https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters][1],
>  we can enable a two-level index to get acceptable performance even when the 
> index/filter cache misses. 
>   Enabling these options made my case over 10x faster[2]. I think we can add 
> an option {{state.backend.rocksdb.partitioned-index-filters}}, with default 
> value false, so this feature can be enabled easily.
> [1] Partitioned Index Filters: 
> https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters
> [2] Deduplicate scenario, state.backend.rocksdb.memory.fixed-per-slot=256M, 
> SSD, elapsed time 4.91ms -> 0.33ms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
