[ 
https://issues.apache.org/jira/browse/FLINK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tzu-Li (Gordon) Tai closed FLINK-22752.
---------------------------------------
    Fix Version/s: statefun-3.1.0
       Resolution: Fixed

flink-statefun/master: 33abb7bbddf7dda2471bd8935731e00716e6b077
flink-statefun-docker/master: 060784466eec1732a282016060c6db060ff34a28

> Add robust default state configuration to StateFun
> --------------------------------------------------
>
>                 Key: FLINK-22752
>                 URL: https://issues.apache.org/jira/browse/FLINK-22752
>             Project: Flink
>          Issue Type: Sub-task
>            Reporter: Stephan Ewen
>            Assignee: Tzu-Li (Gordon) Tai
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: statefun-3.1.0
>
>
> We aim to reduce the state configuration complexity by applying a default 
> configuration with robust settings, based on lessons learned in Flink.
> *(1) Always use RocksDB.*
> _That is already the case._
> We keep this for now, as long as the only other alternative are backends with 
> Objects on the heap, which are tricky in terms of predictable JVM 
> performance. RocksDB has a significant performance cost, but more robust 
> behavior.
> *(2) Activate local recovery by default.*
> That makes recovery cheao for soft tasks failures and gracefully cancelled 
> tasks.
> We need to set these options:
>   - {{state.backend.local-recovery: true}}
>   - {{taskmanager.state.local.root-dirs: <dir>}} - some local directory that 
> will not possibly be wiped by the OS periodically, so typically some local 
> directory that is not {{/tmp}}, for example {{/local/state/recovery}}.
>   - {{state.backend.rocksdb.localdir: <dir>}} - a directory on the same FS / 
> device as above, so that one can create hard links between them (required for 
> RocksDB local checkpoints), for example {{/local/state/rocksdb}}.
> Flink will most likely adopt this as a default setting as well in the future.
> It still makes sense to pre-configer a different RocksDB working directory 
> than {{/tmp}}.
> *(3) Activate partitioned indexes by default.*
> This may cost minimal performance in some cases, but can avoid massive 
> performance regression in cases where the index blocks no longer fit into the 
> memory cache (may happen more frequently when there are too many 
> ColumnFamilies = states).
> Set {{state.backend.rocksdb.memory.partitioned-index-filters: true}}.
> See FLINK-20496 for details.
> *(4) Increase number of transfer threads by default.*
> This speeds up state recovery in many cases. The default value in Flink is a 
> bit conservative, to avoid spamming DFS (like HDFS) by default. The more 
> cloud-centric StateFun setups should be safe to use higher default value.
> Set {{state.backend.rocksdb.checkpoint.transfer.thread.num: 8}}.
> *(5) Increase RocksDB compaction threads by default.*
> The number of RocksDB compaction threads is frequently a bottleneck.
> Increasing it costs virtually nothing and mitigates that bottleneck in most 
> cases.
> {{state.backend.rocksdb.thread.num: 4}}
> _(this value is chosen under the assumption that there is only one slot per 
> TM in StateFun)._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to