[
https://issues.apache.org/jira/browse/FLINK-11172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-11172:
-----------------------------------
Labels: auto-deprioritized-major auto-deprioritized-minor auto-unassigned
(was: auto-deprioritized-major auto-unassigned stale-minor)
Priority: Not a Priority (was: Minor)
This issue was labeled "stale-minor" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Minor, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Remove the max retention time in StreamQueryConfig
> --------------------------------------------------
>
> Key: FLINK-11172
> URL: https://issues.apache.org/jira/browse/FLINK-11172
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / API
> Affects Versions: 1.8.0
> Reporter: Yangze Guo
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> auto-unassigned
>
> [Stream Query
> Config|https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/query_configuration.html]
> is an important and useful feature for trading off accuracy against
> resource consumption when a query runs on unbounded streaming data.
> This feature was first proposed in
> [FLINK-6491|https://issues.apache.org/jira/browse/FLINK-6491].
> Initially, *QueryConfig* took two parameters, minIdleStateRetentionTime and
> maxIdleStateRetentionTime, so that fewer timers need to be registered when
> there is some freedom in when to discard state. However, this approach can
> cause newer data to expire earlier than older data, which leads to greater
> accuracy loss in some cases. For example, consider an unbounded keyed
> stream. We process key *_a_* at _*t0*_ and key _*b*_ at _*t1*_, with
> *_t0 < t1_*. The state of *_a_* expires at
> _*t0+maxIdleStateRetentionTime*_, while the state of _*b*_ expires at
> *_t1+maxIdleStateRetentionTime_*. Now another record with key *_a_* arrives
> at _*t2 (t1 < t2)*_, but _*t2+minIdleStateRetentionTime*_ <
> _*t0+maxIdleStateRetentionTime*_. The state of key *_a_* will still expire
> at _*t0+maxIdleStateRetentionTime*_, which is earlier than the state of key
> _*b*_. According to the
> [LRU|https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)]
> guideline, elements that have been used most heavily in the past few
> instructions are most likely to be used heavily in the next few
> instructions too, so the state with key _*a*_ should live longer than the
> state with key _*b*_. The current approach goes against this idea.
> I think we now have a good opportunity to remove the
> maxIdleStateRetentionTime argument from *StreamQueryConfig*. My reasons are
> below.
> * [FLINK-9423|https://issues.apache.org/jira/browse/FLINK-9423] implemented
> efficient deletes for the heap-based timer service. We can use timer
> deletion to avoid excessive timer registration.
> * The current approach can cause newer data to expire earlier than older
> data, which leads to greater accuracy loss in some cases. Users need to
> fine-tune the two parameters to avoid this scenario. Directly following the
> idea of LRU looks like a better solution.
> So I plan to remove maxIdleStateRetentionTime and make the expiration time
> depend only on _*minIdleStateRetentionTime*_.
> cc to [~sunjincheng121], [~fhueske]
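
The expiry scenario in the quoted description can be illustrated with a small simulation. This is only a sketch of the behaviour as the issue describes it, not Flink's actual implementation: the names `KeyState`, `touch_min_max` and `touch_min_only`, the retention values, and the timestamps are all hypothetical and chosen so that t0 < t1 < t2 and t2 + min < t0 + max.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyState:
    """Per-key bookkeeping: the time at which the state is scheduled for cleanup."""
    cleanup_time: Optional[int] = None


def touch_min_max(state: KeyState, now: int, min_ret: int, max_ret: int) -> None:
    """Policy as described in the issue (assumed model): a new cleanup time is
    registered only if now + min_ret would outlive the one already scheduled."""
    if state.cleanup_time is None or now + min_ret > state.cleanup_time:
        state.cleanup_time = now + max_ret


def touch_min_only(state: KeyState, now: int, min_ret: int) -> None:
    """Proposed policy (assumed model): every access moves the cleanup time to
    now + min_ret, which presumes the old timer can be deleted cheaply."""
    state.cleanup_time = now + min_ret


MIN_RET, MAX_RET = 10, 100  # hypothetical retention times
T0, T1, T2 = 0, 5, 8        # key a at t0, key b at t1, key a again at t2

# Current approach: the second access to a does not extend its lifetime.
a, b = KeyState(), KeyState()
touch_min_max(a, T0, MIN_RET, MAX_RET)
touch_min_max(b, T1, MIN_RET, MAX_RET)
touch_min_max(a, T2, MIN_RET, MAX_RET)
current = (a.cleanup_time, b.cleanup_time)

# Proposed approach: the most recently used key lives longest.
a, b = KeyState(), KeyState()
touch_min_only(a, T0, MIN_RET)
touch_min_only(b, T1, MIN_RET)
touch_min_only(a, T2, MIN_RET)
proposed = (a.cleanup_time, b.cleanup_time)

print("current  (a, b):", current)   # (100, 105): a expires before b
print("proposed (a, b):", proposed)  # (18, 15): a outlives b
```

With these numbers, the min/max model expires key a before key b even though a was accessed more recently, while the min-only model keeps a alive longer, which matches the LRU argument made in the issue.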