[ 
https://issues.apache.org/jira/browse/FLINK-11172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flink Jira Bot updated FLINK-11172:
-----------------------------------
      Labels: auto-deprioritized-major auto-unassigned  (was: auto-unassigned 
stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 days ago and has not received any updates so 
it is being deprioritized. If this ticket is actually Major, please raise the 
priority and ask a committer to assign you the issue or revive the public 
discussion.


> Remove the max retention time in StreamQueryConfig
> --------------------------------------------------
>
>                 Key: FLINK-11172
>                 URL: https://issues.apache.org/jira/browse/FLINK-11172
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / API
>    Affects Versions: 1.8.0
>            Reporter: Yangze Guo
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-unassigned
>
> [Stream Query 
> Config|https://ci.apache.org/projects/flink/flink-docs-master/dev/table/streaming/query_configuration.html]
>  is an important and useful feature for trading off accuracy against 
> resource consumption when a query is executed on unbounded streaming data. 
> The feature was first proposed in 
> [FLINK-6491|https://issues.apache.org/jira/browse/FLINK-6491].
> Originally, *QueryConfig* took two parameters, 
> minIdleStateRetentionTime and maxIdleStateRetentionTime, so that fewer timers 
> need to be registered when there is some freedom in when to discard state. 
> However, this approach can cause newer data to expire earlier than older data, 
> which leads to greater accuracy loss in some cases. For example, suppose we 
> have an unbounded keyed stream. We process key *_a_* at _*t0*_ and key _*b*_ 
> at _*t1*_, where *_t0 < t1_*. The state of *_a_* will expire at 
> _*t0+maxIdleStateRetentionTime*_, while the state of _*b*_ will expire at 
> *_t1+maxIdleStateRetentionTime_*. Now another record with key *_a_* arrives 
> at _*t2 (t1 < t2)*_, but _*t2+minIdleStateRetentionTime*_ < 
> _*t0+maxIdleStateRetentionTime*_, so the cleanup time of *_a_* is not 
> updated. The state of key *_a_* will still expire at 
> _*t0+maxIdleStateRetentionTime*_, which is earlier than the state of key 
> _*b*_. According to the 
> [LRU|https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU)]
>  principle, the elements that were used most heavily in the recent past are 
> the most likely to be used heavily in the near future, so the state of key 
> _*a*_ should live longer than the state of key _*b*_. The current approach 
> contradicts this idea.
> I think we now have a good opportunity to remove the 
> maxIdleStateRetentionTime argument from *StreamQueryConfig*. My reasons are 
> as follows.
>  * [FLINK-9423|https://issues.apache.org/jira/browse/FLINK-9423] implemented 
> efficient deletes for the heap-based timer service. We can use timer deletion 
> to keep the number of registered timers under control.
>  * The current approach can cause newer data to expire earlier than older 
> data, which leads to greater accuracy loss in some cases. Users need to 
> fine-tune the two parameters to avoid this scenario. Directly following the 
> LRU idea looks like a better solution.
> So I plan to remove maxIdleStateRetentionTime and make the expiration time 
> depend only on _*minIdleStateRetentionTime*_.
> cc [~sunjincheng121], [~fhueske] 
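
The scenario in the description can be reproduced with a small simulation. This is only a sketch of the behaviour as the issue describes it, not Flink's actual implementation: the function names, the retention values (10 and 30) and the timestamps (t0=0, t1=5, t2=12) are all hypothetical and chosen so that t2+min < t0+max.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical retention settings and event times, chosen for illustration only.
MIN_RETENTION = 10
MAX_RETENTION = 30


def cleanup_min_max(current: Optional[int], now: int) -> int:
    """Min/max scheme as described in the issue (simplified model).

    The cleanup time is only moved when now + min exceeds the currently
    scheduled cleanup time; it is then set to now + max.
    """
    if current is None or now + MIN_RETENTION > current:
        return now + MAX_RETENTION
    return current


def cleanup_min_only(current: Optional[int], now: int) -> int:
    """Proposed scheme: every access moves the cleanup time to now + min."""
    return now + MIN_RETENTION


def simulate(
    events: List[Tuple[str, int]],
    update: Callable[[Optional[int], int], int],
) -> Dict[str, int]:
    """Return the scheduled cleanup time per key after processing all events."""
    cleanup: Dict[str, int] = {}
    for key, ts in events:
        cleanup[key] = update(cleanup.get(key), ts)
    return cleanup


# Key a at t0=0, key b at t1=5, key a again at t2=12.
events = [("a", 0), ("b", 5), ("a", 12)]

old = simulate(events, cleanup_min_max)
new = simulate(events, cleanup_min_only)

print(old)  # {'a': 30, 'b': 35}
print(new)  # {'a': 22, 'b': 15}
```

Under the min/max model, key a is cleaned up at 30, before key b at 35, even though a was accessed more recently. With the min-only rule, a (22) outlives b (15), which matches the LRU argument above. The min-only rule moves the cleanup time on every access, which is why it relies on cheap timer deletion.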



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
