Hi, Becket.

Thank you very much for your reply. That works for me. I have also updated the wiki based on the comments.
Looking forward to your next review.

Best,
Yuepeng Pan

---- Replied Message ----
From: Becket Qin <becket....@gmail.com>
Date: 08/23/2025 04:23
To: <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-490: Enhanced Job History Retention Policies for HistoryServer

Hi Yuepeng,

I think it would be better to keep the configurations straightforward instead of conditional, if possible. How about just adding one TTL config? We would remove a job archive either when its TTL has passed, or when the retained job count has been reached and the job is the earliest job.

Thanks,
Jiangjie (Becket) Qin

On Fri, Aug 22, 2025 at 2:55 AM Yuepeng Pan <panyuep...@apache.org> wrote:

Would anyone like to discuss this FLIP? I'd appreciate your feedback and suggestions.

Best,
Yuepeng Pan

On 2025/08/20 07:13:44 Yuepeng Pan wrote:

Hi, Becket.

Thank you for the clarification. Let me try to revisit these two questions with an explanation:

> I meant to ask what is the use case for ttlOrQuantity mode? Is it sufficient to delete the job archive when either TTL or quantity is reached if both are set?

As the configuration key 'historyserver.archive.retained-jobs.mode' literally suggests, this policy governs the retention mode for archived historical jobs. When set to 'ttlOrQuantity', a target file is retained if either of the following conditions is met (in other words, deletion occurs only if both conditions are unsatisfied):

- The file count is within the maximum retention threshold.
- The file remains within the TTL (Time to Live) period.

> Regarding the case when there are multiple history server instances, if we don't enforce a behavior, users can go with either a) or b), and it would just be up to the user to choose. We need to document the behavior properly.

Thanks for the comment. I added the related content as a note [1] on the new configuration 'historyserver.archive.retained-jobs.mode'.
In the subsequent implementation phase, this part of the description will be refined and added to the corresponding configuration documentation.

Best,
Yuepeng Pan

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-PublicInterfaces

On 2025/08/20 04:55:24 Becket Qin wrote:

Hi Yuepeng,

Sorry for the confusion. I meant to ask what is the use case for ttlOrQuantity mode? Is it sufficient to delete the job archive when either TTL or quantity is reached if both are set?

Regarding the case when there are multiple history server instances, if we don't enforce a behavior, users can go with either a) or b), and it would just be up to the user to choose. We need to document the behavior properly.

Thanks,
Jiangjie (Becket) Qin

On Mon, Aug 18, 2025 at 10:28 PM Yuepeng Pan <panyuep...@apache.org> wrote:

Hi, Becket.

Thank you for your comments.

> 1. What is the use case for ttlAndQuantity mode? It seems usually the desired behavior is ttlOrQuantity. If so, we can just add a ttl retention config.

The ttlAndQuantity mode means that files in the remote directory are retained only if their modification time is within the valid TTL and the total number of files does not exceed the maximum limit. One of the main purposes of this configuration item is to guard against the following situations:

- Within the TTL, the number of files grows too large, leading to excessive storage usage or too many files.
- Files remain within the file quantity threshold, but their modification times far exceed the TTL.

> 2. When there are multiple history server instances with different configurations, they are working independently today and may have conflicting configs. This is an existing problem, but since we are adding more configs to the retention policy, it increases the chance of config conflicts. It would be good to have a clear user story for when there are multiple history server instances.
This is indeed a good question. What do you think about adding a description like the following to the newly introduced configuration item section in the FLIP?

a. If multiple HistoryServer instances use the same historyserver.archive.fs.dir directory as the refresh directory, you should enable and configure this feature in only one HistoryServer instance to avoid errors caused by multiple instances cleaning up remote files simultaneously.

-OR-

b. If multiple HistoryServer instances use the same historyserver.archive.fs.dir directory as the refresh directory, you need to keep the value of this configuration consistent across them.

Regardless of whether option a or option b is chosen, it is necessary to enhance the corresponding exception handling when reading from and deleting remote files.

I’m really looking forward to hearing other suitable proposals for the items above. Please let me know your opinion.

Best,
Yuepeng Pan

At 2025-08-19 00:37:26, "Becket Qin" <becket....@gmail.com> wrote:

Thanks for the proposal, Yuepeng.

I think this FLIP is mostly orthogonal to FLIP-505. This FLIP essentially tries to improve the retention policy of the actual archives, while FLIP-505 mainly focuses on caching. One connection between the two FLIPs might be that when the actual archive expires and gets removed, it might make sense to also remove the local cache.

A few questions about this FLIP:

1. What is the use case for ttlAndQuantity mode? It seems usually the desired behavior is ttlOrQuantity. If so, we can just add a ttl retention config.

2. When there are multiple history server instances with different configurations, they are working independently today and may have conflicting configs. This is an existing problem, but since we are adding more configs to the retention policy, it increases the chance of config conflicts. It would be good to have a clear user story for when there are multiple history server instances.
Thanks,
Jiangjie (Becket) Qin

On Thu, Aug 14, 2025 at 1:56 PM Allison <achang5...@gmail.com> wrote:

Hi Yuepeng,

Looks like this work can have some symbiosis with the change that I've proposed here in FLIP-505. This addresses the question that Ryan asked about whether or not remotely stored job archives will be impacted if the retention is changed.

Feel free to take a look at the FLIP as well as the PR for FLIP-505. Looks like we have the opportunity to significantly improve the History Server with these two changes.

FLIP-505: https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
PR: https://github.com/apache/flink/pull/26878

Best,
Allison

On Thu, Aug 14, 2025 at 9:51 AM Yuepeng Pan <panyuep...@apache.org> wrote:

Hi, Ryan van Huuksloot.

> Might be worth stating that explicitly in the FLIP.

Nice idea! I added a sub-section here [1] to clarify the item. Thanks a lot!

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-Thetimingtocheckwhethertargetfileshaveexceededtheretentionthresholds

Best,
Yuepeng Pan

On 2025/08/14 16:27:39 Ryan van Huuksloot wrote:

That sounds like a good option. Might be worth stating that explicitly in the FLIP.

No other questions from me - will be a nice extension!

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform
[image: Shopify] <https://www.shopify.com/?utm_medium=salessignatures&utm_source=hs_email>

On Thu, Aug 14, 2025 at 12:22 PM Yuepeng Pan <panyuep...@apache.org> wrote:

Hi, Ryan van Huuksloot.

> Are you planning on having a thread to check for TTL? Or what is the plan for TTL? The quantity based would have a check when a new job is archived?
Just like the implementation in the POC [1], what do you think about continuing to follow the process where the HistoryServer#start method periodically invokes HistoryServerArchiveFetcher#fetchArchives, based on 'historyserver.archive.fs.refresh-interval', to check whether target files should be retained? Of course, I'm very open to hearing about other, potentially better implementation approaches. Please let me know your opinion. Thank you.

[1] https://github.com/apache/flink/pull/26902

Best,
Yuepeng Pan

On 2025/08/14 16:07:10 Ryan van Huuksloot wrote:

Thanks, sounds good.

Are you planning on having a thread to check for TTL? Or what is the plan for TTL? The quantity based would have a check when a new job is archived?

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform

On Thu, Aug 14, 2025 at 12:04 PM Yuepeng Pan <panyuep...@apache.org> wrote:

Hi, Ryan van Huuksloot.

Thank you very much for your reply.

> Question: Is the History Server then going to delete the files stored?
> (i.e. we use GCS, would it delete the files there as well?)
> Or is this strictly what is shown in the UI?

Yes, the feature introduced in this FLIP is a superset of the original feature controlled by 'historyserver.archive.retained-jobs'. So if I understand correctly, once the new feature is introduced, it will affect the retention period of job history files in remote distributed storage as well, not only what is shown in the UI.

Best,
Yuepeng Pan

At 2025-08-14 23:34:54, "Ryan van Huuksloot" <ryan.vanhuuksl...@shopify.com.INVALID> wrote:

I took a look. Overall it would be nice to have more ways to configure the History Server.

Question: Is the History Server then going to delete the files stored? (i.e. we use GCS, would it delete the files there as well?) Or is this strictly what is shown in the UI?
Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform

On Thu, Aug 14, 2025 at 11:17 AM Yuepeng Pan <panyuep...@apache.org> wrote:

Bumping this thread. Thanks!

Best,
Yuepeng Pan

On 2025/08/11 03:49:27 Yuepeng Pan wrote:

Hi community,

Currently, HistoryServer supports only a quantity-based job archive retention policy [1]. This is insufficient for scenarios such as:

- Time-based retention (e.g., last X days).
- Combined rules (e.g., within 7 days AND ≤100 jobs).

To address these limitations, I’d like to start a discussion on FLIP-490 [2], which proposes a more flexible job archive retention mechanism that supports time-based, quantity-based, and composite strategies (with AND/OR logic).

Looking forward to your feedback.

Best,
Yuepeng Pan

[1] https://github.com/apache/flink/blob/cae5fb4d3b6d9e0c10c3539ea4994fc1ad463b70/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L241
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857
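
The retention rule that Becket proposes in the most recent reply of this thread (remove a job archive when its TTL has passed, or when the retained job count has been reached and the archive is the earliest one) could be sketched roughly as follows. This is only an illustrative Python sketch, not the FLIP's or the POC's implementation: `Archive`, `select_for_deletion`, `ttl_seconds`, and `max_retained` are hypothetical names, and treating the file modification time as the archive's age is an assumption based on the modification-time wording used in the thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Archive:
    """Hypothetical stand-in for one archived job file."""
    job_id: str
    modified_at: float  # seconds since epoch (assumed to define the archive's age)


def select_for_deletion(
    archives: List[Archive],
    now: float,
    ttl_seconds: Optional[float] = None,
    max_retained: Optional[int] = None,
) -> List[Archive]:
    """Return the archives to delete, newest first.

    An archive is selected when its TTL has passed, or when it falls
    outside the newest `max_retained` archives. A limit set to None is
    treated as disabled.
    """
    # Sort newest first so that any index >= max_retained is surplus.
    ordered = sorted(archives, key=lambda a: a.modified_at, reverse=True)
    to_delete = []
    for index, archive in enumerate(ordered):
        expired = ttl_seconds is not None and now - archive.modified_at > ttl_seconds
        surplus = max_retained is not None and index >= max_retained
        if expired or surplus:
            to_delete.append(archive)
    return to_delete
```

In this sketch, leaving both limits unset selects nothing, and setting only `max_retained` behaves like a quantity-only policy. A periodic caller (for example, one driven by a refresh interval) would pass the current time as `now` on each run.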