Would anyone like to discuss this FLIP? I'd appreciate your feedback and suggestions.
Best,
Yuepeng Pan

On 2025/08/20 07:13:44 Yuepeng Pan wrote:
> Hi, Becket.
>
> Thank you for the clarification.
> Please let me try revisiting these two questions with an explanation:
>
> > I meant to ask what is the use case for
> > ttlOrQuantity mode? Is it sufficient to delete the job archive when either
> > TTL or quantity is reached if both are set?
>
> As the configuration key 'historyserver.archive.retained-jobs.mode' literally suggests,
> this policy governs the retention mode for archived historical jobs.
> When set to 'ttlOrQuantity', a target file will be retained if either of the
> following conditions is met (in other words, deletion occurs only if both
> conditions are unsatisfied):
>
> - The file count is within the maximum retention threshold.
> - The file remains within the TTL (Time to Live) period.
>
> >Regarding the case when there are multiple history server instances, if we
> >don't enforce a behavior, users can go with either a) and b), and it would
> >just be up to the user to choose. We need to document the behavior properly.
>
> Thanks for the comment. And I added the related content as note/comment[1] of
> the new configuration 'historyserver.archive.retained-jobs.mode'.
> In the subsequent implementation phase, this part of the description will be
> refined and added to the corresponding configuration documentation.
>
> Best,
> Yuepeng Pan.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-PublicInterfaces
>
> On 2025/08/20 04:55:24 Becket Qin wrote:
> > Hi Yuepeng,
> >
> > Sorry for the confusion. I meant to ask what is the use case for
> > ttlOrQuantity mode? Is it sufficient to delete the job archive when either
> > TTL or quantity is reached if both are set?
> >
> > Regarding the case when there are multiple history server instances, if we
> > don't enforce a behavior, users can go with either a) and b), and it would
> > just be up to the user to choose. We need to document the behavior properly.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Aug 18, 2025 at 10:28 PM Yuepeng Pan <panyuep...@apache.org> wrote:
> >
> > > Hi, Becket.
> > >
> > > Thank you for your comments.
> > >
> > > > 1. What is the use case for ttlAndQuantity mode? It seems usually the
> > > > desired behavior is ttlOrQuantity. If so, we can just add a ttl
> > > > retention config.
> > >
> > > The ttlAndQuantity mode means that files in the remote directory can only
> > > be retained if their modification time is within the valid TTL
> > > and the total number of files does not exceed the maximum limit.
> > >
> > > One of the main purposes of this configuration item is to impose
> > > restrictions on the following situations:
> > >
> > > - Within the TTL, the number of files grows too large, leading to
> > > excessive storage usage or too many files.
> > >
> > > - Files remain within the file quantity threshold, but their modification
> > > times far exceed the TTL.
> > >
> > > > 2. When there are multiple history server instances with different
> > > > configurations, they are working independently today and may have
> > > > conflict configs. This is an existing problem, but since we are adding
> > > > more configs to the retention policy, it increases the chance of config
> > > > conflicts. It would be good to have a clear user story for when there
> > > > are multiple history server instances.
> > >
> > > This is indeed a good question.
> > >
> > > What do you think if we add a description like the following to the newly
> > > introduced configuration item section in the FLIP?
> > >
> > > a. If there are multiple HistoryServer instances using the same
> > > historyserver.archive.fs.dir directory as the refresh directory,
> > > you should enable and configure this feature in only one HistoryServer
> > > instance to avoid errors caused by multiple instances simultaneously
> > > cleaning up remote files.
> > >
> > > -OR-
> > >
> > > b. If there are multiple HistoryServer instances using the same
> > > historyserver.archive.fs.dir directory as the refresh directory,
> > > you need to keep the value of this configuration consistent across them.
> > >
> > > Regardless of whether option a or option b is chosen, it is necessary to
> > > enhance the corresponding exception handling when reading from and
> > > deleting remote files.
> > >
> > > I’m really looking forward to hearing other suitable resolution candidates
> > > about the above items.
> > >
> > > Please let me know your opinion.
> > >
> > > Best,
> > > Yuepeng Pan
> > >
> > > At 2025-08-19 00:37:26, "Becket Qin" <becket....@gmail.com> wrote:
> > > >Thanks for the proposal, Yuepeng.
> > > >
> > > >I think this FLIP is mostly orthogonal to FLIP-505. This FLIP essentially
> > > >tries to improve the retention policy of the actual archives, while
> > > >FLIP-505 mainly focuses on caching. One connection between the two FLIPs
> > > >might be when the actual archive expires and gets removed, it might make
> > > >sense to also remove the local cache.
> > > >
> > > >A few questions about this FLIP:
> > > >
> > > >1. What is the use case for ttlAndQuantity mode? It seems usually the
> > > >desired behavior is ttlOrQuantity. If so, we can just add a ttl retention
> > > >config.
> > > >2. When there are multiple history server instances with different
> > > >configurations, they are working independently today and may have
> > > >conflict configs. This is an existing problem, but since we are adding
> > > >more configs to the retention policy, it increases the chance of config
> > > >conflicts. It would be good to have a clear user story for when there are
> > > >multiple history server instances.
> > > >
> > > >Thanks,
> > > >
> > > >Jiangjie (Becket) Qin
> > > >
> > > >On Thu, Aug 14, 2025 at 1:56 PM Allison <achang5...@gmail.com> wrote:
> > > >
> > > >> Hi Yuepeng,
> > > >>
> > > >> Looks like this work can have some symbiosis with the change that I've
> > > >> proposed here in FLIP-505. This addresses the question that Ryan asked
> > > >> about whether or not remotely stored job archives will be impacted if
> > > >> the retention is changed. Feel free to take a look at the FLIP as well
> > > >> as the PR for FLIP-505. Looks like we have the opportunity to
> > > >> significantly improve the History server with these two changes.
> > > >>
> > > >> FLIP-505:
> > > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
> > > >> PR: https://github.com/apache/flink/pull/26878
> > > >>
> > > >> Best,
> > > >> Allison
> > > >>
> > > >> On Thu, Aug 14, 2025 at 9:51 AM Yuepeng Pan <panyuep...@apache.org> wrote:
> > > >>
> > > >> > Hi, Ryan van Huuksloot.
> > > >> >
> > > >> > > Might be worth stating that explicitly in the FLIP.
> > > >> > Nice idea~ The sub-section added here[1] to clarify the item.
> > > >> >
> > > >> > Thanks a lot!
> > > >> >
> > > >> > [1]
> > > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-Thetimingtocheckwhethertargetfileshaveexceededtheretentionthresholds
> > > >> >
> > > >> > Best,
> > > >> > Yuepeng Pan
> > > >> >
> > > >> > On 2025/08/14 16:27:39 Ryan van Huuksloot wrote:
> > > >> > > That sounds like a good option.
> > > >> > >
> > > >> > > Might be worth stating that explicitly in the FLIP.
> > > >> > >
> > > >> > > No other questions from me - will be a nice extension!
> > > >> > >
> > > >> > > Ryan van Huuksloot
> > > >> > > Staff Engineer, Infrastructure | Streaming Platform
> > > >> > >
> > > >> > > On Thu, Aug 14, 2025 at 12:22 PM Yuepeng Pan <panyuep...@apache.org> wrote:
> > > >> > >
> > > >> > > > Hi, Ryan van Huuksloot.
> > > >> > > >
> > > >> > > > >Are you planning on having a thread to check for TTL? Or what is the
> > > >> > > > >plan for TTL?
> > > >> > > > >The quantity based would have a check when a new job is archived?
> > > >> > > >
> > > >> > > > Just like the implementation in the POC[1], what do you think about
> > > >> > > > continuing to follow the process where the HistoryServer#start method
> > > >> > > > periodically invokes HistoryServerArchiveFetcher#fetchArchives,
> > > >> > > > based on 'historyserver.archive.fs.refresh-interval', to check
> > > >> > > > whether target files should be retained?
> > > >> > > > Of course, I'm very open to hearing about other potentially better
> > > >> > > > implementation approaches.
> > > >> > > > Please let me know your opinion.
> > > >> > > > Thank you.
> > > >> > > >
> > > >> > > > [1] https://github.com/apache/flink/pull/26902
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Yuepeng Pan
> > > >> > > >
> > > >> > > > On 2025/08/14 16:07:10 Ryan van Huuksloot wrote:
> > > >> > > > > Thanks, sounds good.
> > > >> > > > >
> > > >> > > > > Are you planning on having a thread to check for TTL? Or what is the
> > > >> > > > > plan for TTL?
> > > >> > > > > The quantity based would have a check when a new job is archived?
> > > >> > > > >
> > > >> > > > > Ryan van Huuksloot
> > > >> > > > > Staff Engineer, Infrastructure | Streaming Platform
> > > >> > > > >
> > > >> > > > > On Thu, Aug 14, 2025 at 12:04 PM Yuepeng Pan <panyuep...@apache.org> wrote:
> > > >> > > > >
> > > >> > > > > > Hi, Ryan van Huuksloot.
> > > >> > > > > >
> > > >> > > > > > Thank you very much for your reply.
> > > >> > > > > >
> > > >> > > > > > > Question: Is the History Server then going to delete the files stored?
> > > >> > > > > > > (i.e. we use GCS, would it delete the files there as well?)
> > > >> > > > > > > Or is this strictly what is shown in the UI?
> > > >> > > > > >
> > > >> > > > > > Yes, this feature introduced in the FLIP is a super-set of the original
> > > >> > > > > > feature that is controlled by 'historyserver.archive.retained-jobs'.
> > > >> > > > > >
> > > >> > > > > > So if I understand correctly, after the new feature is introduced, it
> > > >> > > > > > would affect the retention period of remote distributed storage jobs
> > > >> > > > > > history files as well, not only for what is shown in the UI.
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Yuepeng Pan
> > > >> > > > > >
> > > >> > > > > > At 2025-08-14 23:34:54, "Ryan van Huuksloot"
> > > >> > > > > > <ryan.vanhuuksl...@shopify.com.INVALID> wrote:
> > > >> > > > > > >I took a look. Overall it would be nice to have more ways to
> > > >> > > > > > >configure the History Server.
> > > >> > > > > > >
> > > >> > > > > > >Question: Is the History Server then going to delete the files stored?
> > > >> > > > > > >(i.e. we use GCS, would it delete the files there as well?)
> > > >> > > > > > >Or is this strictly what is shown in the UI?
> > > >> > > > > > >
> > > >> > > > > > >Ryan van Huuksloot
> > > >> > > > > > >Staff Engineer, Infrastructure | Streaming Platform
> > > >> > > > > > >
> > > >> > > > > > >On Thu, Aug 14, 2025 at 11:17 AM Yuepeng Pan <panyuep...@apache.org> wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Bumping this thread. Thanks!
> > > >> > > > > > >>
> > > >> > > > > > >> Best,
> > > >> > > > > > >> Yuepeng Pan
> > > >> > > > > > >>
> > > >> > > > > > >> On 2025/08/11 03:49:27 Yuepeng Pan wrote:
> > > >> > > > > > >> > Hi community,
> > > >> > > > > > >> >
> > > >> > > > > > >> > Currently, HistoryServer supports only a quantity-based job archive
> > > >> > > > > > >> > retention policy [1].
> > > >> > > > > > >> > This is insufficient for scenarios such as:
> > > >> > > > > > >> > - Time-based retention (e.g., last X days).
> > > >> > > > > > >> > - Combined rules (e.g., within 7 days AND ≤100 jobs).
> > > >> > > > > > >> >
> > > >> > > > > > >> > To address these limitations, I’d like to start a discussion on
> > > >> > > > > > >> > FLIP-490 [2],
> > > >> > > > > > >> > which proposes a more flexible job archive retention mechanism that
> > > >> > > > > > >> > supports time-based, quantity-based, and composite strategies (with
> > > >> > > > > > >> > AND/OR logic).
> > > >> > > > > > >> >
> > > >> > > > > > >> > Looking forward to your feedback.
> > > >> > > > > > >> >
> > > >> > > > > > >> > Best,
> > > >> > > > > > >> > Yuepeng Pan
> > > >> > > > > > >> >
> > > >> > > > > > >> > [1]
> > > >> > > > > > >> > https://github.com/apache/flink/blob/cae5fb4d3b6d9e0c10c3539ea4994fc1ad463b70/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L241
> > > >> > > > > > >> > [2]
> > > >> > > > > > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857