Re: [DISCUSS] FLIP-490: Enhanced Job History Retention Policies for HistoryServer

Roc Marshal Sun, 24 Aug 2025 18:48:10 -0700

Hi, Becket.

Thank you very much for your reply. That works for me. I have also updated
the wiki based on the comments.

Looking forward to your next review.

Best,

Yuepeng Pan.

I think it would be better to keep the configurations straightforward
instead of conditional, if possible.
How about just adding a single TTL config? We would remove a job archive
either when its TTL has passed, or when the retained job count has been
reached and the job is the earliest one.
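
The single-TTL rule described above could be sketched roughly as follows.
This is only an illustration under stated assumptions, not the FLIP's
implementation: the class TtlAndCountCleanupSketch, the method
indexesToDelete, and all of its parameters are hypothetical names, and each
archive is reduced to its modification timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Flink. */
final class TtlAndCountCleanupSketch {

    /**
     * Returns the indexes of the archives to delete.
     *
     * @param modifiedMillisOldestFirst modification times in epoch millis, sorted oldest first
     * @param nowMillis current time in epoch millis
     * @param ttlMillis assumed TTL setting
     * @param maxRetainedJobs assumed retained-job-count setting
     */
    static List<Integer> indexesToDelete(
            long[] modifiedMillisOldestFirst, long nowMillis, long ttlMillis, int maxRetainedJobs) {
        List<Integer> toDelete = new ArrayList<>();
        int total = modifiedMillisOldestFirst.length;
        for (int i = 0; i < total; i++) {
            // Rule 1: the archive has outlived its TTL.
            boolean ttlPassed = nowMillis - modifiedMillisOldestFirst[i] > ttlMillis;
            // Rule 2: the count limit is exceeded and this archive is among the earliest.
            boolean overCount = i < total - maxRetainedJobs;
            if (ttlPassed || overCount) {
                toDelete.add(i);
            }
        }
        return toDelete;
    }
}
```

For example, with five archives, a count limit of 3, and a TTL long enough
that nothing has expired, the call would return the indexes of the two
oldest archives.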

Thanks,

Jiangjie (Becket) Qin

On Fri, Aug 22, 2025 at 2:55 AM Yuepeng Pan <[email protected]> wrote:

Would anyone like to discuss this FLIP?
I'd appreciate your feedback and suggestions.

Best,
Yuepeng Pan

On 2025/08/20 07:13:44 Yuepeng Pan wrote:
Hi, Becket.

Thank you for the clarification.
Let me try to revisit these two questions with an explanation:

I meant to ask what is the use case for ttlOrQuantity mode? Is it
sufficient to delete the job archive when either TTL or quantity is
reached if both are set?

As the configuration key 'historyserver.archive.retained-jobs.mode'
literally suggests,
this policy governs the retention mode for archived historical jobs.
When set to 'ttlOrQuantity', a target file will be retained if either of
the following conditions is met (in other words, deletion occurs only if
both conditions are unsatisfied):

- The file count is within the maximum retention threshold.
- The file remains within the TTL (Time to Live) period.
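
The 'ttlOrQuantity' rule described above can be sketched as a small
predicate. This is a minimal sketch under stated assumptions: the class
TtlOrQuantitySketch, the method shouldRetain, and its parameters are
hypothetical names chosen for illustration and do not come from the FLIP
or from Flink.

```java
/** Hypothetical helper for illustration only; not part of Flink. */
final class TtlOrQuantitySketch {

    /**
     * Decides whether one archive file is kept in 'ttlOrQuantity' mode.
     *
     * @param rankNewestFirst 0 for the newest archive, 1 for the next one, and so on
     * @param maxRetainedJobs assumed maximum number of retained archives
     * @param ageMillis time since the file was last modified
     * @param ttlMillis assumed TTL setting
     */
    static boolean shouldRetain(
            int rankNewestFirst, int maxRetainedJobs, long ageMillis, long ttlMillis) {
        boolean withinQuantity = rankNewestFirst < maxRetainedJobs;
        boolean withinTtl = ageMillis <= ttlMillis;
        // Retained if either condition holds; deleted only when both fail.
        return withinQuantity || withinTtl;
    }
}
```

Under this reading, an archive older than the TTL is still kept while it
is among the newest maxRetainedJobs files, and an archive beyond the count
limit is still kept while it is within the TTL.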

Regarding the case when there are multiple history server instances, if
we don't enforce a behavior, users can go with either a) or b), and it
would just be up to the user to choose. We need to document the behavior
properly.

Thanks for the comment. I have added the related content as a note [1] on
the new configuration 'historyserver.archive.retained-jobs.mode'.
In the subsequent implementation phase, this part of the description
will be refined and added to the corresponding configuration documentation.

Best,
Yuepeng Pan.

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-PublicInterfaces

On 2025/08/20 04:55:24 Becket Qin wrote:
Hi Yuepeng,

Sorry for the confusion. I meant to ask what is the use case for
ttlOrQuantity mode? Is it sufficient to delete the job archive when
either TTL or quantity is reached if both are set?

Regarding the case when there are multiple history server instances, if
we don't enforce a behavior, users can go with either a) or b), and it
would just be up to the user to choose. We need to document the behavior
properly.

Thanks,

Jiangjie (Becket) Qin

On Mon, Aug 18, 2025 at 10:28 PM Yuepeng Pan <[email protected]> wrote:

Hi, Becket.

Thank you for your comments.

1. What is the use case for ttlAndQuantity mode? It seems usually the
desired behavior is ttlOrQuantity. If so, we can just add a ttl
retention config.

The ttlAndQuantity mode means that files in the remote directory are
retained only if their modification time is within the valid TTL and
the total number of files does not exceed the maximum limit.

One of the main purposes of this configuration item is to impose
restrictions on the following situations:

- Within the TTL, the number of files grows too large, leading to
excessive storage usage or too many files.
- Files remain within the file quantity threshold, but their
modification times far exceed the TTL.

2. When there are multiple history server instances with different
configurations, they are working independently today and may have
conflicting configs. This is an existing problem, but since we are adding
more configs to the retention policy, it increases the chance of config
conflicts. It would be good to have a clear user story for when there
are multiple history server instances.

This is indeed a good question.

What do you think about adding a description like the following to the
newly introduced configuration item section in the FLIP?

a. If multiple HistoryServer instances use the same
historyserver.archive.fs.dir directory as the refresh directory, you
should enable and configure this feature in only one HistoryServer
instance to avoid errors caused by multiple instances cleaning up
remote files simultaneously.

-OR-

b. If multiple HistoryServer instances use the same
historyserver.archive.fs.dir directory as the refresh directory, you
need to keep the value of this configuration consistent across them.

Regardless of whether option a or option b is chosen, it is necessary to
enhance the corresponding exception handling when reading from and
deleting remote files.

I’m really looking forward to hearing other suitable proposals for the
above items.

Please let me know your opinion.

Best,
Yuepeng Pan

At 2025-08-19 00:37:26, "Becket Qin" <[email protected]> wrote:
Thanks for the proposal, Yuepeng.

I think this FLIP is mostly orthogonal to FLIP-505. This FLIP essentially
tries to improve the retention policy of the actual archives, while
FLIP-505 mainly focuses on caching. One connection between the two FLIPs
might be that when the actual archive expires and gets removed, it might
make sense to also remove the local cache.

A few questions about this FLIP:

1. What is the use case for ttlAndQuantity mode? It seems usually the
desired behavior is ttlOrQuantity. If so, we can just add a ttl
retention config.
2. When there are multiple history server instances with different
configurations, they are working independently today and may have
conflicting configs. This is an existing problem, but since we are adding
more configs to the retention policy, it increases the chance of config
conflicts. It would be good to have a clear user story for when there
are multiple history server instances.

Thanks,

Jiangjie (Becket) Qin

On Thu, Aug 14, 2025 at 1:56 PM Allison <[email protected]> wrote:

Hi Yuepeng,

Looks like this work can have some symbiosis with the change that I've
proposed here in FLIP-505. This addresses the question that Ryan asked
about whether or not remotely stored job archives will be impacted if
the retention is changed. Feel free to take a look at the FLIP as well
as the PR for FLIP-505. Looks like we have the opportunity to
significantly improve the History server with these two changes.

FLIP-505: https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch
PR: https://github.com/apache/flink/pull/26878

Best,
Allison

On Thu, Aug 14, 2025 at 9:51 AM Yuepeng Pan <[email protected]> wrote:

Hi, Ryan van Huuksloot.

Might be worth stating that explicitly in the FLIP.

Nice idea. I have added a sub-section here [1] to clarify this item.

Thanks a lot!

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857#FLIP490:EnhancedJobHistoryRetentionPoliciesforHistoryServer-Thetimingtocheckwhethertargetfileshaveexceededtheretentionthresholds

Best,
Yuepeng Pan

On 2025/08/14 16:27:39 Ryan van Huuksloot wrote:
That sounds like a good option.

Might be worth stating that explicitly in the FLIP.

No other questions from me - will be a nice extension!

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform

On Thu, Aug 14, 2025 at 12:22 PM Yuepeng Pan <[email protected]> wrote:

Hi, Ryan van Huuksloot.

Are you planning on having a thread to check for TTL? Or what is the
plan for TTL?
The quantity based would have a check when a new job is archived?

As in the POC [1], we could keep following the existing process, in
which the HistoryServer#start method periodically invokes
HistoryServerArchiveFetcher#fetchArchives based on
'historyserver.archive.fs.refresh-interval', and use it to check
whether target files should be retained. What do you think?
Of course, I'm very open to hearing about other, potentially better
implementation approaches.
Please let me know your opinion.
Thank you.

[1] https://github.com/apache/flink/pull/26902
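
The periodic check described above could look roughly like the following.
This is a simplified sketch under stated assumptions, not the POC's or
Flink's actual code: the class PeriodicRetentionCheckSketch and the
Runnable it wraps are hypothetical stand-ins for the fetch-and-check step,
and the interval parameter stands in for the value of
'historyserver.archive.fs.refresh-interval'.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch for illustration only; not Flink's HistoryServer code. */
final class PeriodicRetentionCheckSketch {

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable fetchAndCheckRetention;

    PeriodicRetentionCheckSketch(Runnable fetchAndCheckRetention) {
        this.fetchAndCheckRetention = fetchAndCheckRetention;
    }

    /** Runs the task immediately, then again after each refresh interval. */
    void start(long refreshIntervalMillis) {
        executor.scheduleWithFixedDelay(
                () -> {
                    try {
                        fetchAndCheckRetention.run();
                    } catch (RuntimeException e) {
                        // An uncaught exception would cancel all later runs, so log and continue.
                        System.err.println("Retention check failed: " + e);
                    }
                },
                0L,
                refreshIntervalMillis,
                TimeUnit.MILLISECONDS);
    }

    /** Stops scheduling further runs. */
    void stop() {
        executor.shutdownNow();
    }
}
```

With this shape, both the TTL check and the quantity check run on the same
schedule as the archive refresh, so no additional thread is needed.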

Best,
Yuepeng Pan

On 2025/08/14 16:07:10 Ryan van Huuksloot wrote:
Thanks, sounds good.

Are you planning on having a thread to check for TTL? Or what is the
plan for TTL?
The quantity based would have a check when a new job is archived?

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform

On Thu, Aug 14, 2025 at 12:04 PM Yuepeng Pan <[email protected]> wrote:

Hi, Ryan van Huuksloot.

Thank you very much for your reply.

> Question: Is the History Server then going to delete the files stored?
> (i.e. we use GCS, would it delete the files there as well?)
> Or is this strictly what is shown in the UI?

Yes, the feature introduced in this FLIP is a superset of the original
feature that is controlled by 'historyserver.archive.retained-jobs'.

So if I understand correctly, once the new feature is introduced, it
would also affect the retention of job history files in remote
distributed storage, not only what is shown in the UI.

Best,
Yuepeng Pan

At 2025-08-14 23:34:54, "Ryan van Huuksloot" <[email protected]> wrote:
I took a look. Overall it would be nice to have more ways to configure
the History Server.

Question: Is the History Server then going to delete the files stored?
(i.e. we use GCS, would it delete the files there as well?)
Or is this strictly what is shown in the UI?

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform

On Thu, Aug 14, 2025 at 11:17 AM Yuepeng Pan <[email protected]> wrote:

Bumping this thread. Thanks!

Best,
Yuepeng Pan

On 2025/08/11 03:49:27 Yuepeng Pan wrote:
Hi community,

Currently, HistoryServer supports only a quantity-based job archive
retention policy [1].
This is insufficient for scenarios such as:
- Time-based retention (e.g., last X days).
- Combined rules (e.g., within 7 days AND ≤100 jobs).

To address these limitations, I’d like to start a discussion on
FLIP-490 [2], which proposes a more flexible job archive retention
mechanism that supports time-based, quantity-based, and composite
strategies (with AND/OR logic).

Looking forward to your feedback.

Best,
Yuepeng Pan

[1] https://github.com/apache/flink/blob/cae5fb4d3b6d9e0c10c3539ea4994fc1ad463b70/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/history/HistoryServer.java#L241
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=332499857
