Thanks for the FLIP, Allison. The proposal makes a lot of sense in general. The history server is critical to the Flink batch.
A few suggestions:

1. It might make sense to keep the existing config
   *historyserver.archive.retained-jobs*. It would only be used to determine
   the total number of jobs to keep in the remote storage.
2. The new configuration *historyserver.archive.cached-retained-jobs* only
   determines the number of jobs cached locally. The default value is -1,
   and the valid range is *[1, historyserver.archive.retained-jobs]*. When
   not set, it caches everything, which is the current behavior. When set,
   the history server is in the "partial caching mode" rather than the
   "full mirror mode".
3. Maybe we don't need the *historyserver.archive.remote-fetch-enabled*
   config. This config is a little confusing because the job history is
   fetched remotely even now. The difference is whether we fetch everything
   as a whole or fetch individual jobs on demand. But this is an internal
   implementation detail and does not need to be exposed to end users.
4. The config *historyserver.archive.num-cached-most-recently-viewed-jobs*
   always takes effect, regardless of whether the history server is running
   in "full mirror mode" or "partial caching mode".

So with the above settings:

1. By default, users get the same behavior as today.
2. When users set *historyserver.archive.cached-retained-jobs*, the history
   server enters the partial caching mode and fetches the jobs on demand.
3. Some of the most recently viewed jobs are automatically pinned in the
   cache so they will not be evicted accidentally and cause cache thrashing.

BTW, it would be good to add a future work part to give a heads-up about the
plan to use RocksDB for job history rather than raw files.

Thanks,

Jiangjie (Becket) Qin

On Mon, May 12, 2025 at 11:02 AM Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:

> Regarding decoupling the two features, would your suggestion be to separate
> them into two separate FLIPs?
>
> Sorry for the late response.
>
> Yes, that is correct.
> If these 2 features are somewhat coupled with each other, then it makes
> sense to address them in the same FLIP; otherwise I think it will be better
> to tackle them as 2 different FLIPs.
>
> Regards
> Venkata krishnan
>
>
> On Mon, Mar 3, 2025 at 1:42 PM Allison <achang5...@gmail.com> wrote:
>
> > Hi Yanquan,
> >
> > I've updated the FLIP to contain the default values, thanks for your help!
> >
> > Sincerely
> > - Allison
> >
> > On Thu, Jan 30, 2025 at 3:21 AM Yanquan Lv <decq12y...@gmail.com> wrote:
> >
> > > Thank you for your explanation. My previous questions have basically
> > > been resolved.
> > >
> > > Regarding the second point, I would like to suggest clarifying the
> > > default values for the newly added parameters in the `Public Interfaces`
> > > section.
> > >
> > > ---------- Forwarded message ---------
> > > From: Allison <achang5...@gmail.com>
> > > Date: Thu, Jan 30, 2025 at 3:42 AM
> > > Subject: Re: [DISCUSS] FLIP-505: Flink History Server Scability
> > > Improvements, Remote Data Store Fetch and Per Job Fetch
> > > To: <dev@flink.apache.org>
> > >
> > >
> > > Hi Yanquan,
> > >
> > > Thanks for taking a look at this. Re: your questions:
> > >
> > > 1. Yes, I've updated the FLIP to be more clear, but it involves modifying
> > > the existing configuration of historyserver.archive.retained-jobs to
> > > historyserver.archive.cached-retained-jobs. The number of remote jobs
> > > stored can be infinite; the thought behind this is that the remote data
> > > storage can be cleaned up or limited by a separate protocol that can be
> > > customized to each individual use case.
> > > 2. Could you clarify this a bit? I'm not sure I understand this part. Do
> > > you mean adding to the FLIP what the configurations would be set to when
> > > they are not defined?
> > > 3. historyserver.archive.fs.refresh-interval is the time between calls
> > > to the remote data storage to find fresh data.
> > > What it configures is how often the FHS polls the remote data store for
> > > new files. The remote data store is written to whenever a job is
> > > finished.
> > >
> > > Hope this clarifies some things.
> > >
> > > Best,
> > > - Allison
> > >
> > >
> > > On Mon, Jan 27, 2025 at 7:10 PM Yanquan Lv <decq12y...@gmail.com> wrote:
> > >
> > > > Hi, Allison. Thanks for driving this FLIP.
> > > > I have some questions to confirm:
> > > >
> > > > 1. I can't find any existing configuration named
> > > > `historyserver.archive.cached-retained-jobs`. I guess that what you
> > > > mean is modifying the existing configuration from
> > > > `historyserver.archive.retained-jobs` to
> > > > `historyserver.archive.cached-retained-jobs`. If so, and we only limit
> > > > the number of retained jobs stored locally, is the number of retained
> > > > jobs stored remotely infinite?
> > > > 2. I think it would be better to provide instructions for adding
> > > > default values to HistoryServerOptions.
> > > > 3. Does `historyserver.archive.fs.refresh-interval` apply to both
> > > > local and remote storage simultaneously?
> > > >
> > > > Best,
> > > > Yanquan
> > > >
> > > > On Fri, Jan 17, 2025 at 8:07 AM Allison <achang5...@gmail.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I would like to initiate a discussion for the FLIP below, which
> > > > > enhances the Flink History Server to allow greater scalability of
> > > > > the service.
> > > > >
> > > > > Motivation:
> > > > >
> > > > > Currently, the Flink History Server (FHS) is limited in the number
> > > > > of job archives it can serve based on the storage capacity of the
> > > > > node that the FHS runs in. Job archives are stored locally in a
> > > > > cache, which creates a local directory that is expanded out based
> > > > > on the contents of a single json archive file.
> > > > > This not only uses up local memory space; because of how the FHS
> > > > > expands the job archives into a nested directory structure, inode
> > > > > space also often runs out for jobs with a large number of
> > > > > taskmanagers or subtasks. In order to make the FHS more performant,
> > > > > we would like to decouple the job archive storage for the FHS from
> > > > > the local cache, so that job archives can be stored in and fetched
> > > > > from a remote file store.
> > > > >
> > > > > FLIP proposal document:
> > > > >
> > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP*505*3A*Flink*History*Server*Scability*Improvements*2C*Remote*Data*Store*Fetch*and*Per*Job*Fetch__;KyUrKysrKyUrKysrKysrKw!!IKRxdwAv5BmarQ!cy7YUT3RVhkz3ixGuldCgf5lTCb3IMzUuAUClyB3qRuI0vAjYfvNVmw2NOggm06YnRGkmQ-3hMpOp0Ot7yRPK54$
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Best,
> > > > > - Allison Chang
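
For illustration, the configuration semantics suggested at the top of this thread could look roughly like the fragment below in a Flink configuration file. This is only a sketch under assumptions: historyserver.archive.retained-jobs is an existing option, while historyserver.archive.cached-retained-jobs and historyserver.archive.num-cached-most-recently-viewed-jobs are names proposed in this discussion and may change, and all numeric values are made up for the example.

```yaml
# Sketch only: a possible "partial caching mode" setup as discussed in this thread.
# All values are illustrative assumptions, not recommended defaults.

# Existing option; under the suggestion above it would only bound the total
# number of jobs kept in the remote storage.
historyserver.archive.retained-jobs: 10000

# Proposed option: number of jobs cached locally.
# Suggested valid range is [1, historyserver.archive.retained-jobs].
# Leaving it unset (suggested default -1) would keep today's behavior of
# caching everything ("full mirror mode").
historyserver.archive.cached-retained-jobs: 500

# Proposed option: number of most recently viewed jobs pinned in the local
# cache so they are not evicted and re-fetched repeatedly.
historyserver.archive.num-cached-most-recently-viewed-jobs: 20
```

With only the first key set, the history server would behave as it does today; setting the second key is what would switch it to fetching individual jobs on demand.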