Hi everyone,

I would like to start a discussion on the FLIP below, which enhances the Flink History Server to allow greater scalability of the service.

Motivation: Currently, the number of job archives the Flink History Server (FHS) can serve is limited by the storage capacity of the node the FHS runs on. Job archives are stored in a local cache, where each single JSON archive file is expanded into a local directory. This not only uses up local storage space: because the FHS expands each job archive into a nested directory structure, jobs with a large number of TaskManagers or subtasks often exhaust the available inodes as well.

To make the FHS more scalable, we would like to decouple job archive storage from the local cache, so that the FHS can store and fetch job archives from a remote file store.

FLIP proposal document: https://cwiki.apache.org/confluence/display/FLINK/FLIP+505%3A+Flink+History+Server+Scability+Improvements%2C+Remote+Data+Store+Fetch+and+Per+Job+Fetch

Thanks!

Best,
Allison Chang