[DISCUSS] Removal of periodic state dumps

Craig Condit Mon, 19 Dec 2022 16:18:27 -0800

All,

I’d like to open a discussion about the future of the periodic state dump 
feature. To jumpstart the discussion, I opened 
https://issues.apache.org/jira/browse/YUNIKORN-1483, which is copied below for 
context. In the process of writing this up, it seems to me that we might 
actually be better off simply removing the feature, and relying solely on the 
REST API to retrieve state dumps on demand.


In the current state, periodic state dumps need to be enabled, at which point 
they write to a local filesystem within the YuniKorn scheduler. This maps onto 
ephemeral storage, so to avoid out-of-space scenarios, an administrator needs 
to customize the YK Helm deployment with additional resource quota. 
Additionally, to even access the dumps, the filesystem needs to be mounted as a 
persistent volume and external code written to interact with the saved dumps. 
Given the mixed text-and-json format of these dumps, this can be rather 
complicated.

Alternatively, users could simply deploy a cron container which pulls the state 
dump on-demand from the existing REST API. This ends up being considerably 
simpler.
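
As a rough sketch of what such a cron job could run, the script below pulls one state dump over HTTP and saves it as a timestamped JSON file. The endpoint path /ws/v1/fullstatedump, the function name fetch_state_dump, and the output file naming are assumptions made for illustration; check the REST API documentation for your YuniKorn release before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch_state_dump(base_url: str, out_dir: str, timeout: float = 30.0) -> Path:
    """Fetch one state dump over REST and save it as a timestamped JSON file."""
    # Assumed endpoint path; verify it against your YuniKorn release.
    url = base_url.rstrip("/") + "/ws/v1/fullstatedump"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()

    # Fail fast if the response is not valid JSON.
    json.loads(body)

    # Hypothetical file naming, chosen so that names sort chronologically.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    target = Path(out_dir) / f"yunikorn-state-dump-{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target
```

A cron container could call something like fetch_state_dump("http://yunikorn-service:9080", "/dumps") on whatever schedule the site needs; the service name, port, and target directory here are placeholders, not values taken from this thread.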

Are there objections to removing the existing periodic state dump
functionality? Are there existing users who would be greatly impacted? To be
clear, I’m not proposing removing the state dump itself; the version available
via the REST API has proven extremely valuable. All that is on the table is
removal of the automatic, periodic state dump which writes to local files.

Looking forward to feedback,

Craig



------------------------------------
YUNIKORN-1483 write-up:

The current support for generating periodic state dumps implemented in
YUNIKORN-940 <https://issues.apache.org/jira/browse/YUNIKORN-940> has several
warts:

1. The configuration in YUNIKORN-949
   <https://issues.apache.org/jira/browse/YUNIKORN-949> is done via the core
   scheduler configuration, leading to a random option on partitions which
   doesn't belong there and has nothing to do with scheduling.
2. Changing the frequency of the state dumps is done via the unsecured REST
   API. This is a potential denial-of-service vector.
3. Configuration V2 is now complete, which standardizes on using a ConfigMap
   to configure all YuniKorn options that make sense to be reconfigured.
   However, allowing the location to be changed at runtime makes no sense in
   a containerized environment.
4. Retrieving the state dumps requires mounting of external storage. This is
   necessarily a site-specific configuration and currently requires a custom
   Helm deployment.
5. The state dumps, though JSON, are emitted as text files with JSON appended
   to them, making parsing difficult.

To address these issues:

1. Deprecate existing REST API configuration for frequency, and make it a
   no-op now for security reasons. We can remove it completely in 2.0.
2. Deprecate the statedumpfilepath option on partitions. Ignore it for
   security reasons now (and warn if found), and remove completely in 2.0.
3. Disable the feature by default. To enable it, we should require setting a
   specific environment variable:
   - YUNIKORN_STATE_DUMP_LOCATION=/path/to/dir : This would be required to
     enable the feature at all. Making it an env var makes sense as it is not
     an option that should be reconfigured (or even visible) in
     configuration.
4. Via ConfigMap, we should allow the feature to be enabled / disabled and
   its frequency set. These options would have no effect if
   YUNIKORN_STATE_DUMP_LOCATION is not defined:
   - periodicStateDump.enabled: "true" | "false" (default "false")
   - periodicStateDump.frequency: "15m" (default value, do not allow more
     frequently than 1m intervals)
   - periodicStateDump.count: 10 (default value)
5. Create an empty directory /yunikorn-state in the Docker image to store
   state dumps.
6. Add support to Helm for enabling state dump support as well as setting
   custom mount options (including quota). Enabling support should set the
   env var YUNIKORN_STATE_DUMP_LOCATION=/yunikorn-state and mount this
   directory via the options specified.
7. Output a single JSON file per dump and remove the oldest files until at
   most periodicStateDump.count files remain. File names follow the pattern
   yunikorn-state-dump-YYYYMMDD-HHMM.json
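
The gating and retention rules proposed above could be sketched roughly as follows. This is only an illustration of the proposal, not YuniKorn code: the function names dump_dir_from_env and write_and_prune are hypothetical, and the sketch assumes the timestamped file names sort in chronological order, which holds for the YYYYMMDD-HHMM pattern.

```python
import json
import os
from datetime import datetime
from pathlib import Path

PREFIX = "yunikorn-state-dump-"


def dump_dir_from_env(env=os.environ):
    """Return the dump directory, or None when the feature is not enabled."""
    value = env.get("YUNIKORN_STATE_DUMP_LOCATION", "").strip()
    return Path(value) if value else None


def write_and_prune(dump_dir: Path, state: dict, now: datetime, count: int = 10) -> Path:
    """Write one JSON dump, then delete the oldest dumps beyond `count`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    dump_dir.mkdir(parents=True, exist_ok=True)

    # One plain JSON file per dump: yunikorn-state-dump-YYYYMMDD-HHMM.json
    target = dump_dir / f"{PREFIX}{now:%Y%m%d-%H%M}.json"
    target.write_text(json.dumps(state))

    # Sorting by name puts the oldest dumps first, so drop from the front.
    dumps = sorted(dump_dir.glob(f"{PREFIX}*.json"))
    for old in dumps[:max(0, len(dumps) - count)]:
        old.unlink()
    return target
```

With minute-granularity names and a minimum interval of 1m, two dumps should not collide on the same file name; a shorter interval would overwrite the earlier dump in this sketch.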
