Thanks.
On 05-05-2026 01:15 pm, Gregory Price wrote:
> In the scenario i'm talking about, a "write budget" is defined as a
> number of pages that are allows to be mapped writable in the page
> tables at any given time.
Agreed. I was thinking in the same context.
I am trying to bring the device perspective here, and would like to
discuss a few corner cases and possible solutions.
As I see it, solving the compressed memory problem has mainly these
aspects:
1) Allocation control: private/managed memory concept.
2) Write control: write-protected PTEs, write-controlled use cases like
ZSWAP
3) Proactive reclaim: optional methods to ease back-pressure using
memory shrinkers, ballooning, kswapd, promotion, etc. These methods
would be triggered by notifications/interrupts from the device.
Maybe these are not enough to cover some corner cases for CRAM.
I believe this thin-provisioned memory infra is susceptible to
'writes-above-media-capacity' corner cases (caused by not handling
device back-pressure notifications in time), whichever methods we use in
the kernel. Even with write-controlled methods like ZSWAP and proactive
reclaim, there could be corner cases where communication with the device
is broken and the write path is not immediately aware of it. Note that
the OCP spec [1] says the device should mark the memory location as
'poisoned' on 'over-capacity' writes.
So I have the following proposals / options for this scenario.
Option 1: Poisoned data management - This is about accepting that
poisoning of memory locations can happen much more frequently here than
with regular memory, so we need to figure out potential recovery
mechanisms in the host (not recovery of the data, but recovery from the
poison situation). But I guess folks will not be okay with this in
general, and I am not aware of any workloads where data poisoning is
tolerated (maybe caching workloads?).
Option 2 (preferred): Device-assisted write budgeting - This is
about a device-aware / device-assisted mechanism that lets the
write-controlled use cases (e.g. ZSWAP) know the 'safe number of writes'
that can be performed to the device (or the number of pages allowed to
be mapped writable in the page tables). This could work like a 'token
bucket' algorithm, where the device provides a 'budget / set of tokens'
to the host. The budget needs to be replenished periodically in the
device communication code path; if the host does not find a token,
writes cannot go ahead.
In short, communication with the device has to be maintained in order to
map pages writable. For an MVP, this could be a simple constraint of
checking the actual device capacity periodically to replenish the write
budget for CRAM. For other users of private nodes (GPU memory?), this
constraint may not be needed at all.
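To make the token-bucket idea in Option 2 a bit more concrete, here is a
minimal userspace sketch in C. It is not the planned RFC code. All names
(struct write_budget, budget_replenish, budget_take, budget_revoke) and
the replenish formula (free pages reported by the device, minus a safety
reserve, minus pages already mapped writable) are assumptions made for
illustration only; a kernel version would presumably use atomic_long_t
and the real device query path instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical write budget: one token == one page that may still be
 * made writable without risking an over-capacity write.
 */
struct write_budget {
	atomic_long tokens;
};

/*
 * Assumed to run in the device communication path after a successful
 * capacity query. The budget is recomputed from scratch each time, so
 * stale tokens never outlive the information they were derived from.
 */
static void budget_replenish(struct write_budget *b, long free_pages,
			     long reserve_pages, long writable_pages)
{
	long safe = free_pages - reserve_pages - writable_pages;

	if (safe < 0)
		safe = 0;
	atomic_store(&b->tokens, safe);
}

/* Assumed to run when the device cannot be reached: drop all tokens. */
static void budget_revoke(struct write_budget *b)
{
	atomic_store(&b->tokens, 0);
}

/*
 * Called before mapping n pages writable. Returns true and consumes n
 * tokens if enough are available; otherwise leaves the budget untouched
 * and returns false, so the caller must keep the pages write-protected.
 */
static bool budget_take(struct write_budget *b, long n)
{
	long cur = atomic_load(&b->tokens);

	assert(n > 0);
	while (cur >= n) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&b->tokens, &cur, cur - n))
			return true;
	}
	return false;
}
```

With this shape, a broken device link fails safe: if budget_replenish()
stops being called, the host can consume at most the tokens it already
holds, and budget_revoke() closes even that window once the failure is
detected.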
We are planning to send RFC code that fits into your CRAM infra to
discuss this poison management approach further.
[1]:
https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
~Arun George