Thanks.

On 05-05-2026 01:15 pm, Gregory Price wrote:
> In the scenario i'm talking about, a "write budget" is defined as a
> number of pages that are allows to be mapped writable in the page
> tables at any given time.

Agree. I was thinking in the same context as well.

I am trying to bring the device perspective here, and would like to 
discuss a few corner cases and possible solutions.

As I see it, solving the compressed memory problem has mainly these
aspects:

1) Allocation control: private/managed memory concept.
2) Write control: write-protected PTEs, write-controlled use cases like 
ZSWAP
3) Proactive reclaim: optional methods to ease back-pressure using
memory shrinkers, ballooning, kswapd, promotion, etc. These methods will
be triggered by notifications/interrupts from the device.

Maybe these are not enough to cover some corner cases for CRAM.

I believe this thin-provisioned memory infra is susceptible to
'writes-above-media-capacity' corner cases (caused by not handling
device back-pressure notifications in time), whichever methods we use in
the kernel. Even with write-controlled methods like ZSWAP and proactive
reclaim, there could be corner cases where communication with the device
is broken and the write path is not immediately aware of it. Note that
the OCP spec [1] says the device should mark the memory location as
'poisoned' on 'over-capacity' writes.

So I have the following proposals / options for this scenario.

    Option 1: Poisoned data management - This is about accepting that
poisoning of memory locations can happen much more frequently here than
with regular memory, and that we need to figure out recovery mechanisms
in the host (not recovery of the data, but recovery from the poison
situation). But I guess folks will not be okay with this in general, and
I am not aware of any workloads where data poisoning is tolerated (maybe
caching workloads?).

    Option 2 (preferred): Device-assisted write budgeting - This is
about a device-aware / device-assisted mechanism that lets the
write-controlled use cases (e.g. ZSWAP) know the 'safe number of writes'
that can be performed to the device (or the number of pages allowed to
be mapped writable in the page tables). This could work like a 'token
bucket' algorithm, where the device provides a 'budget / set of tokens'
to the host. The budget needs to be replenished periodically in the
device communication code path, and if the host does not find a token,
writes cannot go ahead.
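
To make the token idea concrete, here is a rough userspace sketch of
such a budget. It is only an illustration: the struct, the function
names and the "replace rather than accumulate" policy are my
assumptions, not existing CRAM or kernel interfaces, and a real version
would need atomics or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write budget; none of these names exist in the kernel. */
struct write_budget {
	uint64_t tokens;     /* pages that may still be mapped writable */
	uint64_t max_tokens; /* cap so one report cannot grant too much */
};

static void budget_init(struct write_budget *b, uint64_t max_tokens)
{
	assert(b != NULL);
	b->tokens = 0; /* no writes until the device has been queried */
	b->max_tokens = max_tokens;
}

/*
 * Called from the device communication path. free_pages is the number
 * of pages the device reports as safely writable (assumed to be already
 * converted from device units). The budget is replaced, not added to,
 * so it always reflects the latest report.
 */
static void budget_replenish(struct write_budget *b, uint64_t free_pages)
{
	assert(b != NULL);
	b->tokens = free_pages < b->max_tokens ? free_pages : b->max_tokens;
}

/* Called when communication with the device is known to be broken. */
static void budget_revoke(struct write_budget *b)
{
	assert(b != NULL);
	b->tokens = 0;
}

/* Take one token before making a page writable; false means "wait". */
static bool budget_try_consume(struct write_budget *b)
{
	assert(b != NULL);
	if (b->tokens == 0)
		return false;
	b->tokens--;
	return true;
}
```

With this shape, the write path only ever calls budget_try_consume(),
and everything that knows about the device stays in the replenish and
revoke paths.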

In short, communication with the device has to be maintained in order
to map pages writable. For an MVP, this could be a simple constraint:
periodically check the actual device capacity to replenish the write
budget for CRAM. For other users of private nodes (GPU memory?), this
constraint may not be needed at all.
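
One possible way to express "communication has to be maintained" is to
let the budget expire unless the periodic capacity check refreshes it.
The sketch below is hypothetical userspace C under that assumption; the
names, the tick-based clock and the lifetime parameter are all made up
for illustration and are not part of CRAM or of the OCP spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lease on the write budget; names are illustrative only. */
struct budget_lease {
	uint64_t pages;      /* pages that may still be made writable */
	uint64_t expires_at; /* tick at which the budget stops being trusted */
	uint64_t lifetime;   /* ticks for which one capacity report is valid */
};

static void lease_init(struct budget_lease *l, uint64_t lifetime)
{
	assert(l != NULL);
	l->pages = 0;
	l->expires_at = 0; /* expired until the first successful check */
	l->lifetime = lifetime;
}

/* The periodic capacity check succeeded: refresh budget and deadline. */
static void lease_refresh(struct budget_lease *l, uint64_t now,
			  uint64_t free_pages)
{
	assert(l != NULL);
	l->pages = free_pages;
	l->expires_at = now + l->lifetime;
}

/*
 * Write path: refuse if the last report is too old (the device may be
 * unreachable) or if the budget is used up.
 */
static bool lease_try_map_writable(struct budget_lease *l, uint64_t now)
{
	assert(l != NULL);
	if (now >= l->expires_at || l->pages == 0)
		return false;
	l->pages--;
	return true;
}
```

If the device stops answering, no refresh happens, the lease runs out,
and the write path stops handing out writable mappings without needing
an explicit error notification.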

We are planning to send RFC code that fits into your CRAM infra, so
that we can discuss this poison management approach further.

[1]: 
https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf

~Arun George

