On Fri, May 22, 2026 at 02:10:34PM +0530, Arun George/Arun George wrote:
> Thanks.
> 
> On 05-05-2026 01:15 pm, Gregory Price wrote:
> > In the scenario i'm talking about, a "write budget" is defined as a
> > number of pages that are allowed to be mapped writable in the page
> > tables at any given time.
> Agree. I was also in the same context.
> 
> I am trying to bring the device perspective here, and would like to 
> discuss a few corner cases and possible solutions.
> 
> As I see, solving the compressed memory problem statement has these 
> aspects mainly:
> 
> 1) Allocation control: private/managed memory concept.
> 2) Write control: write-protected PTEs, write-controlled use cases like 
> ZSWAP
> 3) Proactive reclaims: optional methods to ease back-pressure using 
> memory shrinkers, ballooning, kswapd, promotion etc. These methods will 
> be triggered based on notifications/interrupts from the device.
> 
> May be they are not enough to cover some corner cases for cram!
> 
>   I believe that this thin-provisioned memory infra is susceptible to 

I'm not understanding the "thin provisioned" terminology you're using
here.  Can you help define what you mean by thin-provision in this case?

> 'writes-above-media-capacity corner cases' (because of not handling 
> device back-pressure notifications in time) whichever methods we use in 
> the kernel. Even if we use write-controlled methods like ZSWAP and 
> pro-active reclaims, there could be corner cases where the communication 
> with the device could be broken and the write path is not aware of it 
> immediately. Note that OCP spec [1] says the device should mark the 
> memory location as 'poisoned' in 'over-capacity' writes.
> 

The intent is to use the low-watermark to prevent new allocations from
occurring, and to use the write-controls to prevent writes to the device
without interposition.

With a watermark set conservatively enough to cover the interrupt
delivery latency (some number of microseconds), that should be
sufficient to prevent poison from ever occurring at all.
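
As a back-of-the-envelope sketch of what "conservative enough" could
mean: the headroom has to cover whatever can be written between the
device crossing the watermark and the host reacting.  The helper below
is purely illustrative.  The function name, the worst-case assumption
that every written byte consumes a byte of media, and any numbers
plugged into it are mine, not taken from a device spec or the patch set.

```c
#include <stdint.h>

/*
 * Hypothetical helper: pages of headroom needed so that writes landing
 * during the interrupt-delivery window cannot exceed media capacity.
 * Assumes the worst case where each written byte consumes one byte of
 * media (i.e. the data does not compress at all).
 */
static uint64_t watermark_headroom_pages(uint64_t write_bytes_per_sec,
					 uint64_t latency_us,
					 uint64_t page_size)
{
	/* Bytes that can be written during the latency window. */
	uint64_t bytes = write_bytes_per_sec * latency_us / 1000000;

	/* Round up to whole pages. */
	return (bytes + page_size - 1) / page_size;
}
```

For example, an assumed 64 GB/s of write bandwidth and an assumed 100us
reaction time with 4KiB pages works out to 1563 pages of headroom.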

Since poison is only delivered *on read*, the system can go a long,
long time before poison is discovered. From the end-user perspective,
this poison is basically unacceptable.

So either we can always prevent poison from occurring, or the hardware
is not viable to support in a scaled production environment.

If you think a sufficiently conservative watermark + write-protection is
insufficient to defend against poison, then please let me know why.

> So I have the following proposals / options for this scenario.
> 
>     Option 1: Poisoned data management - This is about accepting that 
> poisoning of memory locations can happen in much more regular frequency 
> here than regular memories and we need to figure out potential recovery 
> mechanisms in host (not recovery of data; but recovery from the poison 
> situation). But I guess folks will not be okay with it in general, and I 
> am not aware of any workloads where data poisoning is tolerated (may be 
> caching workloads?).
> 

Given option 1, I would never put such a device into my production
environment.  The only reasonable action for handling poison is killing
the software, as the data is functionally corrupted.

>     Option 2 (preferred): Device assisted write budgeting - This is 
> about a device aware / assisted mechanism for the write-controlled 
> use-cases (Ex: ZSWAP) to know the 'safe number of  writes' that can be 
> performed to the device (Or allows to be mapped writable in the page 
> tables). This could be like a 'token bucket' algorithm, where the device 
> provides a 'budget / set of tokens' to the host. And it need to be 
> replenished periodically in the device communication code path; and if 
> the host does not find the token, writes cannot go ahead.
> 

When I say budgeting, I mean literally a budget of writable pages,
entirely controlled by software (mm/cram.c or zswap.c or whatever).

This has nothing to do with device operation / throttling / bandwidth
budgets etc.  It is simply a proposal of an optimization that allows the
user to say:  X out of Y possible pages may be mapped writable.
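
To make that concrete, here is a minimal userspace sketch of a counter
capped at X.  None of these names exist in the kernel or in the series;
the struct, the helpers, and the idea of charging from a write-fault
path are assumptions for illustration only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical software-only write budget: at most `limit` pages may be
 * mapped writable at any given time.  No device involvement.
 */
struct write_budget {
	atomic_long writable;	/* pages currently mapped writable */
	long limit;		/* X: max pages allowed writable at once */
};

static void write_budget_init(struct write_budget *b, long limit)
{
	atomic_init(&b->writable, 0);
	b->limit = limit;
}

/*
 * Would be called before upgrading a PTE to writable.  Returns false
 * when the budget is exhausted, in which case the caller has to
 * write-protect some other page (or otherwise back off) first.
 */
static bool write_budget_charge(struct write_budget *b)
{
	long cur = atomic_load(&b->writable);

	do {
		if (cur >= b->limit)
			return false;
		/* On failure, cur is reloaded and the limit is rechecked. */
	} while (!atomic_compare_exchange_weak(&b->writable, &cur, cur + 1));

	return true;
}

/* Would be called once a page has been write-protected again. */
static void write_budget_uncharge(struct write_budget *b)
{
	atomic_fetch_sub(&b->writable, 1);
}
```

With a limit of 2, two charges succeed, a third fails, and it succeeds
again after one uncharge.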

I don't think this would be part of an initial MVP for a compressed ram
service (regardless of whether it's cram.c or zswap.c).

> In short, the communication with the device has to be maintained to make 
> pages mapped writable. For MVP, this could be a simple constraint of 
> checking actual device capacity periodically to replenish write-budget 
> for CRAM. For other users of private nodes (GPU memory?), this 
> constraint may not be needed at all.
> 
> We are planning to send an RFC code which will fit into your CRAM infra 
> to discuss this poison management approach further.
> 

I'll try to get a new version out this week or next.  Apologies for the
lag on this series; I've had a number of disruptions and major movements
on the patch set since I last updated it in February.

~Gregory
