Hi Paul,

I've been reading Chapter 5 of perfbook and thoroughly enjoying the
deep dive into counting algorithms. However, while studying the
`count_lim` approach in Section 5.3.3 in the context of the
structure-allocation limit problem from Quick Quiz 5.3, I noticed what
seems to be a mismatch between the stated requirement and the actual
problem the code solves best.

It appears to me that the `count_lim` implementation is perfectly
tailored for a Producer-Consumer model, rather than an Allocate-Free
model. To visualize the specific Producer-Consumer requirement it
solves, I picture each thread as a "direct-sales factory": Once a
local factory produces enough goods to hit its local capacity, it
ships the products to a central warehouse, dynamically adjusts its
local limit based on the central warehouse's remaining capacity, and
retains half for local sales. If a factory receives an order and local
inventory is insufficient, it pulls products from the central
warehouse. If the central warehouse is also empty, the order simply
fails. Even if other factories might have surplus goods, the model
dictates that it is better to let the consumer wait for new local
production than to undertake the expensive task of pulling inventory
from every other factory's local stock.

This model is extremely valuable in real-world scenarios, such as
generating TLS session tickets. There, tickets must be strictly unique
(use-and-destroy, never returned) to prevent replay attacks, and
generating them is CPU-intensive, so background production must be
limited to prevent hoarding. `count_lim` fits this model perfectly.

However, this requirement differs significantly from the
"Allocate-Free" model implied in Quick Quiz 5.3. I've summarized their
differences here:

| Aspect | Producer-Consumer Model | Allocate-Free Model |
| ---- | ---- | ---- |
| `add_count` semantics | Produces a resource out of thin air (consuming CPU/time) and adds it to inventory. | Acquires a resource from a strictly finite global pool. |
| `sub_count` semantics | Consumes the resource completely (it leaves the system). | Returns the resource to the global pool. |
| Meaning of the limit | Prevents endless over-production/hoarding when there are no consumers. | A hard physical limit; allocation fails when the pool is exhausted. |
| `sub_count` failure | Necessary: if production lags and inventory is empty, consumers must wait or fail. | Conceptually unnecessary: assuming correct program logic (no double frees), returning an acquired resource to the pool should never fail. |

The `count_lim` approach works brilliantly for Producer-Consumer
scenarios. However, applying it to an Allocate-Free scenario
introduces a critical issue: cross-thread freeing (as discussed in
Quick Quiz 5.35) can easily cause a legitimate `sub_count` to fail. In
an Allocate-Free model, a resource return that fails because of
counter implementation details (rather than an actual logic bug such
as a double free) is highly counterintuitive. While Section 5.4's
exact limit counters solve this, they do so at a massive performance
cost. The counters of Sections 5.3.3 and 5.3.4, unfortunately, reject
these legitimate cross-thread frees.

Assuming the caller enforces strict lifecycle correctness (no
double-frees or frees of unallocated resources), I experimented with
two alternative implementations for the Allocate-Free model that
ensure `sub_count` never fails while maintaining performance on par
with Sections 5.3.3 and 5.3.4:

- Signed Quota Counter: Converting the counters to signed integers,
allowing `sub_count` to drive the local counter into negative values
(representing pending returns), and balancing with the global counter
only at a negative threshold.

- Available Capacity Tracking: Inverting the logic to track "free
space" instead of "used space" (similar to a semaphore). Here,
`add_count` (allocation) decrements the available quota, and
`sub_count` (freeing) increments it.

Have you ever considered framing the `count_lim` examples explicitly
around the Producer-Consumer pattern, or explored using
signed/capacity-tracking counters to achieve zero-fail fast-path
freeing for resource management scenarios?

I would love to hear your thoughts on this distinction.

                                                        Best regards,
npc1054657282

---------------------------------------------------------
Appendix 1: Signed Quota Implementation
---------------------------------------------------------
static __inline__ void globalize_count(void);
static __inline__ void balance_count(void);

long __thread counter = 0;
long __thread countermax = 0;
// long __thread countermin = 0;
long globalcountmax = 10000;
long globalcount = 0;
long globalreserve = 0;
long *counterp[NR_THREADS] = { NULL };
DEFINE_SPINLOCK(gblcnt_mutex);

static __inline__ int add_count(unsigned short delta)
{
    if (countermax - counter >= (long)delta) {
        WRITE_ONCE(counter, counter + delta);
        return 1;
    }
    spin_lock(&gblcnt_mutex);
    globalize_count();
    if (globalcountmax -
        globalcount - globalreserve < (long)delta) {
        spin_unlock(&gblcnt_mutex);
        return 0;
    }
    globalcount += delta;
    balance_count();
    spin_unlock(&gblcnt_mutex);
    return 1;
}

static __inline__ int sub_count(unsigned short delta)
{
    /* Fast path: let counter go as low as -countermax,
     * i.e., the commented-out countermin. */
    // if (counter - countermin >= (long)delta) {
    if (counter + countermax >= (long)delta) {
        WRITE_ONCE(counter, counter - delta);
        return 1;
    }
    spin_lock(&gblcnt_mutex);
    globalize_count();
    globalcount -= delta;
    balance_count();
    spin_unlock(&gblcnt_mutex);
    return 1;
}

static __inline__ unsigned long read_count(void)
{
    int t;
    long sum;

    spin_lock(&gblcnt_mutex);
    sum = globalcount;
    for_each_thread(t) {
        if (counterp[t] != NULL)
            sum += READ_ONCE(*counterp[t]);
    }
    spin_unlock(&gblcnt_mutex);
    return (unsigned long)sum;
}

static __inline__ void globalize_count(void)
{
    globalcount += counter;
    counter = 0;
    globalreserve -= countermax;
    countermax = 0;
}

static __inline__ void balance_count(void)
{
    countermax = globalcountmax -
                 globalcount - globalreserve;
    countermax /= 2 * num_online_threads();
    globalreserve += countermax;
    // countermin = - countermax;
}

---------------------------------------------------------
Appendix 2: Available Capacity Tracking
---------------------------------------------------------
static __inline__ void globalize_count(void);
static __inline__ void balance_count(void);

unsigned long __thread counter = 0;
unsigned long __thread countermax = 0;
#define GLOBALCOUNTMAX 10000UL /* macro: a const variable is not a
                                  constant expression for this initializer */
const unsigned long globalcountmax = GLOBALCOUNTMAX;
unsigned long globalcount = GLOBALCOUNTMAX; /* all capacity starts free */
unsigned long *counterp[NR_THREADS] = { NULL };
DEFINE_SPINLOCK(gblcnt_mutex);

static __inline__ int add_count(unsigned long delta)
{
    if (counter >= delta) {
        WRITE_ONCE(counter, counter - delta);
        return 1;
    }
    spin_lock(&gblcnt_mutex);
    globalize_count();
    if (globalcount < delta) {
        spin_unlock(&gblcnt_mutex);
        return 0;
    }
    globalcount -= delta;
    balance_count();
    spin_unlock(&gblcnt_mutex);
    return 1;
}

static __inline__ int sub_count(unsigned long delta)
{
    if (countermax - counter >= delta) {
        WRITE_ONCE(counter, counter + delta);
        return 1;
    }
    spin_lock(&gblcnt_mutex);
    globalize_count();
    globalcount += delta;
    balance_count();
    spin_unlock(&gblcnt_mutex);
    return 1;
}

static __inline__ unsigned long read_count(void)
{
    int t;
    unsigned long sum;

    spin_lock(&gblcnt_mutex);
    sum = globalcount;
    for_each_thread(t) {
        if (counterp[t] != NULL)
            sum += READ_ONCE(*counterp[t]);
    }
    spin_unlock(&gblcnt_mutex);
    return globalcountmax - sum; /* convert free space to used count */
}

static __inline__ void globalize_count(void)
{
    globalcount += counter;
    counter = 0;
    countermax = 0;
}

static __inline__ void balance_count(void)
{
    countermax = globalcount;
    countermax /= num_online_threads();
    counter = countermax / 2; /* take half the local cap up front */
    globalcount -= counter;
}
