I wrote:
> The way the attached patch attacks this is for the shared-lock access
> case to simply set the page's LRU counter to zero, without bumping up
> the LRU counters of the other pages as the normal adjustment would do.
> ...
> I'm not totally happy with this heuristic, though, and was
> wondering if anyone had a better idea.  Anyone seen a lock-free
> data structure for LRU or approximately-LRU state tracking?

I've come up with what seems a slightly better way to do this.  The idea
is to replace the current structure for tracking which page is the least-
recently-used with this:

typedef struct SlruSharedData
{
    ...

    /*----------
     * We mark a page "most recently used" by setting
     *        page_lru_count[slotno] = ++cur_lru_count;
     * The oldest page is therefore the one with the highest value of
     *        cur_lru_count - page_lru_count[slotno]
     *----------
     */
    int            cur_lru_count;

    ...

    int            page_lru_count[NUM_SLRU_BUFFERS];

    ...

which makes the SlruRecentlyUsed macro look like

#define SlruRecentlyUsed(shared, slotno) \
    do { \
        int        new_lru_count = (shared)->cur_lru_count; \
        if (new_lru_count != (shared)->page_lru_count[slotno]) { \
            (shared)->cur_lru_count = ++new_lru_count; \
            (shared)->page_lru_count[slotno] = new_lru_count; \
        } \
    } while (0)

and SlruSelectLRUPage() has to look for the maximum value of
"cur_count - shared->page_lru_count[slotno]" rather than just
"shared->page_lru_count[slotno]" as before.  This seems like a win
in any case since it takes cycles out of the commonly used path at
a small increase in the cost of SlruSelectLRUPage --- but in that
routine you are about to do I/O anyway, so a few extra subtractions
are negligible.

However, the real reason for doing this is that I think it's OK for
the proposed SimpleLruReadPage_ReadOnly routine to apply this form
of SlruRecentlyUsed even though it holds only a shared lock.  Assuming
that int reads and writes are atomic, the only possible failures are

1. If a process running the macro is delayed, it might store a stale
(hence too small) value back into cur_lru_count or a page_lru_count
array element after someone else has incremented them to a larger value.

2. Two processes might read the same cur_lru_count value at the same
time, so that one of their increments is lost.  This has the same end
effect as #1, though.

Given reasonable care in SlruSelectLRUPage (see attached excerpt), we
can defend against these scenarios and usually still make a reasonable
choice of which page to evict.  In any case, the worst possible scenario
is that we make a nonoptimal choice of page to evict due to confused
lru_count state.  This price seems well worth the chance to reduce
contention for shared memory.

Thoughts, objections?

                        regards, tom lane


        /*
         * If we find any EMPTY slot, just select that one. Else locate the
         * least-recently-used slot to replace.
         *
         * Normally the page_lru_count values will all be different and so
         * there will be a well-defined LRU page.  But since we allow
         * concurrent execution of SlruRecentlyUsed() within
         * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
         * acquire the same lru_count values.  In that case we break ties by
         * choosing the furthest-back page.
         *
         * In no case will we select the slot containing latest_page_number
         * for replacement, even if it appears least recently used.
         *
         * Notice that this next line forcibly advances cur_lru_count to a
         * value that is certainly beyond any value that will be in the
         * page_lru_count array after the loop finishes.  This ensures that
         * the next execution of SlruRecentlyUsed will give us good data,
         * even if it's against a page that has the current counter value.
         */
        cur_count = (shared->cur_lru_count)++;
        best_delta = -1;
        bestslot = 0;            /* no-op, just keeps compiler quiet */
        best_page_number = 0;    /* ditto */
        for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
        {
            int            this_delta;
            int            this_page_number;

            if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
                return slotno;
            this_delta = cur_count - shared->page_lru_count[slotno];
            if (this_delta < 0)
            {
                /*
                 * Clean up in case shared updates have caused cur_count
                 * increments to get "lost".  We back off the page counts,
                 * rather than trying to increase cur_count, to avoid any
                 * question of infinite loops or failure in the presence of
                 * wrapped-around counts.
                 */
                shared->page_lru_count[slotno] = cur_count;
                this_delta = 0;
            }
            this_page_number = shared->page_number[slotno];
            if ((this_delta > best_delta ||
                 (this_delta == best_delta &&
                  ctl->PagePrecedes(this_page_number, best_page_number))) &&
                this_page_number != shared->latest_page_number)
            {
                bestslot = slotno;
                best_delta = this_delta;
                best_page_number = this_page_number;
            }
        }

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

Reply via email to