On Sat, Jan 07, 2006 at 12:36:25AM -0800, David S. Miller wrote:
> From: Eric Dumazet <[EMAIL PROTECTED]>
> Date: Sat, 07 Jan 2006 08:53:52 +0100
> 
> > I have no problem with this, since the biggest server I have is
> > 4-way, but are you sure big machines won't suffer from this single
> > spinlock?
> 
> It is the main question.
> 
> > Also I don't understand what you want to do after this single
> > spinlock patch.  How is it supposed to help the 'ip route flush
> > cache' problem?  In my case, I have about 600,000 dst entries:
> 
> I don't claim to have a solution to this problem currently.
> 
> Doing RCU and going through the whole DST GC machinery is overkill for
> an active system.  So, perhaps a very simple solution will do:
> 
> 1) On rt_run_flush(), do not rt_free(), instead collect all active
>    routing cache entries onto a global list, begin a timer to
>    fire in 10 seconds (or some sysctl configurable amount).
> 
> 2) When a new routing cache entry is needed, check the global
>    list appended to in #1 above first, failing that do dst_alloc()
>    as is done currently.
> 
> 3) If timer expires, rt_free() any entries in the global list.
> 
> The missing trick is how to ensure RCU semantics when reallocating
> from the global list.
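
For concreteness, the parked-entry scheme above might look something
like the following in net/ipv4/route.c terms.  This is only a rough
sketch to make sure I am reading the proposal correctly -- rt_park(),
rt_reuse_or_alloc(), rt_flush_timer, and the gc_list field are all
invented names, and it ignores the reallocation/RCU question entirely:

static LIST_HEAD(rt_flush_list);
static DEFINE_SPINLOCK(rt_flush_lock);
static struct timer_list rt_flush_timer;        /* set up elsewhere */

/* 1) rt_run_flush(): park the entry instead of rt_free()ing it. */
static void rt_park(struct rtable *rth)
{
        spin_lock_bh(&rt_flush_lock);
        list_add(&rth->gc_list, &rt_flush_list); /* gc_list: assumed new field */
        spin_unlock_bh(&rt_flush_lock);
        mod_timer(&rt_flush_timer, jiffies + 10 * HZ);
}

/* 2) Allocation path: try the parked list before dst_alloc(). */
static struct rtable *rt_reuse_or_alloc(void)
{
        struct rtable *rth = NULL;

        spin_lock_bh(&rt_flush_lock);
        if (!list_empty(&rt_flush_list)) {
                rth = list_entry(rt_flush_list.next, struct rtable, gc_list);
                list_del(&rth->gc_list);
        }
        spin_unlock_bh(&rt_flush_lock);
        return rth ? rth : (struct rtable *)dst_alloc(&ipv4_dst_ops);
}

/* 3) Timer expiry: rt_free() anything still parked. */
static void rt_flush_timer_fn(unsigned long dummy)
{
        struct rtable *rth;

        spin_lock_bh(&rt_flush_lock);
        while (!list_empty(&rt_flush_list)) {
                rth = list_entry(rt_flush_list.next, struct rtable, gc_list);
                list_del(&rth->gc_list);
                rt_free(rth);
        }
        spin_unlock_bh(&rt_flush_lock);
}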

The straightforward ways of doing this require a per-entry lock in
addition to the dst_entry reference count -- lots of read-side overhead.

More complex approaches use a per-entry generation number that is
incremented whenever the entry is added to or removed from the global
list.  When the generation number overflows, unconditionally rt_free()
the entry rather than adding it to the global list again.  Then there
needs to be some clever code on the read side to detect the case where
the generation number changes while a reference is being acquired.
And memory barriers.  Also lots of read-side overhead.  And it is now
-always- necessary to acquire a reference on the read side.
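
To make that read side concrete, here is one possible shape of the
check, assuming an atomic_t generation field is added to the entry
(neither that field nor dst_hold_checked() exists today -- this is
just a sketch of the idea, and the barrier placement would need real
scrutiny):

/* Assumed addition to struct dst_entry: atomic_t generation, bumped
 * (with appropriate barriers) every time the entry is moved onto or
 * off of the global reuse list. */

static inline int dst_hold_checked(struct dst_entry *dst)
{
        int gen = atomic_read(&dst->generation);        /* assumed field */

        atomic_inc(&dst->__refcnt);
        smp_mb();       /* order the increment against the recheck below */
        if (atomic_read(&dst->generation) != gen) {
                /* Entry was recycled while we looked; back out and
                 * have the caller retry the lookup. */
                atomic_dec(&dst->__refcnt);
                return 0;
        }
        return 1;
}

Note that the unconditional atomic_inc() is exactly the
always-acquire-a-reference overhead called out above.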

> The idea is that an active system will immediately repopulate itself
> with all of these entries just flushed from the table.
> 
> RCU really doesn't handle this kind of problem very well.  It truly
> excels when work is generated by process context work, not interrupt
> work.

Sounds like a challenge to me.  ;-)

Well, one possible way to attack Eric's workload might be the following:

o       Size the hash table to strike the appropriate balance between
        read-side search overhead and memory consumption.  Call the
        number of hash-chain headers N.

o       Create a hashed array of locks, sized to allow the update to
        proceed sufficiently quickly, with each lock guarding one
        partition of roughly N/M consecutive hash chains.  Call the
        number of locks M, probably a power of two.  This means that
        up to M CPUs can work on the update in parallel.

o       Create an array of M^2 list headers (call it xfer[][]); since
        this is only needed during an update, it can be allocated and
        deallocated on demand.  (I, with my big-server background,
        would probably just create the array statically, since M is
        not likely to be very large.  But your mileage may vary.  And
        you really only need M*(M-1) list headers, since an entry
        never transfers to its own partition, but that makes the index
        calculation a bit more annoying.)

o       Use a two-phase update.  In the first phase, each updating
        CPU acquires the corresponding lock and removes entries from
        the corresponding partition of the hash table.  If the new
        location of a given entry falls into the same partition, it
        is added back to the appropriate hash chain of that partition.
        Otherwise, add the entry to xfer[dst][src], where "src" and
        "dst" are the indexes of the source and destination partitions.

o       When all CPUs finish removing entries from their partition,
        they check into a barrier.  Once all have checked in, they
        can start the second phase of the update.

o       In the second phase, each CPU removes the entries destined
        for its partition from the xfer array and adds them to the
        hash chains they belong in.
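
To make the two-phase idea concrete, here is a user-space model of it
(plain C and pthreads, nothing kernel-specific).  Every name in it is
made up for illustration, N is assumed to divide evenly by M, and the
readers and single-entry update paths are omitted:

#include <pthread.h>

#define N 1024                          /* hash-chain headers */
#define M 4                             /* partitions (and locks), power of two */

struct entry {
        struct entry *next;
        unsigned key;
};

static struct entry *table[N];
static pthread_mutex_t part_lock[M];    /* pthread_mutex_init()ed elsewhere */
static struct entry *xfer[M][M];        /* xfer[dst][src] staging lists */
static pthread_barrier_t phase_barrier; /* initialized for M threads */
static unsigned hash_rnd;               /* new hash seed chosen at flush time */

static unsigned hash(unsigned key)
{
        return (key ^ hash_rnd) % N;
}

/* Each of the M updater threads calls this with its own partition index. */
static void rehash_partition(int src)
{
        struct entry *e, *next, *stay = NULL;
        int i;

        /* Phase 1: drain this partition, sorting entries by destination. */
        pthread_mutex_lock(&part_lock[src]);
        for (i = src * (N / M); i < (src + 1) * (N / M); i++) {
                for (e = table[i], table[i] = NULL; e != NULL; e = next) {
                        int dst = hash(e->key) / (N / M);

                        next = e->next;
                        if (dst == src) {
                                e->next = stay;         /* stays local */
                                stay = e;
                        } else {
                                e->next = xfer[dst][src];
                                xfer[dst][src] = e;
                        }
                }
        }
        for (e = stay; e != NULL; e = next) {   /* reinsert local entries */
                next = e->next;
                e->next = table[hash(e->key)];
                table[hash(e->key)] = e;
        }
        pthread_mutex_unlock(&part_lock[src]);

        /* Every thread must finish phase 1 before any starts phase 2. */
        pthread_barrier_wait(&phase_barrier);

        /* Phase 2: pull entries destined for this partition out of xfer[][]. */
        pthread_mutex_lock(&part_lock[src]);
        for (i = 0; i < M; i++) {
                for (e = xfer[src][i], xfer[src][i] = NULL; e != NULL; e = next) {
                        next = e->next;
                        e->next = table[hash(e->key)];
                        table[hash(e->key)] = e;
                }
        }
        pthread_mutex_unlock(&part_lock[src]);
}

Note that each xfer[dst][src] cell is written only by thread "src"
during phase 1 and read only by thread "dst" during phase 2, so the
barrier is the only synchronization the staging array itself needs.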

Some commentary and variations, in the hope that this inspires someone
to come up with an even better idea:

o       Unless M is at least three, there is no performance gain over
        a single global lock with a single CPU doing the update, since
        each element that changes partitions now undergoes four list
        operations (a removal and an insertion in each phase) rather
        than just two.

o       The xfer[][] array must have each entry cache-aligned, or
        you lose big on cacheline effects.  Note that it is -not-
        sufficient to simply align the rows or the columns, since
        each CPU has its own column when inserting into xfer[][] and
        its own row when removing from it.  (A declaration sketch
        follows this list.)

o       The data-skew effects (CPUs whose partitions hold few entries
        idling at the barrier while the heavily loaded partitions are
        still being processed) are less severe if this procedure runs
        from process context; a spinning barrier must be used
        otherwise.  But note that the per-partition locks could remain
        spinlocks -- only the barrier need involve sleeping (in case
        that helps; I am getting a bit ahead of my understanding of
        this part of the kernel).

        So I half-agree with Dave -- this works better if the bulk
        update is done in process context.  However, updates involving
        single entries being individually added to and removed from
        the hash table can still be done from interrupt context.

o       One (more complex!) way to reduce the effect of data skew is
        to partition the hash table based on the number of elements in
        each partition rather than on the number of chains, but this
        would require a second barrier that completes before the locks
        are dropped, in order to avoid messing up concurrent
        single-element updates.

        And this only handles the phase-1 data-skew effects. It is
        possible to handle phase-2 data skew as well, but it takes an
        up-front full-table scan and so it is not clear how it could
        possibly be a performance win.

o       Note that routing slows down significantly while an update is
        in progress, because on average only half of the entries are
        in the hash table during the update.  I have no idea whether
        or not this is acceptable.  I am assuming that the same
        mechanism that prevents two concurrent route-cache misses
        from inserting two entries for the same route would also
        prevent adding a new entry while one is already sitting in the
        xfer array.
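
(Continuing the user-space model above, the cache-alignment point
might look like the following, replacing the plain pointer array.
The 64-byte line size is an assumption; in the kernel one would
presumably spell the alignment ____cacheline_aligned_in_smp.)

/* Pad each xfer[][] slot to its own cache line so that the phase-1
 * writers (one per column) and the phase-2 readers (one per row)
 * never share a line. */
struct xfer_slot {
        struct entry *head;
} __attribute__((aligned(64)));

static struct xfer_slot xfer[M][M];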

Thoughts?

                                                        Thanx, Paul