On 10/20/2015 10:50 AM, Matan Barak wrote:
> 
> 
> On 10/19/2015 6:27 PM, Doug Ledford wrote:
>> On 10/19/2015 10:20 AM, Matan Barak wrote:
>>> On 10/19/2015 3:23 PM, Doug Ledford wrote:
>>
>>>> This is a major cause of the slowness.  Unless you have a specific need
>>>> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
>>>> this up separately, so I'll just mention it here.  Per entry locks help
>>>> reduce contention when you have lots of concurrent accessors.  Using
>>>> rwlocks helps reduce contention when you have a read-mostly entry
>>>> that is
>>>> only occasionally changed.  But every lock and every unlock (just like
>>>> every atomic access) still requires a locked memory cycle.  That means
>>>> every lock acquire and every lock release requires a synchronization
>>>> event between all CPUs.  Using per-entry locks means that every entry
>>>> triggers two of those synchronization events.  On modern Intel CPUs,
>>>> they cost about 32 cycles per event.  If you are going to do something,
>>>> and it can't be done without a lock, then grab a single lock and do it
>>>> as fast as you can.  Only in rare cases would you want per-entry locks.
>>>>
>>>
>>> I agree that every rwlock costs us a locked access. However, let's look at
>>> the common scenario here.
>>
>> No, let's not.  Especially if your "common scenario" causes you to
>> ignore pathological bad behavior.  Optimizing for the common case is not
>> an excuse to enable pathological behavior.
>>
>>> I think that in stable production systems, the
>>> IPs our rdma-cm stack uses (and thus the GIDs) should be pretty stable.
>>
>> Probably true.
>>
>>> Thus, most accesses should be read calls.
>>
>> Probably true.
>>
>>> That's why IMHO read-write
>>> access makes sense here.
>>
>> Probably true.
>>
>>> Regarding single-lock vs per-entry lock, it really depends on how often
>>> an entry could be updated while another entry is being used by an
>>> application.
>>
>> No, it depends on more than that.  In addition to the frequency of
>> updates versus lookups, there is also the issue of lookup cost and
>> update cost.  Using per entry locks makes both the lookup cost and the
>> update cost of anything other than the first gid in the table grow
>> exponentially.  A good way to demonstrate this would be to create 100
>> vlans on a RoCE device, then run the cmtime test between two hosts using
>> the 100th vlan IP at each end.  Using per-entry locks, the performance
>> of this test relative to using the first IP on the device will be
>> pathologically bad.
>>
> 
> That's correct, we "waste" #entries * <time of lock and unlock>. But we
> have to note this is mainly during the connecting part.

Yes, and I spent a *month* of my life tracking down why a customer's
real-world application was failing in production (but not in their
staging environment).  It all boiled down to this: their app had to
start up more than about 700-900 connections at startup, their setup
included dual port controllers with both ports in use, and some of
those connections had to look up GIDs on the second port.  The abysmal
per-gid lookup rate meant that the kernel had a hard ceiling on how
many GIDs it could look up per second.  Once that was hit, the app went
into a permanent cycle of dropping failed connections and re-opening
new ones, and it would simply never catch up.

So it concerns me greatly to hear you say "But we have to note this is
mainly during the connecting part", implying that some wasted cycles on
connection startup are OK.  They aren't.  Not ever.  Production systems
fall over dead precisely because of things like this.
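
To make that cost concrete, here is a minimal sketch of the difference
between per-entry locks and a single per-table rwlock during a scan.
The names here (struct gid_tbl, find_gid_per_entry, find_gid_per_table)
are made up for illustration and this is not the actual ib_core cache
code; the point is only that a miss on the Nth entry pays N lock/unlock
pairs in the first version and exactly one pair in the second:

/*
 * Hypothetical sketch, not the actual ib_core cache code: contrasts
 * scanning a GID table protected by per-entry rwlocks with a scan
 * under a single per-table rwlock.
 */
#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

struct gid_entry {
	union ib_gid	gid;
	bool		valid;
	rwlock_t	lock;	/* per-entry lock: 2 atomics per entry scanned */
};

struct gid_tbl {
	int		sz;
	struct gid_entry *entries;
	rwlock_t	lock;	/* per-table lock: 2 atomics per whole scan */
};

/* Per-entry locking: a miss on the Nth entry pays N lock/unlock pairs. */
static int find_gid_per_entry(struct gid_tbl *tbl, const union ib_gid *gid)
{
	int i;

	for (i = 0; i < tbl->sz; i++) {
		bool match;

		read_lock(&tbl->entries[i].lock);
		match = tbl->entries[i].valid &&
			!memcmp(&tbl->entries[i].gid, gid, sizeof(*gid));
		read_unlock(&tbl->entries[i].lock);
		if (match)
			return i;
	}
	return -1;
}

/* Per-table locking: the whole scan pays a single lock/unlock pair. */
static int find_gid_per_table(struct gid_tbl *tbl, const union ib_gid *gid)
{
	int i, ret = -1;

	read_lock(&tbl->lock);
	for (i = 0; i < tbl->sz; i++) {
		if (tbl->entries[i].valid &&
		    !memcmp(&tbl->entries[i].gid, gid, sizeof(*gid))) {
			ret = i;
			break;
		}
	}
	read_unlock(&tbl->lock);
	return ret;
}

With 100 vlans to scan past, that's roughly 200 locked operations per
lookup in the per-entry version (at ~32 cycles each, per the numbers
quoted above) versus 2 in the per-table version.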

>> 3)  Consider optimizing the code by requiring that any time a GID is
>> added, it must take the first empty spot, and any time one is deleted,
>> any valid entries after it must be moved up to fill the empty spot, then
>> optimize find_gid to quit once it finds an empty slot because it knows
>> the rest of the table is empty.  Given that mlx4 keeps a huge 128-entry
>> table array, this can be very helpful.
> 
> I don't think we want to change existing GID entries. Entities sometimes
> want to refer to GIDs via GID indices. Changing the GID order under
> their feet could be problematic.

I had a feeling this one might be an issue.  There is, however, a
solution that would resolve the problem.  The issue here is that you
have an array of GIDs that you want to be able to both scan efficiently
and access directly.  The array is great for direct access, but the scan
efficiently part is a total looser because you scan so many empty
entries most of the time.  You could change it so that a gid port table
is actually a combination of 1) an array of gid_port_table_element
pointers and 2) a linked list of gid_port_table_element structs.  Then
you only allocate a struct when needed, link it into the list, assign it
an index, set the pointer in the array at point index to point to your
struct, and that way you can still direct access the elements, but can
also scan efficiently.  It's more complex, but would speed up the
scanning considerably.
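
Roughly something like this (a rough sketch only -- the names follow the
gid_port_table_element idea above but are otherwise hypothetical, error
handling is minimal, and it assumes a single per-table rwlock as
discussed earlier in the thread).  Indices handed out to users stay
stable because slots are never compacted; a scan just walks the list of
allocated entries:

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

struct gid_port_table_element {
	union ib_gid		gid;
	int			index;	/* stable index handed out to users */
	struct list_head	list;	/* chained for efficient scanning */
};

struct gid_port_table {
	int				sz;
	struct gid_port_table_element	**slots; /* index -> element, NULL if free */
	struct list_head		head;	 /* allocated elements only */
	rwlock_t			lock;	 /* single per-table lock */
};

/* Add a GID into the first free slot; its index never changes afterward. */
static int gid_table_add(struct gid_port_table *tbl, const union ib_gid *gid)
{
	struct gid_port_table_element *el;
	int i;

	el = kzalloc(sizeof(*el), GFP_KERNEL);
	if (!el)
		return -ENOMEM;

	write_lock(&tbl->lock);
	for (i = 0; i < tbl->sz; i++) {
		if (!tbl->slots[i]) {
			el->gid = *gid;
			el->index = i;
			tbl->slots[i] = el;
			list_add_tail(&el->list, &tbl->head);
			write_unlock(&tbl->lock);
			return i;
		}
	}
	write_unlock(&tbl->lock);
	kfree(el);
	return -ENOSPC;
}

/* Scan walks only allocated entries; direct access is just tbl->slots[ix]. */
static int gid_table_find(struct gid_port_table *tbl, const union ib_gid *gid)
{
	struct gid_port_table_element *el;
	int ret = -1;

	read_lock(&tbl->lock);
	list_for_each_entry(el, &tbl->head, list) {
		if (!memcmp(&el->gid, gid, sizeof(*gid))) {
			ret = el->index;
			break;
		}
	}
	read_unlock(&tbl->lock);
	return ret;
}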

>> 4)  Does the table port struct have to use a mutex?  Could it be changed
>> to a rwlock instead?  If so, you could change write_gid so that it
>> assumes it is called with the read lock on the port already held and it
>> merely updates itself to a write lock when it is ready to write the
>> entry to the array and update the card.  It would then downgrade back to
>> a read lock on return and the calling function would hold the single
>> read lock from the time it is ready to call find_gid until write_gid has
>> returned.
> 
> AFAIK, read_locks can't be "upgraded" to write_locks. In addition, the
> cost of the locks here is negligible (assuming it's a per-table rwlock
> and not a per-entry lock). In all code paths that use the table mutex, we
> call the vendor's command to write the GID table. Updating the GID table
> usually costs a lot more than the mutex locks.

I thought the kernel allowed upgrades, but I admit I didn't check.
Anyway, it was merely to prevent the case where you drop the read lock
and try to take the write lock and other readers rush in.  You could
also take the write lock from the beginning, but that might hurt
concurrency.
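
Since kernel rwlocks can't be upgraded, and since the vendor command
that programs the GID table presumably sleeps (which is why the mutex
is there today), taking the write side up front would look roughly like
the sketch below.  It assumes a per-table rw_semaphore rather than a
spinning rwlock_t so the vendor call can sleep under it, and the helper
names (find_free_gid_slot, vendor_write_gid) are hypothetical
stand-ins.  Lookups would take the same semaphore with down_read():

#include <linux/rwsem.h>
#include <rdma/ib_verbs.h>

struct gid_port_tbl {
	struct rw_semaphore	rwsem;	/* replaces the per-port mutex */
	/* ... entries ... */
};

int find_free_gid_slot(struct gid_port_tbl *tbl, const union ib_gid *gid);
int vendor_write_gid(struct gid_port_tbl *tbl, int ix, const union ib_gid *gid);

static int add_gid(struct gid_port_tbl *tbl, const union ib_gid *gid)
{
	int ix, ret;

	/* Exclusive for the whole lookup + update, so no reader or other
	 * writer can slip in between the find and the write. */
	down_write(&tbl->rwsem);
	ix = find_free_gid_slot(tbl, gid);
	if (ix < 0) {
		up_write(&tbl->rwsem);
		return ix;
	}
	ret = vendor_write_gid(tbl, ix, gid);	/* may sleep; rwsem allows it */
	up_write(&tbl->rwsem);
	return ret;
}

Writers should be rare given that the GIDs are mostly stable, so the
concurrency hit from taking the write side up front is probably small
in practice.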

>> 5)  I noticed that ib_cache_gid_find_by_port calls find_gid without even
>> taking the read lock; that needs to be fixed.
> 
> find_gid takes a per-entry read_lock. Which lock do you think is missing
> here?

The other places hold the table mutex.

> Although it's mostly used in the connection part,

Please don't say that as though it makes the inefficiencies in this code
OK.  It doesn't.  Being on the connection startup or teardown path is
not an excuse.

> I think the locking
> refactoring here is a good performance enhancement.
> Because it is an enhancement, do you agree we could work here by gradual
> changes and change the locking scheme later?

I never intended it to be during the rc phase.  But it would be good to
tackle soon.


-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD

