Re: the XSK buffer pool needs be to reverted

Robin Murphy Wed, 01 Jul 2020 02:46:53 -0700

On 2020-06-30 20:08, Jonathan Lemon wrote:

On Mon, Jun 29, 2020 at 02:15:16PM +0100, Robin Murphy wrote:

On 2020-06-27 08:02, Christoph Hellwig wrote:

On Fri, Jun 26, 2020 at 01:54:12PM -0700, Jonathan Lemon wrote:

On Fri, Jun 26, 2020 at 09:47:25AM +0200, Christoph Hellwig wrote:


Note that this is somewhat urgent, as various of the APIs that the code
is abusing are slated to go away for Linux 5.9, so this addition comes
at a really bad time.


Could you elaborate on what is upcoming here?


Moving all these calls out of line, and adding a bypass flag to avoid
the indirect function call for IOMMUs in direct mapped mode.

Also, on a semi-related note, are there limitations on how many pages
can be left mapped by the iommu?  Some of the page pool work involves
leaving the pages mapped instead of constantly mapping/unmapping them.


There are, but I think for all modern IOMMUs they are so big that they
don't matter.  Maintaines of the individual IOMMU drivers might know
more.


Right - I don't know too much about older and more esoteric stuff like POWER
TCE, but for modern pagetable-based stuff like Intel VT-d, AMD-Vi, and Arm
SMMU, the only "limits" are such that legitimate DMA API use should never
get anywhere near them (you'd run out of RAM for actual buffers long
beforehand). The most vaguely-realistic concern might be a pathological
system topology where some old 32-bit PCI device doesn't have ACS isolation
from your high-performance NIC such that they have to share an address
space, where the NIC might happen to steal all the low addresses and prevent
the soundcard or whatever from being able to map a usable buffer.

With an IOMMU, you typically really *want* to keep a full working set's
worth of pages mapped, since dma_map/unmap are expensive while dma_sync is
somewhere between relatively cheap and free. With no IOMMU it makes no real
difference from the DMA API perspective since map/unmap are effectively no
more than the equivalent sync operations anyway (I'm assuming we're not
talking about the kind of constrained hardware that might need SWIOTLB).

On a heavily loaded box with iommu enabled, it seems that quite often
there is contention on the iova_lock.  Are there known issues in this
area?


I'll have to defer to the IOMMU maintainers, and for that you'll need
to say what code you are using.  Current mainlaine doesn't even have
an iova_lock anywhere.


Again I can't speak for non-mainstream stuff outside drivers/iommu, but it's
been over 4 years now since merging the initial scalability work for the
generic IOVA allocator there that focused on minimising lock contention, and
it's had considerable evaluation and tweaking since. But if we can achieve
the goal of efficiently recycling mapped buffers then we shouldn't need to
go anywhere near IOVA allocation either way except when expanding the pool.



I'm running a set of patches which uses the page pool to try and keep
all the RX buffers mapped as the skb goes up the stack, returning the
pages to the pool when the skb is freed.

On a dual-socket 12-core Intel machine (48 processors), and 256G of
memory, when iommu is enabled, I see the following from 'perf top -U',
as the hottest function being run:

-   43.42%  worker      [k] queued_spin_lock_slowpath
    - 43.42% queued_spin_lock_slowpath
       - 41.69% _raw_spin_lock_irqsave
          + 41.39% alloc_iova
          + 0.28% iova_magazine_free_pfns
       + 1.07% lock_sock_nested

Which likely is heavy contention on the iovad->iova_rbtree_lock.
(This is on a 5.6 based system, BTW).  More scripts and data are below.
Is there a way to reduce the contention here?

Hmm, how big are your DMA mappings? If you're still hitting the rbtree alot, that most likely implies that either you're making giant IOVAallocations that are too big to be cached, or you're allocating/freeingIOVAs in a pathological pattern that defeats the whole magazine cachemechanism (It's optimised for relatively-balanced allocation and freeingof sizes up order 6). On a further hunch, does the"intel_iommu=forcedac" option make any difference at all?

Either way if this persists after some initial warm-up period, itfurther implies that the page pool is not doing its job properly (or atleast in the way I would have expected). The alloc_iova() call is partof the dma_map_*() overhead, and if the aim is to keep pages mapped thenthat should only be called relatively infrequently. The optimalbehaviour would be to dma_map() new clean pages as they are added to thepool, use dma_sync() when they are claimed and returned by the driver,and only dma_unmap() if they're actually freed back to the pageallocator. And if you're still seeing a lot of dma_map/unmap time afterthat, then the pool itself is churning pages and clearly needs itssize/thresholds tuning.


Robin.




The following quick and dirty [and possibly wrong] .bpf script was used
to try and find the time spent in __alloc_and_insert_iova_range():

kprobe:alloc_iova_fast
{
         @fast = count();
}

kprobe:alloc_iova
{
         @iova_start[tid] = nsecs;
         @iova = count();
}

kretprobe:alloc_iova / @iova_start[tid] /
{
         @alloc_h = hist(nsecs - @iova_start[tid] - @mem_delta[tid]);
         delete(@iova_start[tid]);
         delete(@mem_delta[tid]);
}

kprobe:alloc_iova_mem / @iova_start[tid] /
{
         @mem_start[tid] = nsecs;
}

kretprobe:alloc_iova_mem / @mem_start[tid] /
{
         @mem_delta[tid] = nsecs - @mem_start[tid];
         delete(@mem_start[tid]);
}

kprobe:iova_insert_rbtree / @iova_start[tid] /
{
         @rb_start[tid] = nsecs;
         @rbtree = count();
}

kretprobe:iova_insert_rbtree / @rb_start[tid] /
{
         @insert_h = hist(nsecs - @rb_start[tid]);
         delete(@rb_start[tid]);
}

interval:s:2
{
         print(@fast);
         print(@iova);
         print(@rbtree);
         print(@alloc_h);
         print(@insert_h);
         printf("--------\n");
}

I see the following results.

@fast: 1989223
@iova: 725269
@rbtree: 689306

@alloc_h:
[64, 128)              2 |                                                    |
[128, 256)           118 |                                                    |
[256, 512)           983 |                                                    |
[512, 1K)           3816 |@@                                                  |
[1K, 2K)           10557 |@@@@@@                                              |
[2K, 4K)           19540 |@@@@@@@@@@@@                                        |
[4K, 8K)           31294 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)          38112 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[16K, 32K)         46948 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[32K, 64K)         69728 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[64K, 128K)        83797 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128K, 256K)       84317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)       82962 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512K, 1M)         72751 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[1M, 2M)           49191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[2M, 4M)           26591 |@@@@@@@@@@@@@@@@                                    |
[4M, 8M)           15559 |@@@@@@@@@                                           |
[8M, 16M)          12283 |@@@@@@@                                             |
[16M, 32M)         18266 |@@@@@@@@@@@                                         |
[32M, 64M)         22539 |@@@@@@@@@@@@@                                       |
[64M, 128M)         3005 |@                                                   |
[128M, 256M)          41 |                                                    |
[256M, 512M)           0 |                                                    |
[512M, 1G)             0 |                                                    |
[1G, 2G)               0 |                                                    |
[2G, 4G)             101 |                                                    |

@insert_h:
[128, 256)          2380 |                                                    |
[256, 512)         70043 |@@@@@@@@                                            |
[512, 1K)         431263 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          182804 |@@@@@@@@@@@@@@@@@@@@@@                              |
[2K, 4K)            2742 |                                                    |
[4K, 8K)              43 |                                                    |
[8K, 16K)             25 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           6 |                                                    |

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: the XSK buffer pool needs be to reverted

Reply via email to