On 20/02/2020 6:38 am, [email protected] wrote:
On 2020-02-17 08:03, Robin Murphy wrote:
On 14/02/2020 11:06 pm, Isaac J. Manjarres wrote:
From: Liam Mark <[email protected]>

Using the best-fit algorithm, instead of the first-fit
algorithm, may reduce fragmentation when allocating
IOVAs.

What kind of pathological allocation patterns make that a serious
problem? Is there any scope for simply changing the order of things in
the callers? Do these drivers also run under other DMA API backends
(e.g. 32-bit Arm)?

The usecases where the IOVA space has been fragmented have non-deterministic allocation patterns, and thus, it's not feasible to change the allocation order to avoid fragmenting
the IOVA space.

What about combining smaller buffers into larger individual allocations; any scope for that sort of thing? Certainly if you're consistently allocating small things less than PAGE_SIZE then DMA pools would be useful to avoid wanton memory wastage in general.

 From what we've observed, the usecases involve allocations of two types of
buffers: one type of buffer between 1 KB to 4 MB in size, and another type of
buffer between 1 KB to 400 MB in size.

The pathological scenarios seem to arise when there are
many (100+) randomly distributed non-power of two allocations, which in some cases leaves
behind holes of up to 100+ MB in the IOVA space.

Here are some examples that show the state of the IOVA space under which failure to
allocate an IOVA was observed:

Instance 1:
     Currently mapped total size : ~1.3GB
     Free space available : ~2GB
     Map for ~162MB fails.
         Max contiguous space available : < 162MB

Instance 2:
     Currently mapped total size : ~950MB
     Free space available : ~2.3GB
     Map for ~320MB fails.
     Max contiguous space available : ~189MB

Instance 3:
     Currently mapped total size : ~1.2GB
     Free space available : ~2.7GB
     Map for ~162MB fails.
     Max contiguous space available : <162MB

We are still in the process of collecting data with the best-fit algorithm enabled
to provide some numbers to show that it results in less IOVA space
fragmentation.

Thanks for those examples, and I'd definitely like to see the comparative figures. To dig a bit further, at the point where things start failing, where are the cached nodes pointing? IIRC there is still a pathological condition where empty space between limit_pfn and cached32_node gets 'lost' if nothing in between is freed, so the bigger the range of allocation sizes, the worse the effect, e.g.:

(considering an empty domain, pfn 0 *not* reserved, 32-bit limit_pfn)

        alloc 4K, succeeds, cached32_node now at 4G-4K
        alloc 2G, succeeds, cached32_node now at 0
        alloc 4K, fails despite almost 2G contiguous free space within limit_pfn
(and max32_alloc_size==1 now fast-forwards *any* further allocation attempt to failure)

If you're falling foul of this case (I was never sure how realistic a problem it would be in practice), there are at least a couple of much less invasive tweaks I can think of that would be worth exploring.

To answer your question about whether if this driver run under other DMA API backends:
yes, such as 32 bit ARM.

OK, that's what I suspected :)

AFAICS arch/arm's __alloc_iova() is also a first-fit algorithm, so if you get better behaviour there then it would suggest that this aspect isn't really the most important issue. Certainly, the fact that the "best fit" logic here also happens to ignore the cached nodes does start drawing me back to the point above.

Robin.
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to