Dear John:
    Thank you for your reply. The case you mentioned is a typical
performance regression issue; even in the worst case there is no need for the
kernel to OOM-kill a random process. In my observations, however, the
iommu_iova slab can consume up to 40G of memory, and the kernel has to kill
my VM process to free memory (64G of memory installed). So I don't think it's
relevant.
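
For what it's worth, this is roughly how I've been watching the cache grow on
the affected hosts - just a quick sketch (slabtop options may differ slightly
with your procps version):

```
# one-shot snapshot sorted by cache size; iommu_iova sits at the top for me
slabtop -o -s c | head -n 20

# or pull the raw numbers straight from /proc/slabinfo
grep -E '^# name|iommu_iova' /proc/slabinfo
```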


John Garry <john.ga...@huawei.com> wrote on Sat, Apr 25, 2020 at 1:50 AM:

> On 24/04/2020 17:30, Robin Murphy wrote:
> > On 2020-04-24 2:20 pm, Bin wrote:
> >> Dear Robin:
> >>       Thank you for your explanation. Now I understand that this could be
> >> the NIC driver's fault, but how can I confirm it? Do I have to debug the
> >> driver myself?
> >
> > I'd start with CONFIG_DMA_API_DEBUG - of course it will chew through
> > memory about an order of magnitude faster than the IOVAs alone, but it
> > should shed some light on whether DMA API usage looks suspicious, and
> > dumping the mappings should help track down the responsible driver(s).
> > Although the debugfs code doesn't show the stacktrace of where each
> > mapping was made, I guess it would be fairly simple to tweak that for a
> > quick way to narrow down where to start looking in an offending driver.
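> >
> > Something along these lines ought to be enough to get going - from memory,
> > so treat it as a sketch; the exact debugfs files vary between kernel
> > versions, and ixgbe below is just an example driver name:
> >
> > ```
> > # boot a kernel built with CONFIG_DMA_API_DEBUG=y; the tracking pool can
> > # be sized on the command line, e.g. dma_debug_entries=4194304
> > mount -t debugfs none /sys/kernel/debug 2>/dev/null
> >
> > # on kernels that have it, this lists every live mapping; counting per
> > # device/driver usually makes the offender obvious
> > awk '{print $1, $2}' /sys/kernel/debug/dma-api/dump | sort | uniq -c | sort -rn | head
> >
> > # error reports can also be limited to a single suspect driver
> > echo ixgbe > /sys/kernel/debug/dma-api/driver_filter
> > ```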
> >
> > Robin.
>
> Just mentioning this in case it's relevant - we found that a long-term
> aging throughput test causes the RB tree to grow very large (and, I
> assume, eat lots of memory):
>
>
> https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leiz...@huawei.com/
>
> John
>
> >
> >> Robin Murphy <robin.mur...@arm.com> wrote on Fri, Apr 24, 2020 at 8:15 PM:
> >>
> >>> On 2020-04-24 1:06 pm, Bin wrote:
> >>>> I'm not familiar with the MMU stuff, so regarding "some driver leaking
> >>>> DMA mappings": is it possible that some other kernel module, like KVM or
> >>>> the NIC driver, causes the leak instead of the iommu module itself?
> >>>
> >>> Yes - I doubt that intel-iommu itself is failing to free IOVAs when it
> >>> should, since I'd expect a lot of people to have noticed that. It's far
> >>> more likely that some driver is failing to call dma_unmap_* when it's
> >>> finished with a buffer - with the IOMMU disabled that would be a no-op
> >>> on x86 with a modern 64-bit-capable device, so such a latent bug could
> >>> have been easily overlooked.
> >>>
> >>> Robin.
> >>>
> >>>> Bin <anole1...@gmail.com> wrote on Fri, Apr 24, 2020 at 20:00:
> >>>>
> >>>>> Well, that's the problem! I'm assuming the iommu kernel module is
> >>>>> leaking memory, but I don't know why or how.
> >>>>>
> >>>>> Do you have any idea about it? Or is any further information needed?
> >>>>>
> >>>>> Robin Murphy <robin.mur...@arm.com> wrote on Fri, Apr 24, 2020 at 19:20:
> >>>>>
> >>>>>> On 2020-04-24 1:40 am, Bin wrote:
> >>>>>>> Hello? Anyone there?
> >>>>>>>
> >>>>>>> Bin <anole1...@gmail.com> wrote on Thu, Apr 23, 2020 at 5:14 PM:
> >>>>>>>
> >>>>>>>> Forgot to mention: I've already disabled slab merging, so this is
> >>>>>>>> what it is.
> >>>>>>>>
> >>>>>>>> Bin <anole1...@gmail.com> wrote on Thu, Apr 23, 2020 at 5:11 PM:
> >>>>>>>>
> >>>>>>>>> Hey, guys:
> >>>>>>>>>
> >>>>>>>>> I'm running a batch of CoreOS boxes; the lsb_release is:
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>> # cat /etc/lsb-release
> >>>>>>>>> DISTRIB_ID="Container Linux by CoreOS"
> >>>>>>>>> DISTRIB_RELEASE=2303.3.0
> >>>>>>>>> DISTRIB_CODENAME="Rhyolite"
> >>>>>>>>> DISTRIB_DESCRIPTION="Container Linux by CoreOS 2303.3.0 (Rhyolite)"
> >>>>>>>>> ```
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>> # uname -a
> >>>>>>>>> Linux cloud-worker-25 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz GenuineIntel GNU/Linux
> >>>>>>>>> ```
> >>>>>>>>> Recently, I found my VMs constantly being killed due to OOM, and
> >>>>>>>>> after digging into the problem, I finally realized that the kernel
> >>>>>>>>> is leaking memory.
> >>>>>>>>>
> >>>>>>>>> Here's my slabinfo:
> >>>>>>>>>
> >>>>>>>>>      Active / Total Objects (% used)    : 83818306 / 84191607 (99.6%)
> >>>>>>>>>      Active / Total Slabs (% used)      : 1336293 / 1336293 (100.0%)
> >>>>>>>>>      Active / Total Caches (% used)     : 152 / 217 (70.0%)
> >>>>>>>>>      Active / Total Size (% used)       : 5828768.08K / 5996848.72K (97.2%)
> >>>>>>>>>      Minimum / Average / Maximum Object : 0.01K / 0.07K / 23.25K
> >>>>>>>>>
> >>>>>>>>>       OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
> >>>>>>>>>
> >>>>>>>>> 80253888 80253888 100%    0.06K 1253967       64   5015868K iommu_iova
> >>>>>>
> >>>>>> Do you really have a peak demand of ~80 million simultaneous DMA
> >>>>>> buffers, or is some driver leaking DMA mappings?
> >>>>>>
> >>>>>> Robin.
> >>>>>>
> >>>>>>>>> 489472 489123  99%    0.03K   3824      128     15296K kmalloc-32
> >>>>>>>>> 297444 271112  91%    0.19K   7082       42     56656K dentry
> >>>>>>>>> 254400 252784  99%    0.06K   3975       64     15900K anon_vma_chain
> >>>>>>>>> 222528  39255  17%    0.50K   6954       32    111264K kmalloc-512
> >>>>>>>>> 202482 201814  99%    0.19K   4821       42     38568K vm_area_struct
> >>>>>>>>> 200192 200192 100%    0.01K    391      512      1564K kmalloc-8
> >>>>>>>>> 170528 169359  99%    0.25K   5329       32     42632K filp
> >>>>>>>>> 158144 153508  97%    0.06K   2471       64      9884K kmalloc-64
> >>>>>>>>> 149914 149365  99%    0.09K   3259       46     13036K anon_vma
> >>>>>>>>> 146640 143123  97%    0.10K   3760       39     15040K buffer_head
> >>>>>>>>> 130368  32791  25%    0.09K   3104       42     12416K kmalloc-96
> >>>>>>>>> 129752 129752 100%    0.07K   2317       56      9268K Acpi-Operand
> >>>>>>>>> 105468 105106  99%    0.04K   1034      102      4136K selinux_inode_security
> >>>>>>>>>      73080  73080 100%    0.13K   2436       30      9744K kernfs_node_cache
> >>>>>>>>>      72360  70261  97%    0.59K   1340       54     42880K inode_cache
> >>>>>>>>>      71040  71040 100%    0.12K   2220       32      8880K eventpoll_epi
> >>>>>>>>>      68096  59262  87%    0.02K    266      256      1064K kmalloc-16
> >>>>>>>>>      53652  53652 100%    0.04K    526      102      2104K pde_opener
> >>>>>>>>>      50496  31654  62%    2.00K   3156       16    100992K kmalloc-2048
> >>>>>>>>>      46242  46242 100%    0.19K   1101       42      8808K cred_jar
> >>>>>>>>>      44496  43013  96%    0.66K    927       48     29664K proc_inode_cache
> >>>>>>>>>      44352  44352 100%    0.06K    693       64      2772K task_delay_info
> >>>>>>>>>      43516  43471  99%    0.69K    946       46     30272K sock_inode_cache
> >>>>>>>>>      37856  27626  72%    1.00K   1183       32     37856K kmalloc-1024
> >>>>>>>>>      36736  36736 100%    0.07K    656       56      2624K eventpoll_pwq
> >>>>>>>>>      34076  31282  91%    0.57K   1217       28     19472K radix_tree_node
> >>>>>>>>>      33660  30528  90%    1.05K   1122       30     35904K ext4_inode_cache
> >>>>>>>>>      32760  30959  94%    0.19K    780       42      6240K kmalloc-192
> >>>>>>>>>      32028  32028 100%    0.04K    314      102      1256K ext4_extent_status
> >>>>>>>>>      30048  30048 100%    0.25K    939       32      7512K skbuff_head_cache
> >>>>>>>>>      28736  28736 100%    0.06K    449       64      1796K fs_cache
> >>>>>>>>>      24702  24702 100%    0.69K    537       46     17184K files_cache
> >>>>>>>>>      23808  23808 100%    0.66K    496       48     15872K ovl_inode
> >>>>>>>>>      23104  22945  99%    0.12K    722       32      2888K kmalloc-128
> >>>>>>>>>      22724  21307  93%    0.69K    494       46     15808K shmem_inode_cache
> >>>>>>>>>      21472  21472 100%    0.12K    671       32      2684K seq_file
> >>>>>>>>>      19904  19904 100%    1.00K    622       32     19904K UNIX
> >>>>>>>>>      17340  17340 100%    1.06K    578       30     18496K mm_struct
> >>>>>>>>>      15980  15980 100%    0.02K     94      170       376K avtab_node
> >>>>>>>>>      14070  14070 100%    1.06K    469       30     15008K signal_cache
> >>>>>>>>>      13248  13248 100%    0.12K    414       32      1656K pid
> >>>>>>>>>      12128  11777  97%    0.25K    379       32      3032K kmalloc-256
> >>>>>>>>>      11008  11008 100%    0.02K     43      256       172K selinux_file_security
> >>>>>>>>>      10812  10812 100%    0.04K    106      102       424K Acpi-Namespace
> >>>>>>>>>
> >>>>>>>>> This information shows that 'iommu_iova' is the top memory consumer.
> >>>>>>>>> In order to optimize the network performance of the OpenStack virtual
> >>>>>>>>> machines, I enabled the VT-d feature in the BIOS and the SR-IOV feature
> >>>>>>>>> of the Intel 82599 10G NIC. I'm assuming this is the root cause of the
> >>>>>>>>> issue.
> >>>>>>>>>
> >>>>>>>>> Is there anything I can do to fix it?
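> >>>>>>>>>
> >>>>>>>>> In case it matters, this is how I checked that the IOMMU is really in
> >>>>>>>>> use on these hosts (just a quick sketch from memory):
> >>>>>>>>>
> >>>>>>>>> ```
> >>>>>>>>> # check whether intel_iommu=on was passed (it may also be on by default)
> >>>>>>>>> cat /proc/cmdline
> >>>>>>>>> # DMAR / IOMMU initialization messages
> >>>>>>>>> dmesg | grep -i -e DMAR -e IOMMU | head
> >>>>>>>>> # one entry per hardware IOMMU unit
> >>>>>>>>> ls /sys/class/iommu/
> >>>>>>>>> ```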
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
