On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> Hi Bao-Long,
>
> On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > Hi,
> >
> > I'm not sure if this is a bug, but I've seen an inconsistency in the
> > behavior of DPDK with regards to hugepage allocation for rte_mempool.
> > Basically, for the same mempool size, the number of hugepages allocated
> > changes from run to run.
> >
> > Here's how I reproduce it with DPDK 19.11, IOVA=pa (the default):
> >
> > 1. Reserve 16x1G hugepages on socket 0.
> > 2. Replace examples/skeleton/basicfwd.c with the code below, then build
> >    and run:
> >      make && ./build/basicfwd
> > 3. At the same time, watch the number of hugepages allocated:
> >      watch -n.1 ls /dev/hugepages
> > 4. Repeat step 2.
> >
> > If you can reproduce, you should see that on some runs DPDK allocates 5
> > hugepages, and on others it allocates 6. When it allocates 6, if you
> > watch the output from step 3, you will see that DPDK first tries to
> > allocate 5 hugepages, then unmaps all 5, retries, and gets 6.
>
> I cannot reproduce under the same conditions as yours (with 16 hugepages
> on socket 0), but I think I can see a similar issue:
>
> If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> are used). If I reserve 5 hugepages, it takes more time,
> taking/releasing hugepages several times, and it finally succeeds with 5
> hugepages.
>
> > For our use case, it's important that DPDK allocates the same number of
> > hugepages on every run so we can get reproducible results.
>
> One possibility is to use the --legacy-mem EAL option. It will try to
> reserve all hugepages first.

Passing --socket-mem=5120,0 also does the job.
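
For the record, either option can also be baked into the test program itself
rather than passed on the command line, which may help keep runs reproducible
regardless of how the binary is launched. A minimal sketch only: apart from
the --legacy-mem / --socket-mem strings discussed above, the argument list
and names below are illustrative:

#include <stdio.h>
#include <stdlib.h>

#include <rte_eal.h>

int
main(void)
{
    /*
     * Hand-built EAL arguments (illustrative). --legacy-mem makes EAL
     * reserve the hugepages up front, as suggested above; the commented-out
     * --socket-mem line is the per-socket alternative.
     */
    char *eal_args[] = {
        "basicfwd",
        "--legacy-mem",
        /* "--socket-mem=5120,0", */
    };
    int eal_argc = sizeof(eal_args) / sizeof(eal_args[0]);

    if (rte_eal_init(eal_argc, eal_args) < 0) {
        fprintf(stderr, "Error with EAL initialization\n");
        return EXIT_FAILURE;
    }

    /* ... create the mbuf pool and run the test as before ... */

    rte_eal_cleanup();
    return 0;
}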

> > Studying the code, this seems to be the behavior of
> > rte_mempool_populate_default(). If I understand correctly, if the first
> > try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
> > IOVA-contiguous condition, and eventually winds up with 6 hugepages.
>
> No, I think you don't have the IOVA-contiguous constraint in your
> case. This is what I see:
>
> a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
> c- the total element size (with header) is 2304 + 64 = 2368
> d- in rte_mempool_op_calc_mem_size_helper(), it calculates
>      obj_per_page = 453438 (453438 * 2368 = 1073741184)
>      mem_size = 4966058495
> e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
>      rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
>        mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY,
>        align=64)
>    For some reason, it fails: we can see that the number of mapped
>    hugepages increases in /dev/hugepages, then returns to its original
>    value. I don't think it should fail here.
> f- then, it will try to allocate the biggest available contiguous zone.
>    In my case, it is 1055291776 bytes (almost all of the single mapped
>    hugepage). This is a second problem: if we call it again, it returns
>    NULL, because it won't map another hugepage.
> g- by luck, calling rte_mempool_populate_virt() allocates a small area
>    (the mempool header), which triggers the mapping of a new hugepage
>    that will be used in the next loop, back at step d with a smaller
>    mem_size.
>
> > Questions:
> > 1. Why does the API sometimes fail to get IOVA-contiguous memory when
> >    hugepage memory is abundant?
>
> In my case, it looks like we have a bit less than 1G free at the end of
> the heap, and then we call rte_memzone_reserve_aligned(size=5G). The
> allocator ends up mapping 5 pages (and fails), while only 4 are needed.
>
> Anatoly, do you have any idea? Shouldn't we take into account the amount
> of free space at the end of the heap when expanding?
>
> > 2. Why does the 2nd retry need N+1 hugepages?
>
> When the first alloc fails, the mempool code tries to allocate in
> several chunks which are not virtually contiguous. This is needed in
> case the memory is fragmented.
>
> > Some insights for Q1: From my experiments, it seems like the IOVA of the
> > first hugepage is not guaranteed to be at the start of the IOVA space
> > (understandably). That could explain the retry when the IOVA of the
> > first hugepage is near the end of the IOVA space. But I have also seen
> > situations where the 1st hugepage is near the beginning of the IOVA
> > space and it still failed the 1st time.
> >
> > Here's the code:
> >
> > #include <stdio.h>
> >
> > #include <rte_eal.h>
> > #include <rte_mbuf.h>
> >
> > int
> > main(int argc, char *argv[])
> > {
> >     struct rte_mempool *mbuf_pool;
> >     unsigned mbuf_pool_size = 2097151;
> >
> >     int ret = rte_eal_init(argc, argv);
> >     if (ret < 0)
> >         rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> >
> >     printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> >     mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> >         256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> >
> >     printf("mbuf_pool %p\n", mbuf_pool);
> >
> >     return 0;
> > }
> >
> > Best regards,
> > BL
>
> Regards,
> Olivier
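
As a sanity check on step d above, the obj_per_page and mem_size values can
be reproduced by hand. Below is my own reconstruction of the arithmetic (the
constants come from steps b and c in the trace above; the formula is only my
reading of rte_mempool_op_calc_mem_size_helper() in 19.11, not a copy of the
actual DPDK code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* Values taken from steps b and c above. */
    uint64_t obj_num = 2097151;            /* mbufs requested */
    uint64_t total_elt_sz = 2304 + 64;     /* element + header = 2368 */
    uint64_t pg_sz = UINT64_C(1) << 30;    /* 1G hugepage */

    uint64_t obj_per_page = pg_sz / total_elt_sz;
    uint64_t objs_in_last_page = ((obj_num - 1) % obj_per_page) + 1;

    /* Room for the last (partial) page, plus the full pages before it,
     * plus a worst-case alignment margin of one element. */
    uint64_t mem_size = objs_in_last_page * total_elt_sz;
    mem_size += ((obj_num - objs_in_last_page) / obj_per_page) * pg_sz;
    mem_size += total_elt_sz - 1;

    /* Prints obj_per_page=453438 mem_size=4966058495, matching step d. */
    printf("obj_per_page=%" PRIu64 " mem_size=%" PRIu64 "\n",
           obj_per_page, mem_size);
    return 0;
}

So mem_size is indeed a bit under 5 x 1G, which matches the size passed to
rte_memzone_reserve_aligned() in step e.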