On 7 May 2021, at 10:00, David Hildenbrand wrote:

> On 07.05.21 13:55, Michal Hocko wrote:
>> [I haven't read through the respective patches due to lack of time, but let
>> me comment on the general idea and the underlying justification]
>>
>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>> On 06.05.21 17:26, Zi Yan wrote:
>>>> From: Zi Yan <z...@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>>> granularity, which is always greater than or equal to the memory section
>>>> size[1]. With the patchset, the kernel is able to online/offline memory
>>>> at a size independent of the memory section size, as small as 2MB (the
>>>> subsection size).
>>>
>>> ... which doesn't make any sense, as we can only online/offline whole
>>> memory block devices.
>>
>> Agreed. The subsection thingy is just a hack to work around pmem
>> alignment problems. For real memory hotplug it is quite hard to
>> argue for reasonable hotplug scenarios for very small physical memory
>> ranges wrt. the existing sparsemem memory model.
>>
>>>> The motivation is to increase MAX_ORDER of the buddy allocator and the
>>>> pageblock size without increasing memory hotplug/hotremove granularity
>>>> at the same time,
>>>
>>> Gah, no. Please no. No.
>>
>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>> really arbitrary irrespective of the section size with the vmemmap sparse
>> model. The existing restriction is due to the old sparse model not being
>> able to do page pointer arithmetic across memory sections. Is there any
>> reason to stick with that memory model for an advanced feature you are
>> working on?
No. I just want to increase MAX_ORDER. If the existing restriction can be
removed, that would be great.

>
> I gave it some more thought yesterday. I guess the first thing we should
> look into is increasing MAX_ORDER and leaving pageblock_order and section
> size as is -- finding out what we have to tweak to get that up and running.
> Once we have that in place, we can actually look into better fragmentation
> avoidance etc. One step at a time.

That makes sense to me.

>
> Because that change itself might require some thought. Requiring that a
> bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.

OK. If with SPARSE_VMEMMAP MAX_ORDER can be set bigger than SECTION_SIZE,
that is perfectly fine with me. 1GB THP support, which I ultimately want to
add, will require SPARSE_VMEMMAP anyway (otherwise, every page++ would need
to be changed to nth_page(page, 1); see the first sketch at the end of this
mail).

>
> As stated somewhere here already, we'll have to look into making
> alloc_contig_range() (and main users CMA and virtio-mem) independent of
> MAX_ORDER and mainly rely on pageblock_order. The current handling in
> alloc_contig_range() is far from optimal, as we have to isolate a whole
> MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part
> contains something unmovable, although we don't even want to allocate that
> part. I actually have that on my list (to be able to fully support
> pageblock_order instead of MAX_ORDER - 1 chunks in virtio-mem), however I
> didn't have time to look into it.

So in your mind, for gigantic page allocation (> MAX_ORDER),
alloc_contig_range() should be used instead of the buddy allocator, while
pageblock_order is kept at a small granularity like 2MB (see the second
sketch below). Is that the case? Isn't it going to have a high failure rate
when any of the pageblocks within a gigantic page range (like 1GB) becomes
unmovable? Are you thinking of an additional mechanism/policy to prevent
that from happening, as an additional step for gigantic page allocation?
Like your ZONE_PREFER_MOVABLE idea?

>
> Further, page onlining / offlining code and early init code most probably
> also need care if MAX_ORDER - 1 crosses sections. Memory holes we might
> suddenly have in MAX_ORDER - 1 pages might become a problem and will have
> to be handled. Not sure which other code has to be tweaked (compaction?
> page isolation?).

Can you elaborate a little more? From what I understand, memory holes mean
that valid PFNs are not contiguous before and after a hole, so pfn++ will
not work, but struct pages are still virtually contiguous assuming
SPARSE_VMEMMAP, meaning page++ would still work. So when MAX_ORDER - 1
crosses sections, additional code would be needed instead of a simple pfn++
(see the third sketch below). Is there anything I am missing?

BTW, to test a system with memory holes, do you know of an easy way of
adding random memory holes to an x86_64 VM? That could help reveal
potential missing pieces in the code. Changing the BIOS-e820 table might be
one way, but I have no idea how to do that in QEMU.

>
> Figuring out what needs care itself might take quite some effort.
>
> One thing I was thinking about as well: the bigger our MAX_ORDER, the
> slower it could be to allocate smaller pages. If we have 1G pages,
> splitting them down to 4k then takes 8 additional steps if I'm not wrong.
> Of course, that's the worst case. Would be interesting to evaluate.

Sure. I am planning to check that too. As a simple start, I am going to run
the will-it-scale benchmarks to see if there is any performance difference
between different MAX_ORDERs.
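Below are the sketches I referred to above. They are only meant to make sure
we are talking about the same things; none of this is code from the series.

First, the SPARSE_VMEMMAP point: this is roughly what nth_page() looks like
in include/linux/mm.h today. I am quoting from memory, so the exact form may
differ:

/*
 * Sketch of nth_page() from include/linux/mm.h (from memory).  With
 * classic SPARSEMEM, the struct pages of different sections live in
 * separate memmap chunks, so "page + n" may walk off the end of one
 * section's memmap and the PFN round trip is required.  With
 * SPARSEMEM_VMEMMAP the whole memmap is virtually contiguous, so plain
 * pointer arithmetic (page++, page + n) is fine.
 */
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page, n)       pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page, n)       ((page) + (n))
#endif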
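Second, the gigantic allocation path I have in mind on top of
alloc_contig_pages() / alloc_contig_range(). The wrapper names and GFP flags
here are made up for illustration; as far as I know, hugetlb's gigantic page
allocation does something similar today:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * Hypothetical illustration only: allocate a 1GB chunk through the
 * contiguous range allocator instead of the buddy allocator, so that
 * MAX_ORDER does not have to cover order-18 allocations.
 */
static struct page *alloc_1gb_chunk(int nid)
{
        unsigned long nr_pages = SZ_1G >> PAGE_SHIFT;

        /*
         * alloc_contig_pages() looks for a suitably aligned movable PFN
         * range and calls alloc_contig_range() on it, isolating the
         * covered pageblocks and migrating movable pages out.  A single
         * unmovable page anywhere in the range makes the whole attempt
         * fail -- which is exactly the failure-rate concern above.
         */
        return alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_NOWARN,
                                  nid, NULL);
}

static void free_1gb_chunk(struct page *page)
{
        free_contig_range(page_to_pfn(page), SZ_1G >> PAGE_SHIFT);
}

The point of building on alloc_contig_range() is that only pageblock-sized
isolation is needed, so pageblock_order can stay small; the open question is
how to keep the 1GB ranges free of unmovable pages in the first place.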
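Third, the memory-hole question: this is the kind of PFN walk I mean when I
say pfn++ needs care while struct page arithmetic stays valid with
SPARSE_VMEMMAP. The function is made up for illustration:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>

/*
 * Illustrative only: walk a MAX_ORDER - 1 sized range that may contain
 * holes.  The PFN walk has to skip PFNs without a usable memmap; the
 * struct page pointer arithmetic itself stays valid with
 * SPARSEMEM_VMEMMAP because the memmap is virtually contiguous, but
 * dereferencing the struct page backing a hole is still not meaningful.
 */
static void walk_max_order_range(unsigned long start_pfn,
                                 unsigned long nr_pages)
{
        unsigned long pfn;
        struct page *page;

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                if (!pfn_valid(pfn))
                        continue;       /* hole: no memmap at all */

                page = pfn_to_page(pfn);
                if (PageReserved(page))
                        continue;       /* e.g. hole within a section */

                /* ... per-page work on 'page' ... */
        }
}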
Thank you for all these valuable inputs. They are very helpful. I
appreciate them.

--
Best Regards,
Yan Zi