On 7 May 2021, at 10:00, David Hildenbrand wrote:

> On 07.05.21 13:55, Michal Hocko wrote:
>> [I haven't read through the respective patches due to lack of time, but let
>> me comment on the general idea and the underlying justification]
>>
>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>> On 06.05.21 17:26, Zi Yan wrote:
>>>> From: Zi Yan <z...@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset tries to remove the restriction on memory hotplug/hotremove
>>>> granularity, which is always greater than or equal to the memory section
>>>> size[1]. With the patchset, the kernel is able to online/offline memory
>>>> at a size independent of the memory section size, as small as 2MB (the
>>>> subsection size).
>>>
>>> ... which doesn't make any sense, as we can only online/offline whole
>>> memory block devices.
>>
>> Agreed. The subsection thingy is just a hack to work around pmem
>> alignment problems. For real memory hotplug it is quite hard to
>> argue for reasonable hotplug scenarios for very small physical memory
>> ranges wrt. the existing sparsemem memory model.
>>
>>>> The motivation is to increase MAX_ORDER of the buddy allocator and the
>>>> pageblock size without increasing memory hotplug/hotremove granularity
>>>> at the same time,
>>>
>>> Gah, no. Please no. No.
>>
>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>> really arbitrary irrespective of the section size with the vmemmap sparse
>> model. The existing restriction is due to the old sparse model not being
>> able to do page pointer arithmetic across memory sections. Is there any
>> reason to stick with that memory model for an advanced feature you are
>> working on?
No. I just want to increase MAX_ORDER. If the existing restriction can be
removed, that would be great.

>
> I gave it some more thought yesterday. I guess the first thing we should
> look into is increasing MAX_ORDER and leaving pageblock_order and section
> size as is -- finding out what we have to tweak to get that up and running.
> Once we have that in place, we can actually look into better fragmentation
> avoidance etc. One step at a time.

That makes sense to me.

>
> Because that change itself might require some thought. Requiring that a
> bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.

OK. If with SPARSE_VMEMMAP MAX_ORDER can be set bigger than SECTION_SIZE,
that is perfectly fine with me. 1GB THP support, which I ultimately want to
add, will require SPARSE_VMEMMAP anyway (otherwise, every page++ would need
to be changed to nth_page(page, 1); see the first sketch at the end of this
mail).

>
> As stated somewhere here already, we'll have to look into making
> alloc_contig_range() (and main users CMA and virtio-mem) independent of
> MAX_ORDER and mainly rely on pageblock_order. The current handling in
> alloc_contig_range() is far from optimal, as we have to isolate a whole
> MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part
> contains something unmovable, although we don't even want to allocate that
> part. I actually have that on my list (to be able to fully support
> pageblock_order instead of MAX_ORDER - 1 chunks in virtio-mem), however I
> didn't have time to look into it.

So in your mind, for gigantic page allocation (> MAX_ORDER),
alloc_contig_range() should be used instead of the buddy allocator, while
pageblock_order is kept at a small granularity like 2MB (see the second
sketch below). Is that the case? Isn't it going to have a high failure rate
when any of the pageblocks within a gigantic page range (like 1GB) becomes
unmovable? Are you thinking of an additional mechanism/policy to prevent
that from happening, as an additional step for gigantic page allocation?
Like your ZONE_PREFER_MOVABLE idea?

>
> Further, page onlining / offlining code and early init code most probably
> also need care if MAX_ORDER - 1 crosses sections. Memory holes we might
> suddenly have in MAX_ORDER - 1 pages might become a problem and will have
> to be handled. Not sure which other code has to be tweaked (compaction?
> page isolation?).

Can you elaborate a little more? From what I understand, memory holes mean
that valid PFNs are not contiguous before and after a hole, so pfn++ will
not work, but struct pages are still virtually contiguous assuming
SPARSE_VMEMMAP, meaning page++ would still work. So when MAX_ORDER - 1
crosses sections, additional code would be needed instead of a simple pfn++
(see the third sketch below). Is there anything I am missing?

BTW, to test a system with memory holes, do you know of an easy way of
adding random memory holes to an x86_64 VM? That could help reveal
potential missing pieces in the code. Changing the BIOS-e820 table might be
one way, but I have no idea how to do that in QEMU.

>
> Figuring out what needs care itself might take quite some effort.
>
> One thing I was thinking about as well: the bigger our MAX_ORDER, the
> slower it could be to allocate smaller pages. If we have 1G pages,
> splitting them down to 4k then takes 8 additional steps if I'm not wrong.
> Of course, that's the worst case. Would be interesting to evaluate.

Sure. I am planning to check that too. As a simple start, I am going to run
the will-it-scale benchmarks to see if there is any performance difference
between different MAX_ORDERs.
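Below are the sketches I referred to above. They are only meant to make sure
we are talking about the same things; none of this is code from the series.

First, the SPARSE_VMEMMAP point: this is roughly what nth_page() looks like
in include/linux/mm.h today. I am quoting from memory, so the exact form may
differ:

/*
 * Sketch of nth_page() from include/linux/mm.h (from memory).  With
 * classic SPARSEMEM, the struct pages of different sections live in
 * separate memmap chunks, so "page + n" may walk off the end of one
 * section's memmap and the PFN round trip is required.  With
 * SPARSEMEM_VMEMMAP the whole memmap is virtually contiguous, so plain
 * pointer arithmetic (page++, page + n) is fine.
 */
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page, n)       pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page, n)       ((page) + (n))
#endif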
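Second, the gigantic allocation path I have in mind on top of
alloc_contig_pages() / alloc_contig_range(). The wrapper names and GFP flags
here are made up for illustration; as far as I know, hugetlb's gigantic page
allocation does something similar today:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * Hypothetical illustration only: allocate a 1GB chunk through the
 * contiguous range allocator instead of the buddy allocator, so that
 * MAX_ORDER does not have to cover order-18 allocations.
 */
static struct page *alloc_1gb_chunk(int nid)
{
        unsigned long nr_pages = SZ_1G >> PAGE_SHIFT;

        /*
         * alloc_contig_pages() looks for a suitably aligned movable PFN
         * range and calls alloc_contig_range() on it, isolating the
         * covered pageblocks and migrating movable pages out.  A single
         * unmovable page anywhere in the range makes the whole attempt
         * fail -- which is exactly the failure-rate concern above.
         */
        return alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_NOWARN,
                                  nid, NULL);
}

static void free_1gb_chunk(struct page *page)
{
        free_contig_range(page_to_pfn(page), SZ_1G >> PAGE_SHIFT);
}

The point of building on alloc_contig_range() is that only pageblock-sized
isolation is needed, so pageblock_order can stay small; the open question is
how to keep the 1GB ranges free of unmovable pages in the first place.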
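Third, the memory-hole question: this is the kind of PFN walk I mean when I
say pfn++ needs care while struct page arithmetic stays valid with
SPARSE_VMEMMAP. The function is made up for illustration:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/page-flags.h>

/*
 * Illustrative only: walk a MAX_ORDER - 1 sized range that may contain
 * holes.  The PFN walk has to skip PFNs without a usable memmap; the
 * struct page pointer arithmetic itself stays valid with
 * SPARSEMEM_VMEMMAP because the memmap is virtually contiguous, but
 * dereferencing the struct page backing a hole is still not meaningful.
 */
static void walk_max_order_range(unsigned long start_pfn,
                                 unsigned long nr_pages)
{
        unsigned long pfn;
        struct page *page;

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                if (!pfn_valid(pfn))
                        continue;       /* hole: no memmap at all */

                page = pfn_to_page(pfn);
                if (PageReserved(page))
                        continue;       /* e.g. hole within a section */

                /* ... per-page work on 'page' ... */
        }
}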
Thank you for all these valuable inputs. They are very helpful. I
appreciate them.

--
Best Regards,
Yan Zi