On 1/3/2024 10:43 PM, Christoph Lameter (Ampere) wrote:
> On Thu, 28 Dec 2023, [email protected] wrote:
>
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -626,7 +628,14 @@ struct mm_struct {
>>         unsigned long mmap_compat_legacy_base;
>> #endif
>>         unsigned long task_size;    /* size of task vm space */
>> -        pgd_t * pgd;
>> +#ifndef CONFIG_KERNEL_REPLICATION
>> +        pgd_t *pgd;
>> +#else
>> +        union {
>> +            pgd_t *pgd;
>> +            pgd_t *pgd_numa[MAX_NUMNODES];
>> +        };
>> +#endif
>
>
> Hmmm... This is adding the pgd pointers for all mm_structs. But we only need 
> the numa pgd pointers for the init_mm. Can this be a separate variable? There 
> are some architectures with a larger number of nodes.

Hi, Christoph.

Sorry for the delayed reply.

We already have a per-NUMA-node init_mm, but that is not enough.
We need this array of pointers in mm_struct because the proper (per-NUMA-node) 
pgd has to be used for the threads of a process that spans more than one NUMA 
node.
On x86 we have one translation table per process that contains both the kernel 
and the user space parts. When kernel text and rodata replication is enabled, 
we have to take the per-NUMA-node kernel text and rodata replicas into account 
during context switches and so on. For example, if a particular thread executes 
a system call, we need to use the kernel replica that corresponds to the NUMA 
node the thread is running on. At the same time, the process can occupy several 
NUMA nodes, and the threads running on different NUMA nodes should observe the 
same user space, but different kernel mappings (the per-NUMA-node replicas).
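
To make that concrete: conceptually the context switch path has to pick the pgd
that carries the replica of the node the CPU belongs to. A minimal sketch, with
the helper name invented here; only mm->pgd and mm->pgd_numa[] come from the
patch:

    /* Pick the pgd whose kernel half maps this node's text/rodata replica. */
    static inline pgd_t *mm_node_pgd(struct mm_struct *mm)
    {
    #ifdef CONFIG_KERNEL_REPLICATION
            return mm->pgd_numa[numa_node_id()];
    #else
            return mm->pgd;
    #endif
    }

The user space half of every pgd_numa[] entry is identical; only the kernel
mappings differ between nodes.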

But you are right that this place should be optimized. We do not need this 
array for processes that are not expected to run across NUMA nodes. Possibly we 
need to implement some "lazy" approach to allocating the per-NUMA-node 
translation tables. The current version of kernel replication support is 
implemented in a way that keeps everything as simple as possible.
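
Roughly, the "lazy" variant could look something like the sketch below: only
allocate a node's pgd when a thread of the process first needs it there.
pgd_alloc_replica() and pgd_free_replica() are hypothetical helpers used only
for illustration, not part of the patch set:

    static pgd_t *mm_get_node_pgd(struct mm_struct *mm, int nid)
    {
    #ifdef CONFIG_KERNEL_REPLICATION
            pgd_t *pgd = READ_ONCE(mm->pgd_numa[nid]);

            if (!pgd) {
                    /*
                     * Copy the user space part, point the kernel part at the
                     * text/rodata replica of node 'nid'.
                     */
                    pgd = pgd_alloc_replica(mm, nid);    /* hypothetical */
                    if (cmpxchg(&mm->pgd_numa[nid], NULL, pgd) != NULL) {
                            /* Another thread installed it first. */
                            pgd_free_replica(mm, pgd);   /* hypothetical */
                            pgd = mm->pgd_numa[nid];
                    }
            }
            return pgd;
    #else
            return mm->pgd;
    #endif
    }

That would keep the common single-node case at one pgd per process, at the cost
of an extra check (and a possible allocation) on the first migration to a new
node.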

Thank you!

Best regards,
Artem

