Re: [PATCH v3 09/25] media: i2c: imx258: Add support for running on 2 CSI data lanes

2024-04-05 Thread Luis Garcia
On 4/3/24 12:45, Pavel Machek wrote:
> Hi!
> 
>> +/*
>> + * 4208x3120 @ 30 fps needs 1267Mbps/lane, 4 lanes.
>> + * To avoid further computation of clock settings, adopt the same per
>> + * lane data rate when using 2 lanes, thus allowing a maximum of 15fps.
>> + */
>> +static const struct imx258_reg mipi_1267mbps_19_2mhz_2l[] = {
>> +{ 0x0136, 0x13 },
>> +{ 0x0137, 0x33 },
>> +{ 0x0301, 0x0A },
>> +{ 0x0303, 0x02 },
>> +{ 0x0305, 0x03 },
>> +{ 0x0306, 0x00 },
>> +{ 0x0307, 0xC6 },
>> +{ 0x0309, 0x0A },
>> +{ 0x030B, 0x01 },
>> +{ 0x030D, 0x02 },
>> +{ 0x030E, 0x00 },
>> +{ 0x030F, 0xD8 },
>> +{ 0x0310, 0x00 },
>> +
>> +{ 0x0114, 0x01 },
>> +{ 0x0820, 0x09 },
>> +{ 0x0821, 0xa6 },
>> +{ 0x0822, 0x66 },
>> +{ 0x0823, 0x66 },
>> +};
>> +
>> +static const struct imx258_reg mipi_1267mbps_19_2mhz_4l[] = {
>>  { 0x0136, 0x13 },
>>  { 0x0137, 0x33 },
>>  { 0x0301, 0x05 },
> 
> I wish we did not have to copy all the magic values like this.
> 
> Best regards,
>   Pavel
>   

No kidding, magic values everywhere make it annoying for me to move things
around because they all start to look similar. Later in the series we add
more defined names, so it's not as bad, but it's still bad.
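For reference, the "more defined names" mentioned here look roughly like the
sketch below. IMX258_REG_CSI_LANE_MODE comes from the later macros patch in
this series; the value macro name is only an assumed example for the
{ 0x0114, 0x01 } entry above.

#define IMX258_REG_CSI_LANE_MODE        0x0114
#define IMX258_CSI_2_LANE_MODE          1       /* assumed name: 0x01 selects 2 lanes */

/* the same table entry as above, but self-describing */
static const struct imx258_reg mipi_1267mbps_19_2mhz_2l_named[] = {
        { IMX258_REG_CSI_LANE_MODE, IMX258_CSI_2_LANE_MODE },
};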



Re: [PATCH v3 21/25] drivers: media: i2c: imx258: Use macros

2024-04-05 Thread Luis Garcia
On 4/5/24 08:11, Tommaso Merciai wrote:
> Hi Luis,
> 
> On Fri, Apr 05, 2024 at 04:33:38AM -0600, Luis Garcia wrote:
>> On 4/4/24 00:46, Sakari Ailus wrote:
>>> On Wed, Apr 03, 2024 at 01:17:26PM -0600, Luigi311 wrote:
 On 4/3/24 10:23, Sakari Ailus wrote:
> Hi Luis,
>
> On Wed, Apr 03, 2024 at 09:03:50AM -0600, g...@luigi311.com wrote:
>> From: Luis Garcia 
>>
>> Use understandable macros instead of raw values.
>>
>> Signed-off-by: Ondrej Jirman 
>> Signed-off-by: Luis Garcia 
>> ---
>>  drivers/media/i2c/imx258.c | 434 ++---
>>  1 file changed, 207 insertions(+), 227 deletions(-)
>>
>> diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
>> index e2ecf6109516..30352c33f63c 100644
>> --- a/drivers/media/i2c/imx258.c
>> +++ b/drivers/media/i2c/imx258.c
>> @@ -33,8 +33,6 @@
>>  #define IMX258_VTS_30FPS_VGA0x034c
>>  #define IMX258_VTS_MAX  65525
>>  
>> -#define IMX258_REG_VTS  0x0340
>> -
>>  /* HBLANK control - read only */
>>  #define IMX258_PPL_DEFAULT  5352
>>  
>> @@ -90,6 +88,53 @@
>>  #define IMX258_PIXEL_ARRAY_WIDTH4208U
>>  #define IMX258_PIXEL_ARRAY_HEIGHT   3120U
>>  
>> +/* regs */
>> +#define IMX258_REG_PLL_MULT_DRIV  0x0310
>> +#define IMX258_REG_IVTPXCK_DIV0x0301
>> +#define IMX258_REG_IVTSYCK_DIV0x0303
>> +#define IMX258_REG_PREPLLCK_VT_DIV0x0305
>> +#define IMX258_REG_IOPPXCK_DIV0x0309
>> +#define IMX258_REG_IOPSYCK_DIV0x030b
>> +#define IMX258_REG_PREPLLCK_OP_DIV0x030d
>> +#define IMX258_REG_PHASE_PIX_OUTEN0x3030
>> +#define IMX258_REG_PDPIX_DATA_RATE0x3032
>> +#define IMX258_REG_SCALE_MODE 0x0401
>> +#define IMX258_REG_SCALE_MODE_EXT 0x3038
>> +#define IMX258_REG_AF_WINDOW_MODE 0x7bcd
>> +#define IMX258_REG_FRM_LENGTH_CTL 0x0350
>> +#define IMX258_REG_CSI_LANE_MODE  0x0114
>> +#define IMX258_REG_X_EVN_INC  0x0381
>> +#define IMX258_REG_X_ODD_INC  0x0383
>> +#define IMX258_REG_Y_EVN_INC  0x0385
>> +#define IMX258_REG_Y_ODD_INC  0x0387
>> +#define IMX258_REG_BINNING_MODE   0x0900
>> +#define IMX258_REG_BINNING_TYPE_V 0x0901
>> +#define IMX258_REG_FORCE_FD_SUM   0x300d
>> +#define IMX258_REG_DIG_CROP_X_OFFSET  0x0408
>> +#define IMX258_REG_DIG_CROP_Y_OFFSET  0x040a
>> +#define IMX258_REG_DIG_CROP_IMAGE_WIDTH   0x040c
>> +#define IMX258_REG_DIG_CROP_IMAGE_HEIGHT  0x040e
>> +#define IMX258_REG_SCALE_M0x0404
>> +#define IMX258_REG_X_OUT_SIZE 0x034c
>> +#define IMX258_REG_Y_OUT_SIZE 0x034e
>> +#define IMX258_REG_X_ADD_STA  0x0344
>> +#define IMX258_REG_Y_ADD_STA  0x0346
>> +#define IMX258_REG_X_ADD_END  0x0348
>> +#define IMX258_REG_Y_ADD_END  0x034a
>> +#define IMX258_REG_EXCK_FREQ  0x0136
>> +#define IMX258_REG_CSI_DT_FMT 0x0112
>> +#define IMX258_REG_LINE_LENGTH_PCK0x0342
>> +#define IMX258_REG_SCALE_M_EXT0x303a
>> +#define IMX258_REG_FRM_LENGTH_LINES   0x0340
>> +#define IMX258_REG_FINE_INTEG_TIME0x0200
>> +#define IMX258_REG_PLL_IVT_MPY0x0306
>> +#define IMX258_REG_PLL_IOP_MPY0x030e
>> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_H   0x0820
>> +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_L   0x0822
>> +
>> +#define REG8(a, v) { a, v }
>> +#define REG16(a, v) { a, ((v) >> 8) & 0xff }, { (a) + 1, (v) & 0xff }
>
> The patch is nice but these macros are better replaced by the V4L2 CCI
> helper that also offers register access functions. Could you add a patch to
> convert the driver to use it (maybe after this one)?
>

 Ohh perfect, using something else would be great. I'll go ahead and see
 if I can get that working.
>>>
>>> Thanks. It may be easier to just do it in this one actually. Up to you.
>>>
>>
>> I've made the swap, but it looks like it's not playing nice with my ppp;
>> it's causing a crash and showing a call trace as soon as it does its
>> first read to check the identity. I went in and dropped the cci_read
>> and left it with the original implementation, and I'm getting a very
>> similar crash with cci_write too, so it looks like it's not liking
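For reference, the V4L2 CCI helpers suggested above replace the ad-hoc
REG8()/REG16() macros with CCI_REG8()/CCI_REG16() register definitions plus
regmap-backed accessors. A minimal sketch of the intended usage follows
(register addresses as in the existing driver; everything else is an
illustration, not the actual conversion that crashed here):

#include <media/v4l2-cci.h>

#define IMX258_REG_CHIP_ID      CCI_REG16(0x0016)
#define IMX258_CHIP_ID          0x0258

/* in probe(): imx258->regmap = devm_cci_regmap_init_i2c(client, 16); */

static int imx258_identify_module(struct imx258 *imx258)
{
        u64 val;
        int ret = 0;

        /* cci_read()/cci_write() accumulate the first error into 'ret' */
        cci_read(imx258->regmap, IMX258_REG_CHIP_ID, &val, &ret);
        if (ret)
                return ret;

        return val == IMX258_CHIP_ID ? 0 : -ENODEV;
}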

Re: [PATCH v3 05/25] media: i2c: imx258: Add regulator control

2024-04-05 Thread Luis Garcia
On 4/3/24 12:44, Pavel Machek wrote:
> Hi!
> 
>> The device tree bindings define the relevant regulators for the
>> sensor, so update the driver to request the regulators and control
>> them at the appropriate times.
> 
>> @@ -995,9 +1007,19 @@ static int imx258_power_on(struct device *dev)
>>  struct imx258 *imx258 = to_imx258(sd);
>>  int ret;
>>  
>> +ret = regulator_bulk_enable(IMX258_NUM_SUPPLIES,
>> +imx258->supplies);
>> +if (ret) {
> 
> Will this make it fail for all current users?
> 
> Best regards,
>   Pavel
>   

It shouldn't affect current users, as this was added by Dave for
a completely different sensor and it still works on my ppp. Looking
at the dmesg for imx258, it does reference the regulators that the
ppp doesn't have, but it still works.
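For context, the pattern being discussed is the usual bulk-regulator one
sketched below (supply names follow the imx258 DT bindings; the struct layout
and helper name are assumptions, not copied from the patch). Boards that do
not describe these supplies keep working because the regulator core hands back
dummy regulators for supplies absent from the DT, which is what the imx258
dmesg output on the ppp reflects.

static const char * const imx258_supply_names[] = {
        "vana",         /* analog supply */
        "vdig",         /* digital core supply */
        "vif",          /* interface supply */
};
#define IMX258_NUM_SUPPLIES     ARRAY_SIZE(imx258_supply_names)

static int imx258_get_regulators(struct device *dev, struct imx258 *imx258)
{
        unsigned int i;

        for (i = 0; i < IMX258_NUM_SUPPLIES; i++)
                imx258->supplies[i].supply = imx258_supply_names[i];

        /* imx258->supplies is a struct regulator_bulk_data array */
        return devm_regulator_bulk_get(dev, IMX258_NUM_SUPPLIES,
                                       imx258->supplies);
}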



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-05 Thread Google
On Tue, 2 Apr 2024 22:21:00 -0700
Andrii Nakryiko  wrote:

> On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
>  wrote:
> >
> > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> > >
> > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > >
> > > > OK, for me, this last sentence is preferred for the help message. That
> > > > explains what this is for.
> > > >
> > > > All callbacks that attach to the function tracing have some sort
> > > > of protection against recursion. This option is only to verify that
> > > > ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > > > called outside of RCU, as if they are, it can cause a race.
> > > > But it also has a noticeable overhead when enabled.
> >
> > Sounds good to me, I can add this to the description of the Kconfig option.
> >
> > > >
> > > > BTW, how much overhead does this introduce? And does the race cause
> > > > a kernel crash?
> >
> > I just checked our fleet-wide production data for the last 24 hours.
> > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > everything called from it), rcu_is_watching (both calls, see below)
> > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > prefer to be able to avoid that in production use cases.
> >
> 
> I just ran synthetic microbenchmark testing multi-kretprobe
> throughput. We get (in millions of BPF kretprobe-multi program
> invocations per second):
>   - 5.568M/s as baseline;
>   - 5.679M/s with changes in this patch (+2% throughput improvement);
>   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> 
> It's definitely noticeable.

Thanks for checking the overhead! Hmm, it is considerable.

> > > > or just mess up the ftrace buffer?
> > >
> > > There's a hypothetical race where it can cause a use after free.

Hmm, so it might not lead to a kernel crash, but it is better to enable it
along with other debugging options.

> > >
> > > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> > > which requires RCU to be watching. There's a theoretical case where that
> > > task calls the trampoline and misses the synchronization. Note, these
> > > locations are with preemption disabled, as rcu is always watching when
> > > preemption is enabled. Thus it would be extremely fast where as the
> > > synchronize_rcu_tasks() is rather slow.
> > >
> > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > > have RCU watching.
> > >
> > > I have never heard of this race being hit. I guess it could happen on a VM
> > > where a vCPU gets preempted at the right moment for a long time and the
> > > other CPUs synchronize.
> > >
> > > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > > for production.
> > >
> > > The overhead is another function call within the function tracer. I had
> > > numbers before, but I guess I could run tests again and get new numbers.
> > >
> >
> > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > which seems to be a similar and complementary validation check to the
> > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > in this patch. It feels like both of them should be controlled by the
> > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > guard around rcu_is_watching() check in rethook_try_get() as well?

Hmmm, no, I think it should not change the rethook side, because rethook
can be used with kprobes without ftrace. If we can detect that it is used
from ftrace, we can skip it. (For this reason, I would like to remove the
return probe from kprobes...)

Thank you,

> >
> >
> > > Thanks,
> > >
> > > -- Steve


-- 
Masami Hiramatsu (Google) 
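For reference, the kind of Kconfig-guarded validation discussed above can be
sketched as follows; only the CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING symbol
comes from the patch, the helper name is illustrative.

/* Compile the check out entirely unless the validation option is enabled. */
#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
# define ftrace_warn_rcu_not_watching()                                 \
        WARN_ONCE(!rcu_is_watching(),                                   \
                  "RCU not watching in ftrace/rethook callback")
#else
# define ftrace_warn_rcu_not_watching() do { } while (0)
#endif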



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-05 Thread Google
On Fri, 5 Apr 2024 13:02:30 +0200
Oleg Nesterov  wrote:

> On 04/05, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> > >
> > > I think this expects setjmp/longjmp as below
> > >
> > > foo() { <- retprobe1
> > >   setjmp()
> > >   bar() { <- retprobe2
> > >   longjmp()
> > >   }
> > > } <- return to trampoline
> > >
> > > In this case, we need to skip retprobe2's instance.
> 
> Yes,
> 
> > > My concern is, if we cannot find an appropriate return instance, what
> > > happens?
> > > e.g.
> > >
> > > foo() { <-- retprobe1
> > >bar() { # sp is decremented
> > >sys_uretprobe() <-- ??
> > > }
> > > }
> > >
> > > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > > SIGILL.
> >
> > yes, and I think it's fine, you get the consumer called in wrong place,
> > but it's your fault and kernel won't crash
> 
> Agreed.
> 
> With or without this patch userpace can also do
> 
>   foo() { <-- retprobe1
>   bar() {
>   jump to xol_area
>   }
>   }
> 
> handle_trampoline() will handle retprobe1.

This is OK because the execution path has been changed to the trampoline, but
the case above will continue running bar() after sys_uretprobe().

> 
> > this can be fixed by checking the syscall is called from the trampoline
> > and prevent handle_trampoline call if it's not
> 
> Yes, but I still do not think this makes a lot of sense. But I won't argue.
> 
> And what should sys_uretprobe() do if it is not called from the trampoline?
> I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> 
> I agree very much with Andrii,
> 
>sigreturn()  exists only to allow the implementation of signal 
> handlers.  It should never be
>called directly.  Details of the arguments (if any) passed to 
> sigreturn() vary depending  on
>the architecture.
> 
> this is how sys_uretprobe() should be treated/documented.

OK.

> 
> sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
> and return -EINVAL if this addr is not valid. But why?

Because sigreturn() never returns, but sys_uretprobe() will return.
If it changes the regs->ip and directly returns to the original return address,
I think it is natural to send SIGILL.


Thank you,

-- 
Masami Hiramatsu (Google) 
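The setjmp()/longjmp() scenario discussed above corresponds to user code like
the sketch below; the comments mark where the uretprobes from the ASCII
diagram would sit.

#include <setjmp.h>

static jmp_buf env;

static void bar(void)           /* uretprobe2 installed on bar() */
{
        longjmp(env, 1);        /* skips bar()'s normal return path */
}

static void foo(void)           /* uretprobe1 installed on foo() */
{
        if (!setjmp(env))
                bar();
        /* foo() still returns through the uretprobe trampoline, so the
         * stale return instance left behind for bar() must be skipped. */
}

int main(void)
{
        foo();
        return 0;
}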



Re: [PATCH fs/proc/bootconfig] remove redundant comments from /proc/bootconfig

2024-04-05 Thread Google
On Thu, 4 Apr 2024 21:25:41 -0700
"Paul E. McKenney"  wrote:

> On Fri, Apr 05, 2024 at 11:57:45AM +0900, Masami Hiramatsu wrote:
> > On Fri, 5 Apr 2024 10:23:24 +0900
> > Masami Hiramatsu (Google)  wrote:
> > 
> > > On Thu, 4 Apr 2024 10:43:14 -0700
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > On Thu, Apr 04, 2024 at 08:55:22AM +0900, Masami Hiramatsu wrote:
> > > > > On Wed, 3 Apr 2024 12:16:28 -0700
> > > > > "Paul E. McKenney"  wrote:
> > > > > 
> > > > > > commit 717c7c894d4b ("fs/proc: Add boot loader arguments as comment 
> > > > > > to
> > > > > > /proc/bootconfig") adds bootloader argument comments into 
> > > > > > /proc/bootconfig.
> > > > > > 
> > > > > > /proc/bootconfig shows boot_command_line[] multiple times following
> > > > > > every xbc key value pair, that's duplicated and not necessary.
> > > > > > Remove redundant ones.
> > > > > > 
> > > > > > Output before and after the fix is like:
> > > > > > key1 = value1
> > > > > > *bootloader argument comments*
> > > > > > key2 = value2
> > > > > > *bootloader argument comments*
> > > > > > key3 = value3
> > > > > > *bootloader argument comments*
> > > > > > ...
> > > > > > 
> > > > > > key1 = value1
> > > > > > key2 = value2
> > > > > > key3 = value3
> > > > > > *bootloader argument comments*
> > > > > > ...
> > > > > > 
> > > > > > Fixes: 717c7c894d4b ("fs/proc: Add boot loader arguments as comment 
> > > > > > to /proc/bootconfig")
> > > > > > Signed-off-by: Zhenhua Huang 
> > > > > > Signed-off-by: Paul E. McKenney 
> > > > > > Cc: Masami Hiramatsu 
> > > > > > Cc: 
> > > > > > Cc: 
> > > > > 
> > > > > OOps, good catch! Let me pick it.
> > > > > 
> > > > > Acked-by: Masami Hiramatsu (Google) 
> > > > 
> > > > Thank you, and I have applied your ack and pulled this into its own
> > > > bootconfig.2024.04.04a.
> > > > 
> > > > My guess is that you will push this via your own tree, and so I will
> > > > drop my copy as soon as yours hits -next.
> > > 
> > > Thanks! I would like to make PR this soon as bootconfig fixes for 
> > > v6.9-rc2.
> > 
> > Hmm, I found that this always shows the command line comment in
> > /proc/bootconfig even without the "bootconfig" option.
> > I think that is easier for user tools, but it changes the behavior and
> > is a bit redundant.
> > 
> > We should skip showing this original argument comment if bootconfig is
> > not initialized (no "bootconfig" in the cmdline), as it is now.
> 
> So something like this folded into that patch?

Hm, I expected just checking it in the loop as below.


diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
index e5635a6b127b..98e0780f7e07 100644
--- a/fs/proc/bootconfig.c
+++ b/fs/proc/bootconfig.c
@@ -27,6 +27,7 @@ static int __init copy_xbc_key_value_list(char *dst, size_t size)
 {
struct xbc_node *leaf, *vnode;
char *key, *end = dst + size;
+   bool empty = true;
const char *val;
char q;
int ret = 0;
@@ -62,8 +63,9 @@ static int __init copy_xbc_key_value_list(char *dst, size_t size)
break;
dst += ret;
}
+   empty = false;
}
-   if (ret >= 0 && boot_command_line[0]) {
+   if (!empty && ret >= 0 && boot_command_line[0]) {
		ret = snprintf(dst, rest(dst, end), "# Parameters from bootloader:\n# %s\n",
			       boot_command_line);
if (ret > 0)



The difference is between checking for the "bootconfig" cmdline option and
checking whether the bootconfig is actually empty. So the behaviors differ
when "bootconfig" is specified but there is no bootconfig data.

Another idea is to check whether the cmdline is actually updated by
bootconfig and show the original one only if it is updated.
(I think this fits the purpose of the original patch better.)


diff --git a/fs/proc/bootconfig.c b/fs/proc/bootconfig.c
index e5635a6b127b..95d6a231210c 100644
--- a/fs/proc/bootconfig.c
+++ b/fs/proc/bootconfig.c
@@ -10,6 +10,9 @@
 #include 
 #include 
 
+/* defined in main/init.c */
+bool __init cmdline_has_extra_options(void);
+
 static char *saved_boot_config;
 
 static int boot_config_proc_show(struct seq_file *m, void *v)
@@ -63,7 +66,7 @@ static int __init copy_xbc_key_value_list(char *dst, size_t size)
dst += ret;
}
}
-   if (ret >= 0 && boot_command_line[0]) {
+   if (cmdline_has_extra_options() && ret >= 0 && boot_command_line[0]) {
	ret = snprintf(dst, rest(dst, end), "# Parameters from bootloader:\n# %s\n",
		       boot_command_line);
if (ret > 0)
diff --git a/init/main.c b/init/main.c
index 2ca52474d0c3..881f6230ee59 100644
--- a/init/main.c
+++ b/init/main.c
@@ -487,6 +487,11 @@ static int __init 
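The init/main.c hunk above is cut off in this archive. Based on the
declaration added to fs/proc/bootconfig.c, a plausible implementation would
simply report whether bootconfig appended anything to the command line, along
these lines (a sketch, not the actual hunk):

/* init/main.c -- sketch; extra_command_line/extra_init_args are the existing
 * buffers that bootconfig appends to. */
bool __init cmdline_has_extra_options(void)
{
	return extra_command_line || extra_init_args;
}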

Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-05 Thread Ho-Ren (Jack) Chuang
On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
 wrote:
>
> On Fri,  5 Apr 2024 00:07:06 +
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > The current implementation treats emulated memory devices, such as
> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > (E820_TYPE_RAM). However, these emulated devices have different
> > characteristics than traditional DRAM, making it important to
> > distinguish them. Thus, we modify the tiered memory initialization process
> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > ensures that the memory tier initialization for these nodes is deferred
> > until HMAT information is obtained during the boot process. Finally,
> > demotion tables are recalculated at the end.
> >
> > * late_initcall(memory_tier_late_init);
> > Some device drivers may have initialized memory tiers between
> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > online memory nodes and configuring memory tiers. They should be excluded
> > in the late init.
> >
> > * Handle cases where there is no HMAT when creating memory tiers
> > There is a scenario where a CPUless node does not provide HMAT information.
> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >
> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
> > In the current implementation, iterating through CPUlist nodes requires
> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > trying to acquire the same lock, leading to a potential deadlock.
> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > but also prevents holding a large lock simultaneously.
> >
> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >   default DRAM, late CPUless, and hot-plugged initializations.
> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > handle cases where memtype is not initialized and where HMAT information is
> > available.
> >
> > * Introduce `default_memory_types` for those memory types that are not
> >   initialized by device drivers.
> > Because late initialized memory and default DRAM memory need to be managed,
> > a default memory type is created for storing all memory types that are
> > not initialized by device drivers and as a fallback.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Signed-off-by: Hao Xiang 
> > Reviewed-by: "Huang, Ying" 
>
> Hi - one remaining question. Why can't we delay init for all nodes
> to either drivers or your fallback late_initcall code.
> It would be nice to reduce possible code paths.

I try not to change too much of the existing code structure in
this patchset.

To me, postponing/moving all memory tier registrations to
late_initcall() is another possible action item for the next patchset.

After tier_mem(), hmat_init() is called, which requires registering
`default_dram_type` info. This is when `default_dram_type` is needed.
However, it is indeed possible to postpone the latter part,
set_node_memory_tier(), to `late_initcall()`. So, memory_tier_init() can
indeed be split into two parts, and the latter part can be moved to
late_initcall() to be processed together.

Doing this, all memory-type drivers would have to call late_initcall() to
register a memory tier. I'm not sure how many there are.

What do you guys think?
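A rough sketch of the split being discussed, reusing the helpers from this
patch (mt_find_alloc_memory_type(), set_node_memory_tier(),
establish_demotion_targets()); this only illustrates deferring the per-node
part to late_initcall() and is not a tested change (locking elided):

static int __init memory_tier_early_init(void)
{
	/* Early part: only the default DRAM type, needed before HMAT parsing. */
	default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
						      &default_memory_types);
	return IS_ERR(default_dram_type) ? PTR_ERR(default_dram_type) : 0;
}
subsys_initcall(memory_tier_early_init);

static int __init memory_tier_late_init(void)
{
	int node;

	/* Late part: assign tiers once firmware (HMAT) and drivers have run. */
	for_each_node_state(node, N_MEMORY)
		set_node_memory_tier(node);
	establish_demotion_targets();
	return 0;
}
late_initcall(memory_tier_late_init);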

>
> Jonathan
>
>
> > ---
> >  mm/memory-tiers.c | 94 +++
> >  1 file changed, 70 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > index 516b144fd45a..6632102bd5c9 100644
> > --- a/mm/memory-tiers.c
> > +++ b/mm/memory-tiers.c
>
>
>
> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> >* For now we can have 4 faster memory tiers with smaller adistance
> >* than default DRAM tier.
> >*/
> > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > +   &default_memory_types);
> >   if (IS_ERR(default_dram_type))
> >   panic("%s() failed to allocate default DRAM tier\n", __func__);
> >
> > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
> >* types assigned.
> >*/
> >   for_each_node_state(node, N_MEMORY) {
> > + if (!node_state(node, N_CPU))
> > + /*
> > +  * Defer memory tier initialization on
> > +  * CPUless numa nodes. These will be initialized
> > +  * after firmware and devices are initialized.
>
> Could the comment also say why we can't defer them all?
>
> (In an odd coincidence we have a similar issue for some CPU hotplug
>  

Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread SeongJae Park
Hello Honggyu,

On Fri,  5 Apr 2024 15:08:49 +0900 Honggyu Kim  wrote:

> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
> It says there is no implementation of the demote/promote DAMOS action
> are made.  This RFC is about its implementation for physical address
> space.
> 
> 
> Changes from RFC v2:
>   1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
>   2. Create 'target_nid' to set the migration target node instead of
>  depending on node distance based information.
>   3. Instead of having page level access check in this patch series,
>  delegate the job to a new DAMOS filter type YOUNG[2].
>   4. Introduce vmstat counters "damon_migrate_{hot,cold}".
>   5. Rebase from v6.7 to v6.8.

Thank you for patiently keeping the discussion going and making this great
version!  I left comments on each patch, but found no special concerns.  The
per-page access recheck for MIGRATE_HOT and the vmstat change are catching my
eye, though.  I doubt whether those are really needed.  It would be nice if
you could answer the comments.

Once my comments on this version are addressed, I would have no reason to
object to dropping the RFC tag from this patchset.

Nonetheless, I see some warnings and errors from checkpatch.pl.  I don't
really care about those for RFC patches, so no problem at all.  But if you
agree with my opinion about dropping the RFC tag, and therefore will send the
next version without the RFC tag, please make sure you also run checkpatch.pl
before posting.


Thanks,
SJ

[...]



Re: [RFC PATCH v3 7/7] mm/damon: Add "damon_migrate_{hot,cold}" vmstat

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:56 +0900 Honggyu Kim  wrote:

> This patch adds "damon_migrate_{hot,cold}" under node specific vmstat
> counters at the following location.
> 
>   /sys/devices/system/node/node*/vmstat
> 
> The counted values are accumulated into the global vmstat, so it also
> introduces the same counter at /proc/vmstat as well.

DAMON provides its own DAMOS stats via DAMON sysfs interface.  Do we really
need this change?


Thanks,
SJ

[...]



Re: [RFC PATCH v3 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:55 +0900 Honggyu Kim  wrote:

> From: Hyeongtak Ji 
> 
> This patch introduces DAMOS_MIGRATE_HOT action, which is similar to
> DAMOS_MIGRATE_COLD, but it is targeted to migrate hot pages.

My understanding of our last discussion was that 'HOT/COLD' here is only for
the prioritization score function.  If I'm not wrong, this is not for
targeting, but just prioritizes migrating hot pages first under the quota.

> 
> It migrates pages inside the given region to the 'target_nid' NUMA node
> in the sysfs.
> 
> Here is one of the example usage of this 'migrate_hot' action.
> 
>   $ cd /sys/kernel/mm/damon/admin/kdamonds/
>   $ cat contexts//schemes//action
>   migrate_hot
>   $ echo 0 > contexts//schemes//target_nid
>   $ echo commit > state
>   $ numactl -p 2 ./hot_cold 500M 600M &
>   $ numastat -c -p hot_cold
> 
>   Per-node process memory usage (in MBs)
>   PID Node 0 Node 1 Node 2 Total
>   --  -- -- -- -
>   701 (hot_cold) 501  0601  1101
> 
> Signed-off-by: Hyeongtak Ji 
> Signed-off-by: Honggyu Kim 
> ---
>  include/linux/damon.h|  2 ++
>  mm/damon/paddr.c | 12 ++--
>  mm/damon/sysfs-schemes.c |  4 +++-
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index df8671e69a70..934c95a7c042 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -105,6 +105,7 @@ struct damon_target {
>   * @DAMOS_NOHUGEPAGE:Call ``madvise()`` for the region with 
> MADV_NOHUGEPAGE.
>   * @DAMOS_LRU_PRIO:  Prioritize the region on its LRU lists.
>   * @DAMOS_LRU_DEPRIO:Deprioritize the region on its LRU lists.
> + * @DAMOS_MIGRATE_HOT:  Migrate for the given hot region.

As commented on the previous patch, this could be bit re-phrased.
Also, let's use tabs consistently.

>   * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
>   * @DAMOS_STAT:  Do nothing but count the stat.
>   * @NR_DAMOS_ACTIONS:Total number of DAMOS actions
> @@ -123,6 +124,7 @@ enum damos_action {
>   DAMOS_NOHUGEPAGE,
>   DAMOS_LRU_PRIO,
>   DAMOS_LRU_DEPRIO,
> + DAMOS_MIGRATE_HOT,
>   DAMOS_MIGRATE_COLD,
>   DAMOS_STAT, /* Do nothing but only record the stat */
>   NR_DAMOS_ACTIONS,
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index fe217a26f788..fd9d35b5cc83 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -229,6 +229,7 @@ static bool damos_pa_filter_out(struct damos *scheme, 
> struct folio *folio)
>  
>  enum migration_mode {
>   MIG_PAGEOUT,
> + MIG_MIGRATE_HOT,
>   MIG_MIGRATE_COLD,
>  };

It looks like we don't really need both MIG_MIGRATE_HOT and MIG_MIGRATE_COLD,
but just one, say, MIG_MIGRATE, since the code can know which prioritization
score function to use from the DAMOS action?

Also, as I commented on the previous one, I'd prefer having DAMOS_ prefix.

>  
> @@ -375,8 +376,10 @@ static unsigned long damon_pa_migrate(struct 
> damon_region *r, struct damos *s,
>   if (damos_pa_filter_out(s, folio))
>   goto put_folio;
>  
> - folio_clear_referenced(folio);
> - folio_test_clear_young(folio);
> + if (mm != MIG_MIGRATE_HOT) {
> + folio_clear_referenced(folio);
> + folio_test_clear_young(folio);
> + }

We agreed to do this check via the 'young' page type DAMOS filter, and to let
this code not care about it, right?  If I'm not wrong, I think this should be
removed?

>   if (!folio_isolate_lru(folio))
>   goto put_folio;
>   /*
> @@ -394,6 +397,7 @@ static unsigned long damon_pa_migrate(struct damon_region 
> *r, struct damos *s,
>   case MIG_PAGEOUT:
>   applied = reclaim_pages(&folio_list);
>   break;
> + case MIG_MIGRATE_HOT:
>   case MIG_MIGRATE_COLD:
>   applied = damon_pa_migrate_pages(&folio_list, mm,
>s->target_nid);
> @@ -454,6 +458,8 @@ static unsigned long damon_pa_apply_scheme(struct 
> damon_ctx *ctx,
>   return damon_pa_mark_accessed(r, scheme);
>   case DAMOS_LRU_DEPRIO:
>   return damon_pa_deactivate_pages(r, scheme);
> + case DAMOS_MIGRATE_HOT:
> + return damon_pa_migrate(r, scheme, MIG_MIGRATE_HOT);
>   case DAMOS_MIGRATE_COLD:
>   return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
>   case DAMOS_STAT:
> @@ -476,6 +482,8 @@ static int damon_pa_scheme_score(struct damon_ctx 
> *context,
>   return damon_hot_score(context, r, scheme);
>   case DAMOS_LRU_DEPRIO:
>   return damon_cold_score(context, r, scheme);
> + case DAMOS_MIGRATE_HOT:
> + return damon_hot_score(context, r, scheme);
>   case DAMOS_MIGRATE_COLD:
>   return damon_cold_score(context, r, scheme);
>   default:
> diff 

Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 16:55:57 +0900 Hyeongtak Ji  wrote:

> On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  wrote:
> 
> ...snip...
> 
> > +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> > +   enum migration_mode mm,
> > +   int target_nid)
> > +{
> > +   int nid;
> > +   unsigned int nr_migrated = 0;
> > +   LIST_HEAD(node_folio_list);
> > +   unsigned int noreclaim_flag;
> > +
> > +   if (list_empty(folio_list))
> > +   return nr_migrated;
> 
> How about checking if `target_nid` is `NUMA_NO_NODE` or not earlier,
> 
> > +
> > +   noreclaim_flag = memalloc_noreclaim_save();
> > +
> > +   nid = folio_nid(lru_to_folio(folio_list));
> > +   do {
> > +   struct folio *folio = lru_to_folio(folio_list);
> > +
> > +   if (nid == folio_nid(folio)) {
> > +   folio_clear_active(folio);
> > +   list_move(&folio->lru, &node_folio_list);
> > +   continue;
> > +   }
> > +
> > +   nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > +  NODE_DATA(nid), mm,
> > +  target_nid);
> > +   nid = folio_nid(lru_to_folio(folio_list));
> > +   } while (!list_empty(folio_list));
> > +
> > +   nr_migrated += damon_pa_migrate_folio_list(&node_folio_list,
> > +  NODE_DATA(nid), mm,
> > +  target_nid);
> > +
> > +   memalloc_noreclaim_restore(noreclaim_flag);
> > +
> > +   return nr_migrated;
> > +}
> > +
> 
> ...snip...
> 
> > +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> > +  struct pglist_data *pgdat,
> > +  int target_nid)
> > +{
> > +   unsigned int nr_succeeded;
> > +   nodemask_t allowed_mask = NODE_MASK_NONE;
> > +
> > +   struct migration_target_control mtc = {
> > +   /*
> > +* Allocate from 'node', or fail quickly and quietly.
> > +* When this happens, 'page' will likely just be discarded
> > +* instead of migrated.
> > +*/
> > +   .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > +   __GFP_NOMEMALLOC | GFP_NOWAIT,
> > +   .nid = target_nid,
> > +   .nmask = &allowed_mask
> > +   };
> > +
> > +   if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> > +   return 0;
> 
> instead of here.

Agree.  As I replied on the previous reply, I think this check can be done from
the caller (or the caller of the caller) of this function.

> 
> > +
> > +   if (list_empty(migrate_folios))
> > +   return 0;

Same for this.

> > +
> > +   /* Migration ignores all cpuset and mempolicy settings */
> > +   migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> > + (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> > + &nr_succeeded);
> > +
> > +   return nr_succeeded;
> > +}
> > +
> 
> ...snip...
> 
> Kind regards,
> Hyeongtak
> 

Thanks,
SJ



Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  wrote:

> This patch introduces DAMOS_MIGRATE_COLD action, which is similar to
> DAMOS_PAGEOUT, but migrate folios to the given 'target_nid' in the sysfs
> instead of swapping them out.
> 
> The 'target_nid' sysfs knob is created by this patch to inform the
> migration target node ID.

Isn't it created by the previous patch?

> 
> Here is one of the example usage of this 'migrate_cold' action.
> 
>   $ cd /sys/kernel/mm/damon/admin/kdamonds/
>   $ cat contexts//schemes//action
>   migrate_cold
>   $ echo 2 > contexts//schemes//target_nid
>   $ echo commit > state
>   $ numactl -p 0 ./hot_cold 500M 600M &
>   $ numastat -c -p hot_cold
> 
>   Per-node process memory usage (in MBs)
>   PID Node 0 Node 1 Node 2 Total
>   --  -- -- -- -
>   701 (hot_cold) 501  0601  1101
> 
> Since there are some common routines with pageout, many functions have
> similar logics between pageout and migrate cold.
> 
> damon_pa_migrate_folio_list() is a minimized version of
> shrink_folio_list(), but it's minified only for demotion.

MIGRATE_COLD is not only for demotion, right?  I think the last two words would
be better removed to avoid unnecessary confusion.

> 
> Signed-off-by: Honggyu Kim 
> Signed-off-by: Hyeongtak Ji 
> ---
>  include/linux/damon.h|   2 +
>  mm/damon/paddr.c | 146 ++-
>  mm/damon/sysfs-schemes.c |   4 ++
>  3 files changed, 151 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 24ea33a03d5d..df8671e69a70 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -105,6 +105,7 @@ struct damon_target {
>   * @DAMOS_NOHUGEPAGE:Call ``madvise()`` for the region with 
> MADV_NOHUGEPAGE.
>   * @DAMOS_LRU_PRIO:  Prioritize the region on its LRU lists.
>   * @DAMOS_LRU_DEPRIO:Deprioritize the region on its LRU lists.
> + * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.

Whether it will be for a cold region or not depends on the target access
pattern.  What about 'Migrate the regions in a coldest-regions-first manner.'?
Or, simply 'Migrate the regions (prioritize cold)' here, and explain the
prioritization under quota in the detailed comments part?

Also, let's use tab consistently.

>   * @DAMOS_STAT:  Do nothing but count the stat.
>   * @NR_DAMOS_ACTIONS:Total number of DAMOS actions
>   *
> @@ -122,6 +123,7 @@ enum damos_action {
>   DAMOS_NOHUGEPAGE,
>   DAMOS_LRU_PRIO,
>   DAMOS_LRU_DEPRIO,
> + DAMOS_MIGRATE_COLD,
>   DAMOS_STAT, /* Do nothing but only record the stat */
>   NR_DAMOS_ACTIONS,
>  };
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 277a1c4d833c..fe217a26f788 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -12,6 +12,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> +#include 
>  
>  #include "../internal.h"
>  #include "ops-common.h"
> @@ -226,8 +229,137 @@ static bool damos_pa_filter_out(struct damos *scheme, 
> struct folio *folio)
>  
>  enum migration_mode {
>   MIG_PAGEOUT,
> + MIG_MIGRATE_COLD,
>  };
>  
> +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> +struct pglist_data *pgdat,
> +int target_nid)

To avoid name collisions, I'd prefer having a damon_pa_ prefix.  I see this
patch is defining damon_pa_migrate_folio_list() below, though.  What about
__damon_pa_migrate_folio_list()?

> +{
> + unsigned int nr_succeeded;
> + nodemask_t allowed_mask = NODE_MASK_NONE;
> +

I personally prefer not having empty lines in the middle of variable
declarations/definitions.  Could we remove this empty line?

> + struct migration_target_control mtc = {
> + /*
> +  * Allocate from 'node', or fail quickly and quietly.
> +  * When this happens, 'page' will likely just be discarded
> +  * instead of migrated.
> +  */
> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> + __GFP_NOMEMALLOC | GFP_NOWAIT,
> + .nid = target_nid,
> + .nmask = &allowed_mask
> + };
> +
> + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> + return 0;
> +
> + if (list_empty(migrate_folios))
> + return 0;

Can't these checks be done by the caller?

> +
> + /* Migration ignores all cpuset and mempolicy settings */
> + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> +   (unsigned long)&mtc, MIGRATE_ASYNC, MR_DAMON,
> +   &nr_succeeded);
> +
> + return nr_succeeded;
> +}
> +
> +static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
> + struct pglist_data *pgdat,
> +

Re: [RFC PATCH v3 4/7] mm/migrate: add MR_DAMON to migrate_reason

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:53 +0900 Honggyu Kim  wrote:

> The current patch series introduces DAMON based migration across NUMA
> nodes so it'd be better to have a new migrate_reason in trace events.
> 
> Signed-off-by: Honggyu Kim 

Reviewed-by: SeongJae Park 


Thanks,
SJ

> ---
>  include/linux/migrate_mode.h   | 1 +
>  include/trace/events/migrate.h | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index f37cc03f9369..cec36b7e7ced 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -29,6 +29,7 @@ enum migrate_reason {
>   MR_CONTIG_RANGE,
>   MR_LONGTERM_PIN,
>   MR_DEMOTION,
> + MR_DAMON,
>   MR_TYPES
>  };
>  
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 0190ef725b43..cd01dd7b3640 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -22,7 +22,8 @@
>   EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
>   EM( MR_CONTIG_RANGE,"contig_range") \
>   EM( MR_LONGTERM_PIN,"longterm_pin") \
> - EMe(MR_DEMOTION,"demotion")
> + EM( MR_DEMOTION,"demotion") \
> + EMe(MR_DAMON,   "damon")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> -- 
> 2.34.1
> 
> 



Re: [RFC PATCH v3 2/7] mm: make alloc_demote_folio externally invokable for migration

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:51 +0900 Honggyu Kim  wrote:

> The alloc_demote_folio can be used out of vmscan.c so it'd be better to
> remove static keyword from it.
> 
> This function can also be used for both demotion and promotion so it'd
> be better to rename it from alloc_demote_folio to alloc_migrate_folio.

I'm not sure if renaming is really needed, but I have no strong opinion.

> 
> Signed-off-by: Honggyu Kim 

I have one more trivial comment below, but finds no blocker for me.

Reviewed-by: SeongJae Park 

> ---
>  mm/internal.h |  1 +
>  mm/vmscan.c   | 10 +++---
>  2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index f309a010d50f..c96ff9bc82d0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -866,6 +866,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct 
> file *, unsigned long,
>  unsigned long, unsigned long);
>  
>  extern void set_pageblock_order(void);
> +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
>  unsigned long reclaim_pages(struct list_head *folio_list);
>  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   struct list_head *folio_list);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..9e456cac03b4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -910,8 +910,7 @@ static void folio_check_dirty_writeback(struct folio 
> *folio,
>   mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>  }
>  
> -static struct folio *alloc_demote_folio(struct folio *src,
> - unsigned long private)
> +struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
>  {
>   struct folio *dst;
>   nodemask_t *allowed_mask;
> @@ -935,6 +934,11 @@ static struct folio *alloc_demote_folio(struct folio 
> *src,
>   if (dst)
>   return dst;
>  
> + /*
> +  * Allocation failed from the target node so try to allocate from
> +  * fallback nodes based on allowed_mask.
> +  * See fallback_alloc() at mm/slab.c.
> +  */

I think this might be better as a separate cleanup patch, but given its small
size, I have no strong opinion.

>   mtc->gfp_mask &= ~__GFP_THISNODE;
>   mtc->nmask = allowed_mask;
>  
> @@ -973,7 +977,7 @@ static unsigned int demote_folio_list(struct list_head 
> *demote_folios,
>   node_get_allowed_targets(pgdat, &allowed_mask);
>  
>   /* Demotion ignores all cpuset and mempolicy settings */
> - migrate_pages(demote_folios, alloc_demote_folio, NULL,
> + migrate_pages(demote_folios, alloc_migrate_folio, NULL,
> (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> &nr_succeeded);
>  
> -- 
> 2.34.1


Thanks,
SJ



Re: [RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

2024-04-05 Thread SeongJae Park
On Fri,  5 Apr 2024 15:08:50 +0900 Honggyu Kim  wrote:

> This is a preparation patch that introduces migration modes.
> 
> The damon_pa_pageout is renamed to damon_pa_migrate and it receives an
> extra argument for migration_mode.

I personally think keeping damon_pa_pageout() as is and adding a new function
(damon_pa_migrate()) with some duplicated code is also ok, but this approach
also looks fine to me.  So I have no strong opinion here, but I'm just letting
you know I would have no objection to either approach.

> 
> No functional changes applied.
> 
> Signed-off-by: Honggyu Kim 
> ---
>  mm/damon/paddr.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
> index 081e2a325778..277a1c4d833c 100644
> --- a/mm/damon/paddr.c
> +++ b/mm/damon/paddr.c
> @@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, 
> struct folio *folio)
>   return false;
>  }
>  
> -static unsigned long damon_pa_pageout(struct damon_region *r, struct damos 
> *s)
> +enum migration_mode {
> + MIG_PAGEOUT,
> +};

To avoid name conflicts, what about renaming to 'damos_migration_mode' and
'DAMOS_MIG_PAGEOUT'?

> +
> +static unsigned long damon_pa_migrate(struct damon_region *r, struct damos 
> *s,
> +   enum migration_mode mm)

My poor brain got a bit confused by the name.  What about calling it 'mode'?

>  {
>   unsigned long addr, applied;
>   LIST_HEAD(folio_list);
> @@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct 
> damon_region *r, struct damos *s)
>  put_folio:
>   folio_put(folio);
>   }
> - applied = reclaim_pages(&folio_list);
> + switch (mm) {
> + case MIG_PAGEOUT:
> + applied = reclaim_pages(&folio_list);
> + break;
> + default:
> + /* Unexpected migration mode. */
> + return 0;
> + }
>   cond_resched();
>   return applied * PAGE_SIZE;
>  }
> @@ -297,7 +309,7 @@ static unsigned long damon_pa_apply_scheme(struct 
> damon_ctx *ctx,
>  {
>   switch (scheme->action) {
>   case DAMOS_PAGEOUT:
> - return damon_pa_pageout(r, scheme);
> + return damon_pa_migrate(r, scheme, MIG_PAGEOUT);
>   case DAMOS_LRU_PRIO:
>   return damon_pa_mark_accessed(r, scheme);
>   case DAMOS_LRU_DEPRIO:
> -- 
> 2.34.1


Thanks,
SJ
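Putting the two naming suggestions above together would give something like
the following (illustrative only):

enum damos_migration_mode {
	DAMOS_MIG_PAGEOUT,
};

static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
				      enum damos_migration_mode mode);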



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread Gregory Price
On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> posted at [1].
> 
>   1. YCSB zipfian distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+================================================+=========
>                |       cold memory occupied by mmap and memset  |
>                |   0G  440G  450G  460G  470G  480G  490G  500G |
>   =============+================================================+=========
>   Execution time normalized to DRAM-only values                 | GEOMEAN
>   -------------+------------------------------------------------+---------
>   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
>   CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
>   default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17
>   DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05
>   =============+================================================+=========
>   CXL usage of redis-server in GB                               | AVERAGE
>   -------------+------------------------------------------------+---------
>   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
>   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
>   default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
>   DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
>   =============+================================================+=========
> 
> Each test result is based on the execution environment as follows.
> 
>   DRAM-only   : redis-server uses only local DRAM memory.
>   CXL-only: redis-server uses only CXL memory.
>   default : default memory policy(MPOL_DEFAULT).
> numa balancing disabled.
>   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> DAMOS_MIGRATE_HOT for CXL nodes.
> 
> The above result shows that the "default" execution time goes up as the size
> of cold memory is increased from 440G to 500G, because the more cold
> memory is used, the more CXL memory is used for the target redis workload,
> and this makes the execution time increase.
> 
> However, the "DAMON tiered" result shows less slowdown because the
> DAMOS_MIGRATE_COLD action at the DRAM node proactively demotes pre-allocated
> cold memory to the CXL node, and this freed space in DRAM increases the
> chance of allocating hot or warm pages of redis-server to the fast DRAM node.
> Moreover, the DAMOS_MIGRATE_HOT action at the CXL node also actively promotes
> hot pages of redis-server to the DRAM node.
> 
> As a result, more memory of redis-server stays in the DRAM node compared to
> the "default" memory policy, and this yields the performance improvement.
> 
> The following result of latest distribution workload shows similar data.
> 
>   2. YCSB latest distribution read only workload
>   memory pressure with cold memory on node0 with 512GB of local DRAM.
>   =============+================================================+=========
>                |       cold memory occupied by mmap and memset  |
>                |   0G  440G  450G  460G  470G  480G  490G  500G |
>   =============+================================================+=========
>   Execution time normalized to DRAM-only values                 | GEOMEAN
>   -------------+------------------------------------------------+---------
>   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
>   CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
>   default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18
>   DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04
>   =============+================================================+=========
>   CXL usage of redis-server in GB                               | AVERAGE
>   -------------+------------------------------------------------+---------
>   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
>   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
>   default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
>   DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
>   =============+================================================+=========
> 
> In summary of both results, our evaluation shows that "DAMON tiered"
> memory management reduces the performance slowdown compared to the
> "default" memory policy from 17~18% to 4~5% when the system runs with
> high memory pressure on its fast tier DRAM nodes.
> 
> Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> tiered memory systems run more efficiently under high memory pressures.
> 

Hi,

It's hard to determine from your results whether the performance
mitigation is being caused primarily by MIGRATE_COLD freeing up space
for new allocations, or from some combination 

[PATCH] ftrace: riscv: move from REGS to ARGS

2024-04-05 Thread Puranjay Mohan
This commit replaces riscv's support for FTRACE_WITH_REGS with support
for FTRACE_WITH_ARGS. This is required for the ongoing effort to stop
relying on stop_machine() for RISCV's implementation of ftrace.

The main relevant benefit that this change will bring for the above
use-case is that now we don't have separate ftrace_caller and
ftrace_regs_caller trampolines. This will allow the callsite to call
ftrace_caller by modifying a single instruction. Now the callsite can
do something similar to:

When not tracing:| When tracing:

func:  func:
  auipc t0, ftrace_caller_topauipc t0, ftrace_caller_top
  nop  <==>  jalr  t0, ftrace_caller_bottom
  [...]  [...]

The above assumes that we are dropping support for calling a direct
trampoline from the callsite. We need to drop this because the callsite can't
change the target address to call; it can only enable/disable a call to
a preset target (ftrace_caller in the above diagram). We can later optimize
this by calling an intermediate dispatcher trampoline before ftrace_caller.

Currently, ftrace_regs_caller saves all CPU registers in the format of
struct pt_regs and allows the tracer to modify them. We don't need to
save all of the CPU registers because at function entry only a subset of
pt_regs is live:

|----------+----------+----------------------------------------------|
| Register | ABI Name | Description                                  |
|----------+----------+----------------------------------------------|
| x1       | ra       | Return address for traced function           |
| x2       | sp       | Stack pointer                                |
| x5       | t0       | Return address for ftrace_caller trampoline  |
| x8       | s0/fp    | Frame pointer                                |
| x10-11   | a0-1     | Function arguments/return values             |
| x12-17   | a2-7     | Function arguments                           |
|----------+----------+----------------------------------------------|

See RISCV calling convention[1] for the above table.

Saving just the live registers decreases the amount of stack space
required from 288 Bytes to 112 Bytes.

Basic testing was done with this on the VisionFive 2 development board.

Note:
  - Moving from REGS to ARGS will mean that RISCV will stop supporting
    KPROBES_ON_FTRACE as it requires full pt_regs to be saved.
  - KPROBES_ON_FTRACE will be supplanted by FPROBES; see [2].

[1] https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
[2] 
https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/

Signed-off-by: Puranjay Mohan 
---
 arch/riscv/Kconfig  |   2 +-
 arch/riscv/include/asm/ftrace.h |  76 --
 arch/riscv/kernel/asm-offsets.c |  18 
 arch/riscv/kernel/ftrace.c  |  18 +---
 arch/riscv/kernel/mcount-dyn.S  | 171 ++--
 include/linux/ftrace.h  |   3 +-
 6 files changed, 145 insertions(+), 143 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index be09c8836d56..da637fde2137 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -126,7 +126,7 @@ config RISCV
select HAVE_DMA_CONTIGUOUS if MMU
	select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && MMU && (CLANG_SUPPORTS_DYNAMIC_FTRACE || GCC_SUPPORTS_DYNAMIC_FTRACE)
select HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
-   select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
+   select HAVE_DYNAMIC_FTRACE_WITH_ARGS if HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER
diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
index 1276d7d9ca8b..9eb31a7ea0aa 100644
--- a/arch/riscv/include/asm/ftrace.h
+++ b/arch/riscv/include/asm/ftrace.h
@@ -124,20 +124,82 @@ struct dyn_ftrace;
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
 #define ftrace_init_nop ftrace_init_nop
 
-#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
+#define arch_ftrace_get_regs(regs) NULL
 struct ftrace_ops;
-struct ftrace_regs;
+struct ftrace_regs {
+   unsigned long epc;
+   unsigned long ra;
+   unsigned long sp;
+   unsigned long s0;
+   unsigned long t1;
+   union {
+   unsigned long args[8];
+   struct {
+   unsigned long a0;
+   unsigned long a1;
+   unsigned long a2;
+   unsigned long a3;
+   unsigned long a4;
+   unsigned long a5;
+   unsigned long a6;
+   unsigned long a7;
+   };
+   };
+};
+
+static __always_inline unsigned long ftrace_regs_get_instruction_pointer(const 
struct ftrace_regs
+  

Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-05 Thread Jonathan Cameron
On Fri,  5 Apr 2024 00:07:06 +
"Ho-Ren (Jack) Chuang"  wrote:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
> 
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
> 
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
> 
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUlist nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids deadlock
> but also prevents holding a large lock simultaneously.
> 
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
> 
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late initialized memory and default DRAM memory need to be managed,
> a default memory type is created for storing all memory types that are
> not initialized by device drivers and as a fallback.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> Reviewed-by: "Huang, Ying" 

Hi - one remaining question. Why can't we delay init for all nodes
to either drivers or your fallback late_initcall code.
It would be nice to reduce possible code paths.

Jonathan


> ---
>  mm/memory-tiers.c | 94 +++
>  1 file changed, 70 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 516b144fd45a..6632102bd5c9 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c



> @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>* For now we can have 4 faster memory tiers with smaller adistance
>* than default DRAM tier.
>*/
> - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> +   _memory_types);
>   if (IS_ERR(default_dram_type))
>   panic("%s() failed to allocate default DRAM tier\n", __func__);
>  
> @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
>* types assigned.
>*/
>   for_each_node_state(node, N_MEMORY) {
> + if (!node_state(node, N_CPU))
> + /*
> +  * Defer memory tier initialization on
> +  * CPUless numa nodes. These will be initialized
> +  * after firmware and devices are initialized.

Could the comment also say why we can't defer them all?

(In an odd coincidence we have a similar issue for some CPU hotplug
 related bring up where review feedback was move all cases later).

> +  */
> + continue;
> +
>   memtier = set_node_memory_tier(node);
>   if (IS_ERR(memtier))
>   /*




Re: [RFC PATCH] ftrace: riscv: move from REGS to ARGS

2024-04-05 Thread Puranjay Mohan
On Wed, Apr 3, 2024 at 6:02 AM Masami Hiramatsu  wrote:
>
> On Tue, 02 Apr 2024 15:02:41 +0200
> Björn Töpel  wrote:
>
> > Puranjay Mohan  writes:
> >
> > > This commit replaces riscv's support for FTRACE_WITH_REGS with support
> > > for FTRACE_WITH_ARGS. This is required for the ongoing effort to stop
> > > relying on stop_machine() for RISCV's implementation of ftrace.
> > >
> > > The main relevant benefit that this change will bring for the above
> > > use-case is that now we don't have separate ftrace_caller and
> > > ftrace_regs_caller trampolines.  This will allow the callsite to call
> > > ftrace_caller by modifying a single instruction. Now the callsite can
> > > do something similar to:
> > >
> > > When not tracing:| When tracing:
> > >
> > > func:  func:
> > >   auipc t0, ftrace_caller_topauipc t0, ftrace_caller_top
> > >   nop  <==>  jalr  t0, 
> > > ftrace_caller_bottom
> > >   [...]  [...]
> > >
> > > The above assumes that we are dropping the support of calling a direct
> > > trampoline from the callsite. We need to drop this as the callsite can't
> > > change the target address to call, it can only enable/disable a call to
> > > a preset target (ftrace_caller in the above diagram).
> > >
> > > Currently, ftrace_regs_caller saves all CPU registers in the format of
> > > struct pt_regs and allows the tracer to modify them. We don't need to
> > > save all of the CPU registers because at function entry only a subset of
> > > pt_regs is live:
> > >
> > > |----------+----------+----------------------------------------------|
> > > | Register | ABI Name | Description                                  |
> > > |----------+----------+----------------------------------------------|
> > > | x1       | ra       | Return address for traced function           |
> > > | x2       | sp       | Stack pointer                                |
> > > | x5       | t0       | Return address for ftrace_caller trampoline  |
> > > | x8       | s0/fp    | Frame pointer                                |
> > > | x10-11   | a0-1     | Function arguments/return values             |
> > > | x12-17   | a2-7     | Function arguments                           |
> > > |----------+----------+----------------------------------------------|
> > >
> > > See RISCV calling convention[1] for the above table.
> > >
> > > Saving just the live registers decreases the amount of stack space
> > > required from 288 Bytes to 112 Bytes.
> > >
> > > Basic testing was done with this on the VisionFive 2 development board.
> > >
> > > Note:
> > >   - Moving from REGS to ARGS will mean that RISCV will stop supporting
> > > KPROBES_ON_FTRACE as it requires full pt_regs to be saved.
> >
> > ...and FPROBES, but Masami is (still?) working on a series to address
> > that [1].
>
> Yes, even with this patch, FPROBE can still work with my series.
> So I'm OK with this change.
>
> Thank you,
>
> >
> > My perspective: this is following the work Mark et al. have done for
> > arm64, and it does make sense for RISC-V as well. I'm in favor of having
> > this change as part of the bigger call_ops/text patch change for RISC-V.
> >
> > [...]
> >
> > > diff --git a/arch/riscv/include/asm/ftrace.h 
> > > b/arch/riscv/include/asm/ftrace.h
> > > index 1276d7d9ca8b..1e95bf4ded6c 100644
> > > --- a/arch/riscv/include/asm/ftrace.h
> > > +++ b/arch/riscv/include/asm/ftrace.h
> > > @@ -124,20 +124,87 @@ struct dyn_ftrace;
> > >  int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
> > >  #define ftrace_init_nop ftrace_init_nop
> > >
> > > -#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > > +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> > > +#define arch_ftrace_get_regs(regs) NULL
> > >  struct ftrace_ops;
> > > -struct ftrace_regs;
> > > +struct ftrace_regs {
> > > +   unsigned long epc;
> > > +   unsigned long ra;
> > > +   unsigned long sp;
> > > +   unsigned long s0;
> > > +   unsigned long t1;
> > > +   union {
> > > +   unsigned long args[8];
> > > +   struct {
> > > +   unsigned long a0;
> > > +   unsigned long a1;
> > > +   unsigned long a2;
> > > +   unsigned long a3;
> > > +   unsigned long a4;
> > > +   unsigned long a5;
> > > +   unsigned long a6;
> > > +   unsigned long a7;
> > > +   };
> > > +   };
> > > +};
> > > +
> > > +static __always_inline unsigned long
> > > +ftrace_regs_get_instruction_pointer(const struct ftrace_regs *fregs)
> > > +{
> > > +   return fregs->epc;
> > > +}
> > > +
> > > +static __always_inline void
> > > +ftrace_regs_set_instruction_pointer(struct ftrace_regs *fregs,
> > > +   unsigned long pc)
> > > +{
> > > +   fregs->epc = pc;
> > > +}
> > > +
> > > +static __always_inline unsigned long
> > > +ftrace_regs_get_stack_pointer(const struct 

Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-05 Thread Jonathan Cameron
On Fri,  5 Apr 2024 00:07:05 +
"Ho-Ren (Jack) Chuang"  wrote:

> Since different memory devices require finding, allocating, and putting
> memory types, these common steps are abstracted in this patch,
> enhancing the scalability and conciseness of the code.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Reviewed-by: "Huang, Ying" 
Reviewed-by: Jonathan Cameron 




[PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()

2024-04-05 Thread Paolo Bonzini
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 2 --
 kernel/events/uprobes.c  | 5 ++---
 mm/ksm.c | 4 ++--
 mm/memory.c  | 7 +--
 mm/migrate_device.c  | 8 ++--
 5 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8c72bf651606..d39ebb10caeb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -657,6 +657,4 @@ static inline void mmu_notifier_synchronize(void)
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
-#define set_pte_at_notify set_pte_at
-
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e4834d23e1d1..f4523b95c945 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include /* anon_vma_prepare */
-#include /* set_pte_at_notify */
 #include /* folio_free_swap */
 #include   /* user_enable_single_step */
 #include   /* notifier mechanism */
@@ -195,8 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, 
unsigned long addr,
flush_cache_page(vma, addr, pte_pfn(ptep_get(pvmw.pte)));
ptep_clear_flush(vma, addr, pvmw.pte);
if (new_page)
-   set_pte_at_notify(mm, addr, pvmw.pte,
- mk_pte(new_page, vma->vm_page_prot));
+   set_pte_at(mm, addr, pvmw.pte,
+  mk_pte(new_page, vma->vm_page_prot));
 
folio_remove_rmap_pte(old_folio, old_page, vma);
if (!folio_mapped(old_folio))
diff --git a/mm/ksm.c b/mm/ksm.c
index 8c001819cf10..108a4d167824 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1345,7 +1345,7 @@ static int write_protect_page(struct vm_area_struct *vma, 
struct page *page,
if (pte_write(entry))
entry = pte_wrprotect(entry);
 
-   set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
+   set_pte_at(mm, pvmw.address, pvmw.pte, entry);
}
*orig_pte = entry;
err = 0;
@@ -1447,7 +1447,7 @@ static int replace_page(struct vm_area_struct *vma, 
struct page *page,
 * See Documentation/mm/mmu_notifier.rst
 */
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, newpte);
+   set_pte_at(mm, addr, ptep, newpte);
 
folio = page_folio(page);
folio_remove_rmap_pte(folio, page, vma);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..9a6f4d8aa379 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3327,13 +3327,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush(vma, vmf->address, vmf->pte);
folio_add_new_anon_rmap(new_folio, vma, vmf->address);
folio_add_lru_vma(new_folio, vma);
-   /*
-* We call the notify macro here because, when using secondary
-* mmu page tables (such as kvm shadow page tables), we want the
-* new page to be mapped directly into the secondary page table.
-*/
BUG_ON(unshare && pte_write(entry));
-   set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
+   set_pte_at(mm, vmf->address, vmf->pte, entry);
update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
if (old_folio) {
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index b6c27c76e1a0..66206734b1b9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -664,13 +664,9 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
if (flush) {
flush_cache_page(vma, addr, pte_pfn(orig_pte));
ptep_clear_flush(vma, addr, ptep);
-   set_pte_at_notify(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
-   } else {
-   /* No need to invalidate - it was non-present before */
-   set_pte_at(mm, addr, ptep, entry);
-   update_mmu_cache(vma, addr, ptep);
}
+   set_pte_at(mm, addr, ptep, entry);
+   update_mmu_cache(vma, addr, ptep);
 
pte_unmap_unlock(ptep, ptl);
*src = MIGRATE_PFN_MIGRATE;
-- 
2.43.0




[PATCH 0/4] KVM, mm: remove the .change_pte() MMU notifier and set_pte_at_notify()

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an optimization
and for this reason it was initially called without a surrounding
mmu_notifier_invalidate_range_{start,end}() pair.  It was only ever
implemented by KVM (which was also the original user of MMU notifiers)
and the rules on when to call set_pte_at_notify() rather than set_pte_at()
have always been pretty obscure.

It may seem a miracle that it has never caused any hard to trigger
bugs, but there's a good reason for that: KVM's implementation has
been nonfunctional for a good part of its existence.  Already in
2012, commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with
invalidate_range_start and invalidate_range_end", 2012-10-09) changed the
.change_pte() callback to occur within an invalidate_range_start/end()
pair; and because KVM unmaps the sPTEs during .invalidate_range_start(),
.change_pte() has no hope of finding a sPTE to change.

Therefore, all the code for .change_pte() can be removed from both KVM
and mm/, and set_pte_at_notify() can be replaced with just set_pte_at().

Please review!  Also feel free to take the KVM patches through the mm
tree, as I don't expect any conflicts.

Thanks,

Paolo

Paolo Bonzini (4):
  KVM: delete .change_pte MMU notifier callback
  KVM: remove unused argument of kvm_handle_hva_range()
  mmu_notifier: remove the .change_pte() callback
  mm: replace set_pte_at_notify() with just set_pte_at()

 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/linux/mmu_notifier.h  | 44 --
 include/trace/events/kvm.h| 15 
 kernel/events/uprobes.c   |  5 +--
 mm/ksm.c  |  4 +-
 mm/memory.c   |  7 +---
 mm/migrate_device.c   |  8 +---
 mm/mmu_notifier.c | 17 -
 virt/kvm/kvm_main.c   | 50 +
 26 files changed, 10 insertions(+), 411 deletions(-)

-- 
2.43.0




[PATCH 3/4] mmu_notifier: remove the .change_pte() callback

2024-04-05 Thread Paolo Bonzini
The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini 
---
 include/linux/mmu_notifier.h | 46 ++--
 mm/mmu_notifier.c| 17 -
 2 files changed, 2 insertions(+), 61 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..8c72bf651606 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -122,15 +122,6 @@ struct mmu_notifier_ops {
  struct mm_struct *mm,
  unsigned long address);
 
-   /*
-* change_pte is called in cases that pte mapping to page is changed:
-* for example, when ksm remaps pte to point to a new shared page.
-*/
-   void (*change_pte)(struct mmu_notifier *subscription,
-  struct mm_struct *mm,
-  unsigned long address,
-  pte_t pte);
-
/*
 * invalidate_range_start() and invalidate_range_end() must be
 * paired and are called only when the mmap_lock and/or the
@@ -392,8 +383,6 @@ extern int __mmu_notifier_clear_young(struct mm_struct *mm,
  unsigned long end);
 extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
-extern void __mmu_notifier_change_pte(struct mm_struct *mm,
- unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_arch_invalidate_secondary_tlbs(struct mm_struct *mm,
@@ -439,13 +428,6 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-   if (mm_has_notifiers(mm))
-   __mmu_notifier_change_pte(mm, address, pte);
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -581,26 +563,6 @@ static inline void mmu_notifier_range_init_owner(
__young;\
 })
 
-/*
- * set_pte_at_notify() sets the pte _after_ running the notifier.
- * This is safe to start by updating the secondary MMUs, because the primary 
MMU
- * pte invalidate must have already happened with a ptep_clear_flush() before
- * set_pte_at_notify() has been invoked.  Updating the secondary MMUs first is
- * required when we change both the protection of the mapping from read-only to
- * read-write and the pfn (like during copy on write page faults). Otherwise 
the
- * old page would remain mapped readonly in the secondary MMUs after the new
- * page is already writable by some CPU through the primary MMU.
- */
-#define set_pte_at_notify(__mm, __address, __ptep, __pte)  \
-({ \
-   struct mm_struct *___mm = __mm; \
-   unsigned long ___address = __address;   \
-   pte_t ___pte = __pte;   \
-   \
-   mmu_notifier_change_pte(___mm, ___address, ___pte); \
-   set_pte_at(___mm, ___address, __ptep, ___pte);  \
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 struct mmu_notifier_range {
@@ -650,11 +612,6 @@ static inline int mmu_notifier_test_young(struct mm_struct 
*mm,
return 0;
 }
 
-static inline void mmu_notifier_change_pte(struct mm_struct *mm,
-  unsigned long address, pte_t pte)
-{
-}
-
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
@@ -693,7 +650,6 @@ static inline void 
mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 #defineptep_clear_flush_notify ptep_clear_flush
 #define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
-#define set_pte_at_notify set_pte_at
 
 static inline void mmu_notifier_synchronize(void)
 {
@@ -701,4 +657,6 @@ static 

[PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-05 Thread Paolo Bonzini
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().

Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.

In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.

This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.

Signed-off-by: Paolo Bonzini 
---
 arch/arm64/kvm/mmu.c  | 34 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/mmu.c  | 32 
 arch/mips/kvm/mmu.c   | 30 ---
 arch/powerpc/include/asm/kvm_ppc.h|  1 -
 arch/powerpc/kvm/book3s.c |  5 ---
 arch/powerpc/kvm/book3s.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c   | 12 --
 arch/powerpc/kvm/book3s_hv.c  |  1 -
 arch/powerpc/kvm/book3s_pr.c  |  7 
 arch/powerpc/kvm/e500_mmu_host.c  |  6 ---
 arch/riscv/kvm/mmu.c  | 20 --
 arch/x86/kvm/mmu/mmu.c| 54 +--
 arch/x86/kvm/mmu/spte.c   | 16 
 arch/x86/kvm/mmu/spte.h   |  2 -
 arch/x86/kvm/mmu/tdp_mmu.c| 46 ---
 arch/x86/kvm/mmu/tdp_mmu.h|  1 -
 include/linux/kvm_host.h  |  2 -
 include/trace/events/kvm.h| 15 
 virt/kvm/kvm_main.c   | 43 -
 20 files changed, 2 insertions(+), 327 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dc04bc767865..ff17849be9f4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
return false;
 }
 
-bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
-{
-   kvm_pfn_t pfn = pte_pfn(range->arg.pte);
-
-   if (!kvm->arch.mmu.pgt)
-   return false;
-
-   WARN_ON(range->end - range->start != 1);
-
-   /*
-* If the page isn't tagged, defer to user_mem_abort() for sanitising
-* the MTE tags. The S2 pte should have been unmapped by
-* mmu_notifier_invalidate_range_end().
-*/
-   if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
-   return false;
-
-   /*
-* We've moved a page around, probably through CoW, so let's treat
-* it just like a translation fault and the map handler will clean
-* the cache to the PoC.
-*
-* The MMU notifiers will have unmapped a huge PMD before calling
-* ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
-* therefore we never need to clear out a huge PMD through this
-* calling path and a memcache is not required.
-*/
-   kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
-  PAGE_SIZE, __pfn_to_phys(pfn),
-  KVM_PGTABLE_PROT_R, NULL, 0);
-
-   return false;
-}
-
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
u64 size = (range->end - range->start) << PAGE_SHIFT;
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 2d62f7b0d377..69305441f40d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -203,7 +203,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);
 
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long 
end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
diff --git a/arch/loongarch/kvm/mmu.c b/arch/loongarch/kvm/mmu.c
index a556cff35740..98883aa23ab8 100644
--- a/arch/loongarch/kvm/mmu.c
+++ b/arch/loongarch/kvm/mmu.c
@@ -494,38 +494,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
kvm_gfn_range *range)
range->end << 

[PATCH 2/4] KVM: remove unused argument of kvm_handle_hva_range()

2024-04-05 Thread Paolo Bonzini
The only user was kvm_mmu_notifier_change_pte(), which is now gone.

Signed-off-by: Paolo Bonzini 
---
 virt/kvm/kvm_main.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2fcd9979752a..970111ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -595,8 +595,6 @@ static void kvm_null_fn(void)
 }
 #define IS_KVM_NULL_FN(fn) ((fn) == (void *)kvm_null_fn)
 
-static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
-
 /* Iterate over each memslot intersecting [start, last] (inclusive) range */
 #define kvm_for_each_memslot_in_hva_range(node, slots, start, last) \
for (node = interval_tree_iter_first(>hva_tree, start, last); \
@@ -682,14 +680,12 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
unsigned long start,
unsigned long end,
-   union kvm_mmu_notifier_arg arg,
gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
-   .arg= arg,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
.flush_on_ret   = true,
@@ -880,8 +876,7 @@ static int kvm_mmu_notifier_clear_flush_young(struct 
mmu_notifier *mn,
 {
trace_kvm_age_hva(start, end);
 
-   return kvm_handle_hva_range(mn, start, end, KVM_MMU_NOTIFIER_NO_ARG,
-   kvm_age_gfn);
+   return kvm_handle_hva_range(mn, start, end, kvm_age_gfn);
 }
 
 static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
-- 
2.43.0





Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-05 Thread Oleg Nesterov
On 04/05, Jiri Olsa wrote:
>
> On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> >
> > I think this expects setjmp/longjmp as below
> >
> > foo() { <- retprobe1
> > setjmp()
> > bar() { <- retprobe2
> > longjmp()
> > }
> > } <- return to trampoline
> >
> > In this case, we need to skip retprobe2's instance.

Yes,
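
A tiny user-space illustration of that setjmp()/longjmp() flow (plain C,
nothing kernel-specific; the comments only mark where retprobe1/retprobe2
from the diagram above would sit):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

static void bar(void)		/* <- retprobe2 would hook this return */
{
	longjmp(env, 1);	/* skips bar()'s normal return path */
}

static void foo(void)		/* <- retprobe1 would hook this return */
{
	if (setjmp(env) == 0)
		bar();		/* bar() never returns here normally */
	/* foo() itself still returns normally to its caller */
}

int main(void)
{
	foo();
	puts("back in main");
	return 0;
}

So only retprobe1's instance is still meaningful when the trampoline is
finally hit, and retprobe2's instance has to be thrown away.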

> > My concern is, if we can not find appropriate return instance, what happen?
> > e.g.
> >
> > foo() { <-- retprobe1
> >bar() { # sp is decremented
> >sys_uretprobe() <-- ??
> > }
> > }
> >
> > It seems sys_uretprobe() will handle retprobe1 at that point instead of
> > SIGILL.
>
> yes, and I think it's fine, you get the consumer called in wrong place,
> but it's your fault and kernel won't crash

Agreed.

With or without this patch userspace can also do

foo() { <-- retprobe1
	bar() {
		jump to xol_area
	}
}

handle_trampoline() will handle retprobe1.

> this can be fixed by checking the syscall is called from the trampoline
> and prevent handle_trampoline call if it's not

Yes, but I still do not think this makes a lot of sense. But I won't argue.

And what should sys_uretprobe() do if it is not called from the trampoline?
I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.

I agree very much with Andrii,

   sigreturn() exists only to allow the implementation of signal handlers.
   It should never be called directly.  Details of the arguments (if any)
   passed to sigreturn() vary depending on the architecture.

this is how sys_uretprobe() should be treated/documented.

sigreturn() can be "improved" too. Say, it could validate sigcontext->ip
and return -EINVAL if this addr is not valid. But why?

Oleg.




Re: [PATCH v3 21/25] drivers: media: i2c: imx258: Use macros

2024-04-05 Thread Luis Garcia
On 4/4/24 00:46, Sakari Ailus wrote:
> On Wed, Apr 03, 2024 at 01:17:26PM -0600, Luigi311 wrote:
>> On 4/3/24 10:23, Sakari Ailus wrote:
>>> Hi Luis,
>>>
>>> On Wed, Apr 03, 2024 at 09:03:50AM -0600, g...@luigi311.com wrote:
 From: Luis Garcia 

 Use understandable macros instead of raw values.

 Signed-off-by: Ondrej Jirman 
 Signed-off-by: Luis Garcia 
 ---
  drivers/media/i2c/imx258.c | 434 ++---
  1 file changed, 207 insertions(+), 227 deletions(-)

 diff --git a/drivers/media/i2c/imx258.c b/drivers/media/i2c/imx258.c
 index e2ecf6109516..30352c33f63c 100644
 --- a/drivers/media/i2c/imx258.c
 +++ b/drivers/media/i2c/imx258.c
 @@ -33,8 +33,6 @@
  #define IMX258_VTS_30FPS_VGA  0x034c
  #define IMX258_VTS_MAX65525
  
 -#define IMX258_REG_VTS0x0340
 -
  /* HBLANK control - read only */
  #define IMX258_PPL_DEFAULT5352
  
 @@ -90,6 +88,53 @@
  #define IMX258_PIXEL_ARRAY_WIDTH  4208U
  #define IMX258_PIXEL_ARRAY_HEIGHT 3120U
  
 +/* regs */
 +#define IMX258_REG_PLL_MULT_DRIV  0x0310
 +#define IMX258_REG_IVTPXCK_DIV0x0301
 +#define IMX258_REG_IVTSYCK_DIV0x0303
 +#define IMX258_REG_PREPLLCK_VT_DIV0x0305
 +#define IMX258_REG_IOPPXCK_DIV0x0309
 +#define IMX258_REG_IOPSYCK_DIV0x030b
 +#define IMX258_REG_PREPLLCK_OP_DIV0x030d
 +#define IMX258_REG_PHASE_PIX_OUTEN0x3030
 +#define IMX258_REG_PDPIX_DATA_RATE0x3032
 +#define IMX258_REG_SCALE_MODE 0x0401
 +#define IMX258_REG_SCALE_MODE_EXT 0x3038
 +#define IMX258_REG_AF_WINDOW_MODE 0x7bcd
 +#define IMX258_REG_FRM_LENGTH_CTL 0x0350
 +#define IMX258_REG_CSI_LANE_MODE  0x0114
 +#define IMX258_REG_X_EVN_INC  0x0381
 +#define IMX258_REG_X_ODD_INC  0x0383
 +#define IMX258_REG_Y_EVN_INC  0x0385
 +#define IMX258_REG_Y_ODD_INC  0x0387
 +#define IMX258_REG_BINNING_MODE   0x0900
 +#define IMX258_REG_BINNING_TYPE_V 0x0901
 +#define IMX258_REG_FORCE_FD_SUM   0x300d
 +#define IMX258_REG_DIG_CROP_X_OFFSET  0x0408
 +#define IMX258_REG_DIG_CROP_Y_OFFSET  0x040a
 +#define IMX258_REG_DIG_CROP_IMAGE_WIDTH   0x040c
 +#define IMX258_REG_DIG_CROP_IMAGE_HEIGHT  0x040e
 +#define IMX258_REG_SCALE_M0x0404
 +#define IMX258_REG_X_OUT_SIZE 0x034c
 +#define IMX258_REG_Y_OUT_SIZE 0x034e
 +#define IMX258_REG_X_ADD_STA  0x0344
 +#define IMX258_REG_Y_ADD_STA  0x0346
 +#define IMX258_REG_X_ADD_END  0x0348
 +#define IMX258_REG_Y_ADD_END  0x034a
 +#define IMX258_REG_EXCK_FREQ  0x0136
 +#define IMX258_REG_CSI_DT_FMT 0x0112
 +#define IMX258_REG_LINE_LENGTH_PCK0x0342
 +#define IMX258_REG_SCALE_M_EXT0x303a
 +#define IMX258_REG_FRM_LENGTH_LINES   0x0340
 +#define IMX258_REG_FINE_INTEG_TIME0x0200
 +#define IMX258_REG_PLL_IVT_MPY0x0306
 +#define IMX258_REG_PLL_IOP_MPY0x030e
 +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_H   0x0820
 +#define IMX258_REG_REQ_LINK_BIT_RATE_MBPS_L   0x0822
 +
 +#define REG8(a, v) { a, v }
 +#define REG16(a, v) { a, ((v) >> 8) & 0xff }, { (a) + 1, (v) & 0xff }
>>>
>>> The patch is nice but these macros are better replaced by the V4L2 CCI
>>> helper that also offers register access functions. Could you add a patch to
>>> convert the driver to use it (maybe after this one)?
>>>
>>
>> Ohh perfect, using something else would be great. Ill go ahead and see
>> if I can get that working.
> 
> Thanks. It may be easier to just do it in this one actually. Up to you.
> 

I've made the swap, but it looks like it's not playing nice with my
ppp: it's causing a crash and showing a call trace as soon as it does
its first read to check the identity. I went in and dropped the
cci_read and left it with the original implementation, and I'm getting
a very similar crash with cci_write too, so it looks like it's not
liking how I'm implementing it. Looking at the few other drivers that
were swapped over to use that, I don't seem to be missing anything.
It's a big change, so it's hard to describe exactly what I've changed,
but I do have the change on my github here
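
For comparison, this is a minimal sketch of what the CCI-based identity
read usually ends up looking like (the register address, chip ID value,
helper name, and the regmap member in struct imx258 are assumptions for
illustration, not the actual patch):

#include <media/v4l2-cci.h>

#define IMX258_REG_CHIP_ID	CCI_REG16(0x0016)	/* address assumed */
#define IMX258_CHIP_ID		0x0258			/* value assumed */

/* hypothetical helper; assumes struct imx258 gained a 'struct regmap
 * *regmap' member set up in probe() with
 * imx258->regmap = devm_cci_regmap_init_i2c(client, 16);
 */
static int imx258_identify_module(struct imx258 *imx258)
{
	u64 val;
	int ret = 0;

	cci_read(imx258->regmap, IMX258_REG_CHIP_ID, &val, &ret);
	if (ret)
		return ret;

	return (val == IMX258_CHIP_ID) ? 0 : -ENODEV;
}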

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-04-05 Thread Honggyu Kim
Hi SeongJae,

On Tue, 26 Mar 2024 16:03:09 -0700 SeongJae Park  wrote:
> On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park  wrote:
> 
> > On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:
> [...]
> > > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  
> > > > wrote:
> [...]
> > >
> > > I would like to hear how you think about this.
> > 
> > So, to summarize my humble opinion,
> > 
> > 1. I like the idea of having two actions.  But I'd like to use names other 
> > than
> >'promote' and 'demote'.
> > 2. I still prefer having a filter for the page granularity access re-check.
> > 
> [...]
> > > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
> > > can talk more about this issue.
> > 
> > Looking forward to chatting with you :)
> 
> We met and discussed about this topic in the chat series yesterday.  Sharing
> the summary for keeping the open discussion.
> 
> Honggyu thankfully accepted my humble suggestions on the last reply.  Honggyu
> will post the third version of this patchset soon.  The patchset will 
> implement
> two new DAMOS actions, namely MIGRATE_HOT and MIGRATE_COLD.  Those will 
> migrate
> the DAMOS target regions to a user-specified NUMA node, but will have 
> different
> prioritization score function.  As name implies, they will prioritize more hot
> regions and cold regions, respectively.

It's a late answer, but thanks very much for the summary.  It was a
really helpful discussion in the chat series.

I'm leaving a message so that anyone can follow the update.  The v3
is posted at 
https://lore.kernel.org/damon/20240405060858.2818-1-honggyu@sk.com.

> Honggyu, please feel free to fix if there is anything wrong or missed.

I don't have anything to fix from your summary.

> And thanks to Honggyu again for patiently keeping this productive discussion
> and their awesome work.

I appreciate your persistent support and kind reviews. 

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-05 Thread Jiri Olsa
On Fri, Apr 05, 2024 at 10:22:03AM +0900, Masami Hiramatsu wrote:
> On Thu, 4 Apr 2024 18:11:09 +0200
> Oleg Nesterov  wrote:
> 
> > On 04/05, Masami Hiramatsu wrote:
> > >
> > > Can we make this syscall and uprobe behavior clearer? As you said, if
> > > the application use sigreturn or longjump, it may skip returns and
> > > shadow stack entries are left in the kernel. In such cases, can uretprobe
> > > detect it properly, or just crash the process (or process runs wrongly)?
> > 
> > Please see the comment in handle_trampoline(), it tries to detect this case.
> > This patch should not make any difference.
> 
> I think you mean this loop will skip and discard the stacked return_instance
> to find the valid one.
> 
> 
> do {
> /*
>  * We should throw out the frames invalidated by longjmp().
>  * If this chain is valid, then the next one should be alive
>  * or NULL; the latter case means that nobody but ri->func
>  * could hit this trampoline on return. TODO: sigaltstack().
>  */
> next = find_next_ret_chain(ri);
> valid = !next || arch_uretprobe_is_alive(next, RP_CHECK_RET, 
> regs);
> 
> instruction_pointer_set(regs, ri->orig_ret_vaddr);
> do {
> if (valid)
> handle_uretprobe_chain(ri, regs);
> ri = free_ret_instance(ri);
> utask->depth--;
> } while (ri != next);
> } while (!valid);
> 
> 
> I think this expects setjmp/longjmp as below
> 
> foo() { <- retprobe1
>   setjmp()
>   bar() { <- retprobe2
>   longjmp()
>   }
> } <- return to trampoline
> 
> In this case, we need to skip retprobe2's instance.
> My concern is, if we can not find appropriate return instance, what happen?
> e.g.
> 
> foo() { <-- retprobe1
>bar() { # sp is decremented
>sys_uretprobe() <-- ??
> }
> }
> 
> It seems sys_uretprobe() will handle retprobe1 at that point instead of
> SIGILL.

yes, and I think it's fine, you get the consumer called in wrong place,
but it's your fault and kernel won't crash

this can be fixed by checking the syscall is called from the trampoline
and prevent handle_trampoline call if it's not
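
roughly something like this, I guess (just a sketch; the helper for
looking up the trampoline address is a placeholder, not an existing
function):

SYSCALL_DEFINE0(uretprobe)
{
	/* sketch: refuse the syscall unless it was reached from the
	 * uretprobe trampoline page; uprobe_get_trampoline_vaddr() is a
	 * placeholder name here */
	struct uprobe_task *utask = current->utask;
	struct pt_regs *regs = task_pt_regs(current);
	unsigned long tramp = uprobe_get_trampoline_vaddr();

	if (!utask || instruction_pointer(regs) < tramp ||
	    instruction_pointer(regs) >= tramp + PAGE_SIZE) {
		force_sig(SIGILL);	/* or return -EINVAL */
		return -1;
	}

	/* ... existing handle_trampoline() path ... */
}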

> 
> Can we avoid this with below strict check?
> 
> if (ri->stack != regs->sp + expected_offset)
>   goto sigill;

hm the current uprobe 'alive' check makes sure the return_instance is above
or at the same stack address, not sure we can match it exactly, need to think
about that more
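
for reference, the existing x86 check is roughly this (quoted from
memory, see arch/x86/kernel/uprobes.c for the exact code):

bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
			     struct pt_regs *regs)
{
	if (ctx == RP_CHECK_CALL)	/* sp was just decremented by "call" insn */
		return regs->sp < ret->stack;
	else
		return regs->sp <= ret->stack;
}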

> 
> expected_offset should be 16 (push * 3 - ret) on x64 if we ri->stack is the
> regs->sp right after call.

the syscall trampoline already updates the regs->sp before calling
handle_trampoline

regs->sp += sizeof(r11_cx_ax);
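
(for the arithmetic, assuming r11_cx_ax names the three registers the
trampoline pushes -- r11, rcx and rax:)

	/* three saved registers, 8 bytes each on x86-64, so this adds
	 * 3 * 8 = 24 bytes and undoes the trampoline's pushes before
	 * handle_trampoline() runs */
	regs->sp += sizeof(r11_cx_ax);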

jirka



Re: [PATCH 00/64] i2c: reword i2c_algorithm according to newest specification

2024-04-05 Thread Wolfram Sang
Hi Andi, hi everyone,

thank you for reviewing and waiting. I had a small personal hiatus over
Easter but now I am back. This series needs another cycle, so no need to
hurry. I will address some of the review comments but not all. The
conversion (and API improvements) are some bigger tasks, so
inconsistencies in between can't be avoided AFAICS.

I'll keep you updated.

Happy hacking,

   Wolfram





Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-05 Thread Hyeongtak Ji
On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  wrote:

...snip...

> +static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
> + enum migration_mode mm,
> + int target_nid)
> +{
> + int nid;
> + unsigned int nr_migrated = 0;
> + LIST_HEAD(node_folio_list);
> + unsigned int noreclaim_flag;
> +
> + if (list_empty(folio_list))
> + return nr_migrated;

How about checking if `target_nid` is `NUMA_NO_NODE` or not earlier,

> +
> + noreclaim_flag = memalloc_noreclaim_save();
> +
> + nid = folio_nid(lru_to_folio(folio_list));
> + do {
> + struct folio *folio = lru_to_folio(folio_list);
> +
> + if (nid == folio_nid(folio)) {
> + folio_clear_active(folio);
> + list_move(>lru, _folio_list);
> + continue;
> + }
> +
> + nr_migrated += damon_pa_migrate_folio_list(_folio_list,
> +NODE_DATA(nid), mm,
> +target_nid);
> + nid = folio_nid(lru_to_folio(folio_list));
> + } while (!list_empty(folio_list));
> +
> + nr_migrated += damon_pa_migrate_folio_list(_folio_list,
> +NODE_DATA(nid), mm,
> +target_nid);
> +
> + memalloc_noreclaim_restore(noreclaim_flag);
> +
> + return nr_migrated;
> +}
> +

...snip...

> +static unsigned int migrate_folio_list(struct list_head *migrate_folios,
> +struct pglist_data *pgdat,
> +int target_nid)
> +{
> + unsigned int nr_succeeded;
> + nodemask_t allowed_mask = NODE_MASK_NONE;
> +
> + struct migration_target_control mtc = {
> + /*
> +  * Allocate from 'node', or fail quickly and quietly.
> +  * When this happens, 'page' will likely just be discarded
> +  * instead of migrated.
> +  */
> + .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | 
> __GFP_NOWARN |
> + __GFP_NOMEMALLOC | GFP_NOWAIT,
> + .nid = target_nid,
> + .nmask = _mask
> + };
> +
> + if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
> + return 0;

instead of here.
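
Something along these lines, I think (a sketch of the suggested
reordering, not a tested change):

static unsigned long damon_pa_migrate_pages(struct list_head *folio_list,
					    enum migration_mode mm,
					    int target_nid)
{
	unsigned int nr_migrated = 0;

	/* nowhere to migrate to: bail out before touching any folio */
	if (target_nid == NUMA_NO_NODE)
		return 0;

	if (list_empty(folio_list))
		return nr_migrated;

	/* ... rest of the function unchanged ... */
}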

> +
> + if (list_empty(migrate_folios))
> + return 0;
> +
> + /* Migration ignores all cpuset and mempolicy settings */
> + migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
> +   (unsigned long), MIGRATE_ASYNC, MR_DAMON,
> +   _succeeded);
> +
> + return nr_succeeded;
> +}
> +

...snip...

Kind regards,
Hyeongtak



[RFC PATCH v3 6/7] mm/damon/paddr: introduce DAMOS_MIGRATE_HOT action for promotion

2024-04-05 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch introduces DAMOS_MIGRATE_HOT action, which is similar to
DAMOS_MIGRATE_COLD, but it is targeted to migrate hot pages.

It migrates pages inside the given region to the 'target_nid' NUMA node
set in sysfs.

Here is an example usage of this 'migrate_hot' action.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/
  $ cat contexts//schemes//action
  migrate_hot
  $ echo 0 > contexts//schemes//target_nid
  $ echo commit > state
  $ numactl -p 2 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID Node 0 Node 1 Node 2 Total
  --  -- -- -- -
  701 (hot_cold) 501  0601  1101

Signed-off-by: Hyeongtak Ji 
Signed-off-by: Honggyu Kim 
---
 include/linux/damon.h|  2 ++
 mm/damon/paddr.c | 12 ++--
 mm/damon/sysfs-schemes.c |  4 +++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index df8671e69a70..934c95a7c042 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:  Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:  Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_HOT:  Migrate for the given hot region.
  * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
  * @DAMOS_STAT:Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:  Total number of DAMOS actions
@@ -123,6 +124,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+   DAMOS_MIGRATE_HOT,
DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index fe217a26f788..fd9d35b5cc83 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -229,6 +229,7 @@ static bool damos_pa_filter_out(struct damos *scheme, 
struct folio *folio)
 
 enum migration_mode {
MIG_PAGEOUT,
+   MIG_MIGRATE_HOT,
MIG_MIGRATE_COLD,
 };
 
@@ -375,8 +376,10 @@ static unsigned long damon_pa_migrate(struct damon_region 
*r, struct damos *s,
if (damos_pa_filter_out(s, folio))
goto put_folio;
 
-   folio_clear_referenced(folio);
-   folio_test_clear_young(folio);
+   if (mm != MIG_MIGRATE_HOT) {
+   folio_clear_referenced(folio);
+   folio_test_clear_young(folio);
+   }
if (!folio_isolate_lru(folio))
goto put_folio;
/*
@@ -394,6 +397,7 @@ static unsigned long damon_pa_migrate(struct damon_region 
*r, struct damos *s,
case MIG_PAGEOUT:
applied = reclaim_pages(_list);
break;
+   case MIG_MIGRATE_HOT:
case MIG_MIGRATE_COLD:
applied = damon_pa_migrate_pages(_list, mm,
 s->target_nid);
@@ -454,6 +458,8 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx 
*ctx,
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_pa_deactivate_pages(r, scheme);
+   case DAMOS_MIGRATE_HOT:
+   return damon_pa_migrate(r, scheme, MIG_MIGRATE_HOT);
case DAMOS_MIGRATE_COLD:
return damon_pa_migrate(r, scheme, MIG_MIGRATE_COLD);
case DAMOS_STAT:
@@ -476,6 +482,8 @@ static int damon_pa_scheme_score(struct damon_ctx *context,
return damon_hot_score(context, r, scheme);
case DAMOS_LRU_DEPRIO:
return damon_cold_score(context, r, scheme);
+   case DAMOS_MIGRATE_HOT:
+   return damon_hot_score(context, r, scheme);
case DAMOS_MIGRATE_COLD:
return damon_cold_score(context, r, scheme);
default:
diff --git a/mm/damon/sysfs-schemes.c b/mm/damon/sysfs-schemes.c
index 18b7d054c748..1d2f62aa79ca 100644
--- a/mm/damon/sysfs-schemes.c
+++ b/mm/damon/sysfs-schemes.c
@@ -1406,6 +1406,7 @@ static const char * const damon_sysfs_damos_action_strs[] 
= {
"nohugepage",
"lru_prio",
"lru_deprio",
+   "migrate_hot",
"migrate_cold",
"stat",
 };
@@ -1660,7 +1661,8 @@ static ssize_t target_nid_store(struct kobject *kobj,
struct damon_sysfs_scheme, kobj);
int err = 0;
 
-if (scheme->action != DAMOS_MIGRATE_COLD)
+if (scheme->action != DAMOS_MIGRATE_HOT &&
+scheme->action != DAMOS_MIGRATE_COLD)
 return -EINVAL;
 
/* TODO: error handling for target_nid range. */
-- 
2.34.1




[RFC PATCH v3 7/7] mm/damon: Add "damon_migrate_{hot,cold}" vmstat

2024-04-05 Thread Honggyu Kim
This patch adds "damon_migrate_{hot,cold}" under node specific vmstat
counters at the following location.

  /sys/devices/system/node/node*/vmstat

The counted values are accumulated into the global vmstat so it also
introduces the same counters at /proc/vmstat as well.

Signed-off-by: Honggyu Kim 
---
 include/linux/mmzone.h |  4 
 mm/damon/paddr.c   | 17 -
 mm/vmstat.c|  4 
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..0005372c5503 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -214,6 +214,10 @@ enum node_stat_item {
PGDEMOTE_KSWAPD,
PGDEMOTE_DIRECT,
PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_DAMON_PADDR
+   DAMON_MIGRATE_HOT,
+   DAMON_MIGRATE_COLD,
+#endif
NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index fd9d35b5cc83..d559c242d151 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -235,10 +235,23 @@ enum migration_mode {
 
 static unsigned int migrate_folio_list(struct list_head *migrate_folios,
   struct pglist_data *pgdat,
+  enum migration_mode mm,
   int target_nid)
 {
unsigned int nr_succeeded;
nodemask_t allowed_mask = NODE_MASK_NONE;
+   enum node_stat_item node_stat;
+
+   switch (mm) {
+   case MIG_MIGRATE_HOT:
+   node_stat = DAMON_MIGRATE_HOT;
+   break;
+   case MIG_MIGRATE_COLD:
+   node_stat = DAMON_MIGRATE_COLD;
+   break;
+   default:
+   return 0;
+   }
 
struct migration_target_control mtc = {
/*
@@ -263,6 +276,8 @@ static unsigned int migrate_folio_list(struct list_head 
*migrate_folios,
  (unsigned long), MIGRATE_ASYNC, MR_DAMON,
  _succeeded);
 
+   mod_node_page_state(pgdat, node_stat, nr_succeeded);
+
return nr_succeeded;
 }
 
@@ -302,7 +317,7 @@ static unsigned int damon_pa_migrate_folio_list(struct 
list_head *folio_list,
/* 'folio_list' is always empty here */
 
/* Migrate folios selected for migration */
-   nr_migrated += migrate_folio_list(_folios, pgdat, target_nid);
+   nr_migrated += migrate_folio_list(_folios, pgdat, mm, 
target_nid);
/* Folios that could not be migrated are still in @migrate_folios */
if (!list_empty(_folios)) {
/* Folios which weren't migrated go back on @folio_list */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..be9ba89fede1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1252,6 +1252,10 @@ const char * const vmstat_text[] = {
"pgdemote_kswapd",
"pgdemote_direct",
"pgdemote_khugepaged",
+#ifdef CONFIG_DAMON_PADDR
+   "damon_migrate_hot",
+   "damon_migrate_cold",
+#endif
 
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
-- 
2.34.1




[RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-05 Thread Honggyu Kim
This patch introduces DAMOS_MIGRATE_COLD action, which is similar to
DAMOS_PAGEOUT, but migrates folios to the given 'target_nid' set in sysfs
instead of swapping them out.

The 'target_nid' sysfs knob is created by this patch to inform the
migration target node ID.

Here is an example usage of this 'migrate_cold' action.

  $ cd /sys/kernel/mm/damon/admin/kdamonds/
  $ cat contexts//schemes//action
  migrate_cold
  $ echo 2 > contexts//schemes//target_nid
  $ echo commit > state
  $ numactl -p 0 ./hot_cold 500M 600M &
  $ numastat -c -p hot_cold

  Per-node process memory usage (in MBs)
  PID Node 0 Node 1 Node 2 Total
  --  -- -- -- -
  701 (hot_cold) 501  0601  1101

Since there are some common routines with pageout, many functions have
similar logic between pageout and migrate cold.

damon_pa_migrate_folio_list() is a minimized version of
shrink_folio_list(), trimmed down to handle only demotion.

Signed-off-by: Honggyu Kim 
Signed-off-by: Hyeongtak Ji 
---
 include/linux/damon.h|   2 +
 mm/damon/paddr.c | 146 ++-
 mm/damon/sysfs-schemes.c |   4 ++
 3 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 24ea33a03d5d..df8671e69a70 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -105,6 +105,7 @@ struct damon_target {
  * @DAMOS_NOHUGEPAGE:  Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
  * @DAMOS_LRU_PRIO:Prioritize the region on its LRU lists.
  * @DAMOS_LRU_DEPRIO:  Deprioritize the region on its LRU lists.
+ * @DAMOS_MIGRATE_COLD: Migrate for the given cold region.
  * @DAMOS_STAT:Do nothing but count the stat.
  * @NR_DAMOS_ACTIONS:  Total number of DAMOS actions
  *
@@ -122,6 +123,7 @@ enum damos_action {
DAMOS_NOHUGEPAGE,
DAMOS_LRU_PRIO,
DAMOS_LRU_DEPRIO,
+   DAMOS_MIGRATE_COLD,
DAMOS_STAT, /* Do nothing but only record the stat */
NR_DAMOS_ACTIONS,
 };
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 277a1c4d833c..fe217a26f788 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -12,6 +12,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "../internal.h"
 #include "ops-common.h"
@@ -226,8 +229,137 @@ static bool damos_pa_filter_out(struct damos *scheme, 
struct folio *folio)
 
 enum migration_mode {
MIG_PAGEOUT,
+   MIG_MIGRATE_COLD,
 };
 
+static unsigned int migrate_folio_list(struct list_head *migrate_folios,
+  struct pglist_data *pgdat,
+  int target_nid)
+{
+   unsigned int nr_succeeded;
+   nodemask_t allowed_mask = NODE_MASK_NONE;
+
+   struct migration_target_control mtc = {
+   /*
+* Allocate from 'node', or fail quickly and quietly.
+* When this happens, 'page' will likely just be discarded
+* instead of migrated.
+*/
+   .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | 
__GFP_NOWARN |
+   __GFP_NOMEMALLOC | GFP_NOWAIT,
+   .nid = target_nid,
+   .nmask = _mask
+   };
+
+   if (pgdat->node_id == target_nid || target_nid == NUMA_NO_NODE)
+   return 0;
+
+   if (list_empty(migrate_folios))
+   return 0;
+
+   /* Migration ignores all cpuset and mempolicy settings */
+   migrate_pages(migrate_folios, alloc_migrate_folio, NULL,
+ (unsigned long), MIGRATE_ASYNC, MR_DAMON,
+ _succeeded);
+
+   return nr_succeeded;
+}
+
+static unsigned int damon_pa_migrate_folio_list(struct list_head *folio_list,
+   struct pglist_data *pgdat,
+   enum migration_mode mm,
+   int target_nid)
+{
+   unsigned int nr_migrated = 0;
+   struct folio *folio;
+   LIST_HEAD(ret_folios);
+   LIST_HEAD(migrate_folios);
+
+   cond_resched();
+
+   while (!list_empty(folio_list)) {
+   struct folio *folio;
+
+   cond_resched();
+
+   folio = lru_to_folio(folio_list);
+   list_del(>lru);
+
+   if (!folio_trylock(folio))
+   goto keep;
+
+   VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+
+   /* Relocate its contents to another node. */
+   list_add(>lru, _folios);
+   folio_unlock(folio);
+   continue;
+keep:
+   list_add(>lru, _folios);
+   VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+   }
+   /* 'folio_list' is always empty here */
+
+   /* Migrate folios selected for migration */
+   nr_migrated += migrate_folio_list(_folios, pgdat, target_nid);
+   /* 

[RFC PATCH v3 3/7] mm/damon/sysfs-schemes: add target_nid on sysfs-schemes

2024-04-05 Thread Honggyu Kim
From: Hyeongtak Ji 

This patch adds target_nid under
  /sys/kernel/mm/damon/admin/kdamonds//contexts//schemes//

The 'target_nid' can be used as the destination node for DAMOS actions
such as DAMOS_MIGRATE_{HOT,COLD} in the follow-up patches.

Signed-off-by: Hyeongtak Ji 
Signed-off-by: Honggyu Kim 
---
 include/linux/damon.h| 11 ++-
 mm/damon/core.c  |  5 -
 mm/damon/dbgfs.c |  2 +-
 mm/damon/lru_sort.c  |  3 ++-
 mm/damon/reclaim.c   |  3 ++-
 mm/damon/sysfs-schemes.c | 33 -
 6 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5881e4ac30be..24ea33a03d5d 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -337,6 +337,7 @@ struct damos_access_pattern {
  * @apply_interval_us: The time between applying the @action.
  * @quota: Control the aggressiveness of this scheme.
  * @wmarks:Watermarks for automated (in)activation of this scheme.
+ * @target_nid:Destination node if @action is 
"migrate_{hot,cold}".
  * @filters:   Additional set of  damos_filter for 
  * @stat:  Statistics of this scheme.
  * @list:  List head for siblings.
@@ -352,6 +353,10 @@ struct damos_access_pattern {
  * monitoring context are inactive, DAMON stops monitoring either, and just
  * repeatedly checks the watermarks.
  *
+ * @target_nid is used to set the migration target node for migrate_hot or
+ * migrate_cold actions, which means it's only meaningful when @action is 
either
+ * "migrate_hot" or "migrate_cold".
+ *
  * Before applying the  to a memory region,  damon_operations
  * implementation could check pages of the region and skip  to respect
  * 
@@ -373,6 +378,9 @@ struct damos {
 /* public: */
struct damos_quota quota;
struct damos_watermarks wmarks;
+   union {
+   int target_nid;
+   };
struct list_head filters;
struct damos_stat stat;
struct list_head list;
@@ -677,7 +685,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern 
*pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
-   struct damos_watermarks *wmarks);
+   struct damos_watermarks *wmarks,
+   int target_nid);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
 
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 5b325749fc12..7ff0259d9fa6 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -316,7 +316,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern 
*pattern,
enum damos_action action,
unsigned long apply_interval_us,
struct damos_quota *quota,
-   struct damos_watermarks *wmarks)
+   struct damos_watermarks *wmarks,
+   int target_nid)
 {
struct damos *scheme;
 
@@ -341,6 +342,8 @@ struct damos *damon_new_scheme(struct damos_access_pattern 
*pattern,
scheme->wmarks = *wmarks;
scheme->wmarks.activated = true;
 
+   scheme->target_nid = target_nid;
+
return scheme;
 }
 
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 7dac24e69e3b..d04fdccfa65b 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -279,7 +279,7 @@ static struct damos **str_to_schemes(const char *str, 
ssize_t len,
 
pos += parsed;
scheme = damon_new_scheme(, action, 0, ,
-   );
+   , NUMA_NO_NODE);
if (!scheme)
goto fail;
 
diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c
index 3de2916a65c3..3775f0f2743d 100644
--- a/mm/damon/lru_sort.c
+++ b/mm/damon/lru_sort.c
@@ -163,7 +163,8 @@ static struct damos *damon_lru_sort_new_scheme(
/* under the quota. */
,
/* (De)activate this according to the watermarks. */
-   _lru_sort_wmarks);
+   _lru_sort_wmarks,
+   NUMA_NO_NODE);
 }
 
 /* Create a DAMON-based operation scheme for hot memory regions */
diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c
index 66e190f0374a..84e6e96b5dcc 100644
--- a/mm/damon/reclaim.c
+++ b/mm/damon/reclaim.c
@@ -147,7 +147,8 @@ static struct damos *damon_reclaim_new_scheme(void)
/* under the quota. */
_reclaim_quota,
/* (De)activate this according to the watermarks. */
-   _reclaim_wmarks);
+   _reclaim_wmarks,
+   NUMA_NO_NODE);
 }
 
 static void damon_reclaim_copy_quota_status(struct damos_quota *dst,
diff --git 

[RFC PATCH v3 4/7] mm/migrate: add MR_DAMON to migrate_reason

2024-04-05 Thread Honggyu Kim
The current patch series introduces DAMON-based migration across NUMA
nodes, so it'd be better to have a new migrate_reason in trace events.

Signed-off-by: Honggyu Kim 
---
 include/linux/migrate_mode.h   | 1 +
 include/trace/events/migrate.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..cec36b7e7ced 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+   MR_DAMON,
MR_TYPES
 };
 
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 0190ef725b43..cd01dd7b3640 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -22,7 +22,8 @@
EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
EM( MR_CONTIG_RANGE,"contig_range") \
EM( MR_LONGTERM_PIN,"longterm_pin") \
-   EMe(MR_DEMOTION,"demotion")
+   EM( MR_DEMOTION,"demotion") \
+   EMe(MR_DAMON,   "damon")
 
 /*
  * First define the enums in the above macros to be exported to userspace
-- 
2.34.1




Re: [PATCH] [v3] module: don't ignore sysfs_create_link() failures

2024-04-05 Thread Greg Kroah-Hartman
On Tue, Mar 26, 2024 at 03:57:18PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The sysfs_create_link() return code is marked as __must_check, but the
> module_add_driver() function tries hard to not care, by assigning the
> return code to a variable. When building with 'make W=1', gcc still
> warns because this variable is only assigned but not used:
> 
> drivers/base/module.c: In function 'module_add_driver':
> drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used 
> [-Wunused-but-set-variable]
> 
> Rework the code to properly unwind and return the error code to the
> caller. My reading of the original code was that it tries to
> not fail when the links already exist, so keep ignoring -EEXIST
> errors.
> 
> Cc: Luis Chamberlain 
> Cc: linux-modu...@vger.kernel.org
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Fixes: e17e0f51aeea ("Driver core: show drivers in /sys/module/")
> See-also: 4a7fb6363f2d ("add __must_check to device management code")
> Signed-off-by: Arnd Bergmann 
> ---
> v3: make error handling stricter, add unwinding,
>  fix build fail with CONFIG_MODULES=n
> v2: rework to actually handle the error. I have not tested the
> error handling beyond build testing, so please review carefully.
> ---
>  drivers/base/base.h   |  9 ++---
>  drivers/base/bus.c|  9 -
>  drivers/base/module.c | 42 +++---
>  3 files changed, 45 insertions(+), 15 deletions(-)

Reviewed-by: Greg Kroah-Hartman 



[RFC PATCH v3 1/7] mm/damon/paddr: refactor DAMOS_PAGEOUT with migration_mode

2024-04-05 Thread Honggyu Kim
This is a preparation patch that introduces migration modes.

damon_pa_pageout() is renamed to damon_pa_migrate() and it receives an
extra argument for migration_mode.

No functional changes applied.

Signed-off-by: Honggyu Kim 
---
 mm/damon/paddr.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 081e2a325778..277a1c4d833c 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -224,7 +224,12 @@ static bool damos_pa_filter_out(struct damos *scheme, 
struct folio *folio)
return false;
 }
 
-static unsigned long damon_pa_pageout(struct damon_region *r, struct damos *s)
+enum migration_mode {
+   MIG_PAGEOUT,
+};
+
+static unsigned long damon_pa_migrate(struct damon_region *r, struct damos *s,
+ enum migration_mode mm)
 {
unsigned long addr, applied;
LIST_HEAD(folio_list);
@@ -249,7 +254,14 @@ static unsigned long damon_pa_pageout(struct damon_region 
*r, struct damos *s)
 put_folio:
folio_put(folio);
}
-   applied = reclaim_pages(_list);
+   switch (mm) {
+   case MIG_PAGEOUT:
+   applied = reclaim_pages(_list);
+   break;
+   default:
+   /* Unexpected migration mode. */
+   return 0;
+   }
cond_resched();
return applied * PAGE_SIZE;
 }
@@ -297,7 +309,7 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx 
*ctx,
 {
switch (scheme->action) {
case DAMOS_PAGEOUT:
-   return damon_pa_pageout(r, scheme);
+   return damon_pa_migrate(r, scheme, MIG_PAGEOUT);
case DAMOS_LRU_PRIO:
return damon_pa_mark_accessed(r, scheme);
case DAMOS_LRU_DEPRIO:
-- 
2.34.1




[RFC PATCH v3 2/7] mm: make alloc_demote_folio externally invokable for migration

2024-04-05 Thread Honggyu Kim
alloc_demote_folio() can be used outside of vmscan.c, so it'd be better
to remove the static keyword from it.

This function can also be used for both demotion and promotion so it'd
be better to rename it from alloc_demote_folio to alloc_migrate_folio.

Signed-off-by: Honggyu Kim 
---
 mm/internal.h |  1 +
 mm/vmscan.c   | 10 +++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..c96ff9bc82d0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -866,6 +866,7 @@ extern unsigned long  __must_check vm_mmap_pgoff(struct 
file *, unsigned long,
 unsigned long, unsigned long);
 
 extern void set_pageblock_order(void);
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private);
 unsigned long reclaim_pages(struct list_head *folio_list);
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4255619a1a31..9e456cac03b4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -910,8 +910,7 @@ static void folio_check_dirty_writeback(struct folio *folio,
mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct folio *alloc_demote_folio(struct folio *src,
-   unsigned long private)
+struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 {
struct folio *dst;
nodemask_t *allowed_mask;
@@ -935,6 +934,11 @@ static struct folio *alloc_demote_folio(struct folio *src,
if (dst)
return dst;
 
+   /*
+* Allocation failed from the target node so try to allocate from
+* fallback nodes based on allowed_mask.
+* See fallback_alloc() at mm/slab.c.
+*/
mtc->gfp_mask &= ~__GFP_THISNODE;
mtc->nmask = allowed_mask;
 
@@ -973,7 +977,7 @@ static unsigned int demote_folio_list(struct list_head 
*demote_folios,
node_get_allowed_targets(pgdat, _mask);
 
/* Demotion ignores all cpuset and mempolicy settings */
-   migrate_pages(demote_folios, alloc_demote_folio, NULL,
+   migrate_pages(demote_folios, alloc_migrate_folio, NULL,
  (unsigned long), MIGRATE_ASYNC, MR_DEMOTION,
  _succeeded);
 
-- 
2.34.1




[RFC PATCH v3 0/7] DAMON based tiered memory management for CXL memory

2024-04-05 Thread Honggyu Kim
There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
posted at [1].

It says no implementation of the demote/promote DAMOS actions has been
made yet.  This RFC is about their implementation for the physical
address space.


Changes from RFC v2:
  1. Rename DAMOS_{PROMOTE,DEMOTE} actions to DAMOS_MIGRATE_{HOT,COLD}.
  2. Create 'target_nid' to set the migration target node instead of
 depending on node distance based information.
  3. Instead of having page level access check in this patch series,
 delegate the job to a new DAMOS filter type YOUNG[2].
  4. Introduce vmstat counters "damon_migrate_{hot,cold}".
  5. Rebase from v6.7 to v6.8.

Changes from RFC:
  1. Move most of implementation from mm/vmscan.c to mm/damon/paddr.c.
  2. Simplify some functions of vmscan.c and used in paddr.c, but need
 to be reviewed more in depth.
  3. Refactor most functions for common usage for both promote and
 demote actions and introduce an enum migration_mode for its control.
  4. Add "target_nid" sysfs knob for migration destination node for both
 promote and demote actions.
  5. Move DAMOS_PROMOTE before DAMOS_DEMOTE and move then even above
 DAMOS_STAT.


Introduction


With the advent of CXL/PCIe-attached DRAM, which will be referred to
simply as CXL memory in this cover letter, some systems are becoming
more heterogeneous, having memory systems with different latency and
bandwidth characteristics.  They are usually handled as different NUMA
nodes in separate memory tiers, and CXL memory is used as a slow tier
because of its protocol overhead compared to local DRAM.

In this kind of system, we need to be careful to place memory pages on
the proper NUMA nodes based on the memory access frequency.  Otherwise,
some frequently accessed pages might reside on slow tiers, which causes
unexpected performance degradation.  Moreover, the memory access
patterns can change at runtime.

To handle this problem, we need a way to monitor the memory access
patterns and migrate pages based on their access temperature.  The
DAMON (Data Access MONitor) framework and its DAMOS (DAMON-based
Operation Schemes) can be useful features for monitoring and migrating
pages.  DAMOS provides multiple actions based on DAMON monitoring
results, and it can be used for proactive reclaim, which means swapping
cold pages out with the DAMOS_PAGEOUT action, but it doesn't support
migration actions such as demotion and promotion between tiered memory
nodes.

This series supports two new DAMOS actions: DAMOS_MIGRATE_HOT for
promotion from slow tiers and DAMOS_MIGRATE_COLD for demotion from fast
tiers.  This prevents hot pages from being stuck on slow tiers, which
would cause performance degradation, and cold pages can be proactively
demoted to slow tiers so that the system has a better chance of
allocating more hot pages on fast tiers.

DAMON provides various tuning knobs, but we found that the proactive
demotion of cold pages is especially useful when the system is running
out of memory on its fast tier nodes.

Our evaluation result shows that it reduces the performance slowdown
compared to the default memory policy from 17~18% to 4~5% when the
system runs under high memory pressure on its fast tier DRAM nodes.


DAMON configuration
===

The specific DAMON configuration doesn't have to be in the scope of this
patch series, but a rough idea is worth sharing to explain the
evaluation result.

DAMON provides many knobs for fine tuning, but its configuration file
is generated by HMSDK[3].  It includes a gen_config.py script that
generates a json file with the full config of DAMON knobs, and it
creates multiple kdamonds for each NUMA node when DAMON is enabled so
that it can run hot/cold-based migration for tiered memory.


Evaluation Workload
===

The performance evaluation is done with redis[4], which is a widely used
in-memory database, and the memory access patterns are generated via
YCSB[5].  We have measured two different workloads with zipfian and
latest distributions, but their configs are slightly modified to make
memory usage higher and execution time longer for better evaluation.

The idea of evaluating these migrate_{hot,cold} actions covers
system-wide memory management rather than partitioning hot/cold pages of
a single workload.  The default memory allocation policy creates pages
on the fast tier DRAM node first, then allocates newly created pages on
the slow tier CXL node when the DRAM node has insufficient free space.
Once the page allocation is done, those pages never move between NUMA
nodes.  That is not true when using NUMA balancing, but that is outside
the scope of this DAMON-based tiered memory management support.

If the working set of redis fits fully into the DRAM node, then redis
will access the fast DRAM only.  Since DRAM-only performance is faster
than partially accessing CXL memory in slow tiers, this environment is
not useful to evaluate