Re: [RFC v3 05/24] fs: add automatic kernel fs freeze / thaw and remove kthread freezing

2023-02-23 Thread Darrick J. Wong
On Fri, Jan 13, 2023 at 04:33:50PM -0800, Luis Chamberlain wrote:
> Add support to automatically handle freezing and thawing filesystems
> during the kernel's suspend/resume cycle.
> 
> This is needed so that we can properly stop IO in flight, without
> races, after userspace has been frozen. Without this we rely on
> kthread freezing, whose semantics are loose and error prone.
> For instance, even though a kthread may use try_to_freeze() and end
> up being frozen, we have no way of being sure that everything that
> has been spawned asynchronously from it (such as timers) has also
> been stopped.
> 
> A long term advantage of also adding filesystem freeze / thaw
> support during suspend / hibernation is that we may eventually be
> able to drop the kernel's kthread freezing completely, as it was
> originally added to stop disk IO in flight as we hibernate or
> suspend.
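
To illustrate the gap described above (a sketch, not from the patch;
set_freezable(), try_to_freeze() and the timer API are real kernel
interfaces, do_fs_work() is a hypothetical stand-in): the kthread
itself parks in the freezer, but a timer it armed keeps firing:

static struct timer_list my_timer;	/* hypothetical example */

static void my_timer_fn(struct timer_list *t)
{
	/* can still run, and queue IO, after the kthread is frozen */
	mod_timer(&my_timer, jiffies + HZ);
}

static int my_kthread(void *data)
{
	set_freezable();
	timer_setup(&my_timer, my_timer_fn, 0);
	mod_timer(&my_timer, jiffies + HZ);
	while (!kthread_should_stop()) {
		try_to_freeze();	/* freezes this thread... */
		do_fs_work();		/* ...but nothing stops the timer */
		schedule_timeout_interruptible(HZ);
	}
	del_timer_sync(&my_timer);
	return 0;
}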

Hooray!

One evil question though --

Say you have dm devices A and B.  Each has a distinct fs on it.
If you mount A and then B and initiate a suspend, that should result in
first B and then A freezing, right?

After resuming, you then change A's dm-table definition to point it
at a loop device backed by a file on B.

What happens now when you initiate a suspend?  B freezes, then A tries
to flush data to the loop-mounted file on B, but it's too late for that.
That sounds like a deadlock?

Though I don't know how much we care about this corner case, since (a)
freezing has been busted on xfs for years and (b) one can think up all
sorts of horrid ouroboros scenarios like:

Change A's dm-table to point to a loop-mounted file on B, and change B
to point to a loop-mounted file on A.  Then try to write to either
filesystem and see what kind of storm you get.

Anyway, just wondering if you'd thought about that kind of doomsday
scenario that a nutty sysadmin could set up.

The only way I can think of to solve that kind of thing would be to hook
filesystems and loop devices into the device model, make fs "device"
suspend actually freeze, hope the suspend code suspends from the leaves
inward, and hope I actually understand how the device model works (I
don't.)

--D

> This does not remove the superfluous freezer calls on all
> filesystems. Each filesystem must remove all of its kthread freezer
> usage and tag its fs_type flags as supporting auto-freezing with the
> FS_AUTOFREEZE flag.
> 
> Subsequent patches remove the kthread freezer usage from each
> filesystem, one at a time to make all this work bisectable.
> Once all filesystems remove the usage of the kthread freezer we
> can remove the FS_AUTOFREEZE flag.
> 
> Signed-off-by: Luis Chamberlain 
> ---
>  fs/super.c | 69 ++
>  include/linux/fs.h | 14 +
>  kernel/power/process.c | 15 -
>  3 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2f77fcb6e555..e8af4c8269ad 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1853,3 +1853,72 @@ int thaw_super(struct super_block *sb, bool usercall)
>   return 0;
>  }
>  EXPORT_SYMBOL(thaw_super);
> +
> +#ifdef CONFIG_PM_SLEEP
> +static bool super_should_freeze(struct super_block *sb)
> +{
> +	if (!(sb->s_type->fs_flags & FS_AUTOFREEZE))
> +		return false;
> +	/*
> +	 * We don't freeze virtual filesystems: skip filesystems with
> +	 * no backing device.
> +	 */
> +	if (sb->s_bdi == &noop_backing_dev_info)
> +		return false;
> +
> +	return true;
> +}
> +
> +int fs_suspend_freeze_sb(struct super_block *sb, void *priv)
> +{
> +	int error = 0;
> +
> +	if (!grab_lock_super(sb)) {
> +		pr_err("%s (%s): freezing failed to grab_super()\n",
> +		       sb->s_type->name, sb->s_id);
> +		return -ENOTTY;
> +	}
> +
> +	if (!super_should_freeze(sb))
> +		goto out;
> +
> +	pr_info("%s (%s): freezing\n", sb->s_type->name, sb->s_id);
> +
> +	error = freeze_super(sb, false);
> +	if (!error)
> +		lockdep_sb_freeze_release(sb);
> +	else if (error != -EBUSY)
> +		pr_notice("%s (%s): Unable to freeze, error=%d",
> +			  sb->s_type->name, sb->s_id, error);
> +
> +out:
> +	deactivate_locked_super(sb);
> +	return error;
> +}
> +
> +int fs_suspend_thaw_sb(struct super_block *sb, void *priv)
> +{
> +	int error = 0;
> +
> +	if (!grab_lock_super(sb)) {
> +		pr_err("%s (%s): thawing failed to grab_super()\n",
> +		       sb->s_type->name, sb->s_id);
> +		return -ENOTTY;
> +	}
> +
> +	if (!super_should_freeze(sb))
> +		goto out;
> +
> +	pr_info("%s (%s): thawing\n", sb->s_type->name, sb->s_id);
> +
> +	error = thaw_super(sb, false);
> +	if (error && error != -EBUSY)
> +		pr_notice("%s (%s): Unable to unfreeze, error=%d",
> +			  sb->s_type->name, sb->s_id, error);
> 
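
(The rest of the quoted diff, covering the include/linux/fs.h and
kernel/power/process.c hunks from the diffstat, is truncated above.)
For context, a hypothetical sketch of how a converted filesystem would
opt in, per the commit message above — foo_fs_type, foo_mount and the
FS_REQUIRES_DEV pairing are made-up illustration; only FS_AUTOFREEZE
comes from the patch:

static struct file_system_type foo_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "foo",
	.mount		= foo_mount,
	.kill_sb	= kill_block_super,
	/* opt in to automatic freeze/thaw on suspend/resume */
	.fs_flags	= FS_REQUIRES_DEV | FS_AUTOFREEZE,
};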

Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

2023-02-23 Thread Eric DeVolder

On 2/10/23 00:29, Sourabh Jain wrote:
On 10/02/23 01:09, Eric DeVolder wrote:
On 2/9/23 12:43, Sourabh Jain wrote:

Hello Eric,

On 09/02/23 23:01, Eric DeVolder wrote:
On 2/8/23 07:44, Thomas Gleixner wrote:

Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
On 2/1/23 05:33, Thomas Gleixner wrote:
So my latest solution is to introduce two new CPUHP states,
CPUHP_AP_ELFCOREHDR_ONLINE for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE
for offlining. I'm open to better names.

The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
CPUHP_BRINGUP_CPU. My attempts at locating this state failed when
inside the STARTING section, so I located it just inside the ONLINE
section. The crash hotplug handler is registered on this state as the
callback for the .startup method.

The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
CPUHP_TEARDOWN_CPU, and I placed it at the end of the PREPARE section.
The crash hotplug handler is also registered on this state as the
callback for the .teardown method.


TBH, that's still overengineered. Something like this:

bool cpu_is_alive(unsigned int cpu)
{
	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

	return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
}

and use this to query the actual state at crash time. That spares all
those callback heuristics.
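
Presumably that means something like this at crash time (a sketch, not
from the mail; the actual note-emission step is elided):

	unsigned int cpu;

	for_each_present_cpu(cpu) {
		if (!cpu_is_alive(cpu))
			continue;	/* CPU already dead: skip its notes */
		/* ... emit/keep the per-cpu crash note for this cpu ... */
	}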


I'm making my way through percpu crash_notes, elfcorehdr, vmcoreinfo,
makedumpfile and (the consumer of it all) the userspace crash utility,
in order to understand the impact of moving from for_each_present_cpu()
to for_each_online_cpu().


Is the packing actually worth the trouble? What's the actual win?

Thanks,

        tglx

Thomas,
I've investigated the passing of crash notes through the vmcore. What
I've learned is that:

- linux/fs/proc/vmcore.c (which makedumpfile references to do its job)
  does not care what the contents of the cpu PT_NOTEs are, but it does
  coalesce them together.

- makedumpfile will count the number of cpu PT_NOTEs in order to
  determine its nr_cpus variable, which is reported in a header, but
  otherwise unused (except for the sadump method).

- the crash utility, for the purposes of determining the cpus, does not
  appear to reference the elfcorehdr PT_NOTEs. Instead it locates the
  various cpu_[possible|present|online]_mask symbols and computes
  nr_cpus from that, and also of course which are online. In addition,
  when crash does reference a cpu PT_NOTE, to get its prstatus, it does
  so by using a percpu technique directly in the vmcore image memory,
  not via the ELF structure. Said differently, it appears to me that
  the crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather
  it obtains them via kernel cpumasks and the memory within the vmcore.

With this understanding, I did some testing. Perhaps the most telling
test was that I changed the number of cpu PT_NOTEs emitted in
crash_prepare_elf64_headers() to just 1, hot plugged some cpus, then
also took a few offline sparsely via chcpu, then generated a vmcore.
The crash utility had no problem loading the vmcore; it reported the
proper number of cpus and the number offline (despite only one cpu
PT_NOTE), and changing to a different cpu via 'set -c 30' gave a
completely valid backtrace.

My takeaway is that the crash utility does not rely upon the ELF cpu
PT_NOTEs; it obtains the cpu information directly from kernel data
structures. Perhaps at one time crash relied upon the ELF information,
but no more. (Perhaps there are other crash dump analyzers that might
rely on the ELF info?)

So, all this to say that I see no need to change
crash_prepare_elf64_headers(). There is no compelling reason to move
away from for_each_present_cpu(), or to modify the list for
online/offline.

Which then leaves the topic of the cpuhp state on which to register.
Perhaps reverting back to the use of CPUHP_BP_PREPARE_DYN is the right
answer. There does not appear to be a compelling need to accurately
track whether the cpu went online/offline for the purposes of creating
the elfcorehdr, as ultimately the crash utility pulls that from kernel
data structures, not the elfcorehdr.

I think this is what Sourabh has known and has been advocating for: an
optimization path that allows not regenerating the elfcorehdr on cpu
changes (because the percpu structs are already laid out). I do think
it best to leave that as an arch choice.


Since things are clear on how the PT_NOTEs are consumed in the kdump
kernel [fs/proc/vmcore.c], makedumpfile, and the crash tool, I need
your opinion on this:

Do we really need to regenerate the elfcorehdr for CPU hotplug events?
If yes, can you please list the elfcorehdr components that change due
to CPU hotplug?

Due to the use of for_each_present_cpu(), it is possible for the
number of cpu PT_NOTEs to fluctuate as cpus are un/plugged.
Onlining/offlining of cpus does not impact the number of cpu PT_NOTEs
(as the cpus are still present).
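
For reference, the loop in question, paraphrased from
crash_prepare_elf64_headers() (in kernel/kexec_file.c at the time;
details may differ by kernel version), emits one PT_NOTE program
header per *present* cpu, so the count changes only on physical
un/plug:

	for_each_present_cpu(cpu) {
		phdr->p_type = PT_NOTE;
		/* physical address of this cpu's crash_notes buffer */
		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
		phdr->p_offset = phdr->p_paddr = notes_addr;
		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
		(ehdr->e_phnum)++;
		phdr++;
	}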



From what I understood, crash notes are prepared for

[PATCH v3] arm64: kdump: simplify the reservation behaviour of crashkernel=,high

2023-02-23 Thread Baoquan He
On arm64, reservation for 'crashkernel=xM,high' is done by searching
for a suitable memory region top down. If the 'xM' of crashkernel high
memory is reserved from high memory successfully, it will then try to
reserve crashkernel low memory accordingly. Otherwise, it will try to
search low memory for a suitable 'xM' region. Please see the details in
Documentation/admin-guide/kernel-parameters.txt.

However, we observed an unexpected case where a reserved region
crosses the high and low memory boundary. E.g. on a system with 4G as
the low memory end, a user added the kernel parameter
'crashkernel=512M,high' and could finally have the regions
[4G-126M, 4G+386M] and [1G, 1G+128M] in the running kernel. A
crashkernel high region crossing the low and high memory boundary
brings issues:

1) For crashkernel=x,high, if the crashkernel high region crosses the
low and high memory boundary, the user will see two memory regions in
low memory and one memory region in high memory. The two crashkernel
low memory regions are confusing, as shown in the above example.

2) If people explicitly specify "crashkernel=x,high crashkernel=y,low"
with y <= 128M, and the crashkernel high region crosses the low and
high memory boundary while the part of the high reservation below the
boundary is bigger than y, the expected crashkernel low reservation
will be skipped. Meanwhile the expected crashkernel high reservation
is shrunk and cannot satisfy the user space requirement.

3) The crossing-boundary behaviour of the crashkernel high reservation
differs from the x86 arch. On x86_64, the low memory end is fixed at
4G, and the memory near 4G is reserved by the system, e.g. for mapping
firmware and PCI, so a crashkernel reservation crossing the boundary
never happens there. From the distros' point of view, this brings
inconsistency and confusion; users need to dig into x86 and arm64
system details to find out why.

For the kernel itself, the impact of issue 3) is slight, while issues
1) and 2) cause actual impact because they bring obscure semantics and
behaviour to the crashkernel=,high reservation.

Here, for crashkernel=xM,high, search for a suitable region only in
high memory. If that fails, try reserving a suitable region only in
low memory. This way, the crashkernel high region will only ever exist
in high memory, and the crashkernel low region only in low memory. The
reservation behaviour for crashkernel=,high becomes clearer and
simpler.
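
Schematically, the new flow is (a sketch of the intended logic, not
the literal patch; the tail of the diff below is truncated):

	/* crashkernel=x,high: search only high memory first ... */
	crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
					       CRASH_ADDR_LOW_MAX,
					       CRASH_ADDR_HIGH_MAX);
	if (!crash_base)
		/* ... else fall back to a region entirely in low memory */
		crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
						       0, CRASH_ADDR_LOW_MAX);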

Note: On arm64, the high and low memory boundary is 1G on RPi4
systems, and 4G on other normal systems.

Signed-off-by: Baoquan He 
---
v2->v3:
 - Rephrase patch log to clarify that the current crashkernel high
   reservation could cross the high and low memory boundary, not only
   the 4G boundary, because RPi4 on arm64 has 1G as the high/low
   memory boundary. The v2 patch log could mislead people into
   thinking that RPi4 also uses 4G as the high/low memory boundary.
v1->v2:
 - Fold patch 2 of v1 into patch 1 for better reviewing.
 - Update patch log to add more details.
 arch/arm64/mm/init.c | 43 +--
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 58a0bb2c17f1..b8cb780df0cb 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -127,12 +127,13 @@ static int __init reserve_crashkernel_low(unsigned long long low_size)
  */
 static void __init reserve_crashkernel(void)
 {
-	unsigned long long crash_base, crash_size;
-	unsigned long long crash_low_size = 0;
+	unsigned long long crash_base, crash_size, search_base;
 	unsigned long long crash_max = CRASH_ADDR_LOW_MAX;
+	unsigned long long crash_low_size = 0;
 	char *cmdline = boot_command_line;
-	int ret;
 	bool fixed_base = false;
+	bool high = false;
+	int ret;
 
 	if (!IS_ENABLED(CONFIG_KEXEC_CORE))
 		return;
@@ -155,7 +156,9 @@ static void __init reserve_crashkernel(void)
 		else if (ret)
 			return;
 
+		search_base = CRASH_ADDR_LOW_MAX;
 		crash_max = CRASH_ADDR_HIGH_MAX;
+		high = true;
 	} else if (ret || !crash_size) {
 		/* The specified value is invalid */
 		return;
@@ -166,31 +169,51 @@ static void __init reserve_crashkernel(void)
 	/* User specifies base address explicitly. */
 	if (crash_base) {
 		fixed_base = true;
+		search_base = crash_base;
 		crash_max = crash_base + crash_size;
 	}
 
 retry:
 	crash_base = memblock_phys_alloc_range(crash_size, CRASH_ALIGN,
-					       crash_base, crash_max);
+					       search_base, crash_max);
 	if (!crash_base) {
 		/*
-		 * If the first attempt was for low memory, fall back to
-		 * high memory, the minimum required low memory will be
-		 * 

Re: [PATCH v2] x86: add devicetree support

2023-02-23 Thread Simon Horman
On Thu, Feb 23, 2023 at 07:01:07AM +0100, Julian Winkler wrote:
> Since the Linux kernel has dropped support for the Simple Firmware
> Interface (SFI), the only way to boot newer versions on the Intel MID
> platform is using devicetree.
> 
> Signed-off-by: Julian Winkler 
> ---
> V2: added Signed-off-by to commit message

Thanks, applied.
