Re: panic context: was: Re: [PATCH printk v2 04/11] printk: nbcon: Provide functions to mark atomic write sections
Hi Dave,

On 2023-10-16, Dave Young wrote:
>> > Does anyone really want explicit flushes in panic()?
>>
>> So far you are the only one speaking against it. I expect as time
>> goes on it will get even more complex as it becomes tunable (also
>> something we talked about during the demo).
>
> Flush consoles in panic kexec case sounds not good, but I have no deep
> understanding about the atomic printk series, added kexec list and
> reviewers in cc.

Currently every printk() message tries to flush immediately.

This series introduced a new method of first allowing a set of printk()
messages to be stored to the ringbuffer and then flushing the full
set. That is what this discussion was about.

The issue with allowing a set of printk() messages to be stored is that
you need to explicitly mark in code where the actual flushing should
occur. Petr's argument is that we do not want to insert "flush points"
into the panic() function and instead we should do as we do now: flush
each printk() message immediately.

In the end (for my upcoming v3 series) I agreed with Petr. We will
continue to keep things as they are now: flush each printk() message
immediately.

Currently consoles try to flush unsafely before kexec. With the atomic
printk series our goal is to only perform _safe_ flushing until all
other panic operations are complete. Only at the very end of panic()
would unsafe flushing be attempted.

John

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
[PATCH] docs: gdbmacros: print newest record
@head_id points to the newest record, but the printing loop
exits when it increments to this value (before printing).

Exit the printing loop after the newest record has been printed.

The python-based function in scripts/gdb/linux/dmesg.py already
does this correctly.

Fixes: e60768311af8 ("scripts/gdb: update for lockless printk ringbuffer")
Cc: sta...@vger.kernel.org
Signed-off-by: John Ogness
---
 Documentation/admin-guide/kdump/gdbmacros.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
index 82aecdcae8a6..030de95e3e6b 100644
--- a/Documentation/admin-guide/kdump/gdbmacros.txt
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -312,10 +312,10 @@ define dmesg
 			set var $prev_flags = $info->flags
 		end
 
-		set var $id = ($id + 1) & $id_mask
 		if ($id == $end_id)
 			loop_break
 		end
+		set var $id = ($id + 1) & $id_mask
 	end
 end
 document dmesg

base-commit: 1b929c02afd37871d5afb9d498426f83432e71c2
-- 
2.30.2
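
For readers who do not speak gdb macro syntax, the ordering fix can be sketched in plain C. This is only an illustration of the loop shape (print, then test for the end, then advance), not the kernel's or gdb's code; walk_records() and its parameters are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Visit every record ID from tail_id to head_id, both inclusive, on an
 * ID space that wraps at id_mask + 1. Visited IDs are written to out[]
 * and the number of visited IDs (at most out_len) is returned.
 *
 * walk_records() is a hypothetical helper for illustration only.
 */
static size_t walk_records(unsigned long tail_id, unsigned long head_id,
			   unsigned long id_mask,
			   unsigned long *out, size_t out_len)
{
	unsigned long id = tail_id & id_mask;
	unsigned long end_id = head_id & id_mask;
	size_t n = 0;

	/* id_mask must have the form 2^k - 1 for the wrap to work. */
	assert(((id_mask + 1) & id_mask) == 0);

	while (n < out_len) {
		out[n++] = id;		/* "print" the record first ... */
		if (id == end_id)	/* ... then stop after the newest one */
			break;
		id = (id + 1) & id_mask;
	}

	return n;
}
```

With the increment placed before the end test (the old order), the record at head_id would never reach the "print" step, which is the bug described above.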
Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path
On 2022-05-10, Steven Rostedt wrote:
>> As already mentioned in the other reply, panic() sometimes stops the
>> other CPUs using NMI, for example, see kdump_nmi_shootdown_cpus().
>>
>> Another situation is when the CPU using the lock ends in some
>> infinite loop because something went wrong. The system is in
>> an unpredictable state during panic().
>>
>> I am not sure if this is possible with the code under gsmi_dev.lock
>> but such things really happen during panic() in other subsystems.
>> Using trylock in the panic() code path is a good practice.
>
> I believe that Peter Zijlstra had a special spin lock for NMIs or
> early printk, where it would not block if the lock was held on the
> same CPU. That is, if an NMI happened and paniced while this lock was
> held on the same CPU, it would not deadlock. But it would block if the
> lock was held on another CPU.

Yes. And starting with 5.19 it will be carrying the name that _you_ came
up with (cpu_sync):

printk_cpu_sync_get_irqsave()
printk_cpu_sync_put_irqrestore()

John
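
The locking behavior Steven describes (re-entry allowed on the owning CPU, exclusion against every other CPU) can be sketched in userspace C11. This is a simplified try-lock model under stated assumptions, not the kernel implementation: struct cpu_sync, cpu_sync_try_get() and cpu_sync_put() are hypothetical names, the CPU number is passed in explicitly, and interrupt handling is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical model of a lock that is re-entrant per CPU. */
struct cpu_sync {
	atomic_int owner;	/* CPU holding the lock, or NO_OWNER */
	int nested;		/* re-entry depth, only touched by the owner */
};

static void cpu_sync_init(struct cpu_sync *s)
{
	atomic_init(&s->owner, NO_OWNER);
	s->nested = 0;
}

/* Returns true if acquired or re-entered, false if another CPU owns it. */
static bool cpu_sync_try_get(struct cpu_sync *s, int cpu)
{
	int expected = NO_OWNER;

	assert(cpu >= 0);

	if (atomic_compare_exchange_strong(&s->owner, &expected, cpu))
		return true;		/* lock was free, now ours */

	if (expected == cpu) {
		s->nested++;		/* same CPU (e.g. an NMI): no deadlock */
		return true;
	}

	return false;			/* held by a different CPU */
}

static void cpu_sync_put(struct cpu_sync *s, int cpu)
{
	assert(atomic_load(&s->owner) == cpu);

	if (s->nested > 0) {
		s->nested--;		/* leave a nested section only */
		return;
	}

	atomic_store(&s->owner, NO_OWNER);
}
```

A caller that must wait would spin on cpu_sync_try_get() until it returns true; in a panic path, giving up after a failed attempt is the safer choice, for the reasons quoted above.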
[PATCH printk v4 0/6] printk: remove safe buffers
Hi,

Here is v4 of a series to remove the safe buffers. v3 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
implemented for this series.

The safe buffers also implicitly provided serialization
between multiple CPUs executing in NMI context. This was
particularly necessary for the nmi_backtrace() output. This
serialization is now preserved by using the printk cpulock.

With the removal of the safe buffers, there is no need for
extra NMI enter/exit tracking. So this is also removed (which
includes removing the config option CONFIG_PRINTK_NMI).

And finally, there are a few places in the kernel that need to
specify code blocks where all printk calls are to be deferred
printing. Previously the NMI tracking API was being (mis)used
for this purpose. This series introduces an official and
explicit interface for such cases. (Note that all deferred
printing will be removed anyway, once printing kthreads are
introduced.)

Changes since v3:

- Remove safe context tracking in vprintk().

- Add safe context tracking for @console_owner usage since that
  is also a component of the printing code.

- Refactor syslog_print() so that it is easier to understand and
  follow the locking logic.

- Introduce printk_deferred_enter/exit functions to be used by
  code that needs to specify code blocks where all printk calls
  are to be deferred printing.

John Ogness

[0] https://lore.kernel.org/lkml/2021062448.5190-1-john.ogn...@linutronix.de

John Ogness (6):
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c          |   4 +-
 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c     |   2 +-
 include/linux/hardirq.h        |   2 -
 include/linux/printk.h         |  41 ++--
 init/Kconfig                   |   5 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |  25 ---
 kernel/printk/printk.c         | 268 ++--
 kernel/printk/printk_safe.c    | 364 +
 kernel/trace/trace.c           |   4 +-
 lib/nmi_backtrace.c            |  13 +-
 14 files changed, 194 insertions(+), 544 deletions(-)

base-commit: 70333dec446292cd896cd051d2ebd6808b328949
-- 
2.20.1
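
The "explicit recursion protection" mentioned in the cover letter can be pictured as a depth counter that is checked on entry to the message store path. The following is a minimal userspace sketch under assumptions, not the code from the series: struct recursion_ctx, log_enter(), log_exit(), log_store() and the MAX_DEPTH value are all hypothetical, and a real implementation would keep one counter per CPU and per context.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 4	/* hypothetical limit: outermost call plus 3 nested */

/* Hypothetical per-context recursion state. */
struct recursion_ctx {
	unsigned int depth;
};

/* Returns false if the caller is already nested too deeply. */
static bool log_enter(struct recursion_ctx *ctx)
{
	if (ctx->depth >= MAX_DEPTH)
		return false;
	ctx->depth++;
	return true;
}

static void log_exit(struct recursion_ctx *ctx)
{
	assert(ctx->depth > 0);
	ctx->depth--;
}

/*
 * Store one message. @nested_calls simulates the store path itself
 * triggering further messages (for example a warning inside the
 * logging code). Returns how many messages were actually stored.
 */
static unsigned int log_store(struct recursion_ctx *ctx,
			      unsigned int nested_calls)
{
	unsigned int stored;

	if (!log_enter(ctx))
		return 0;	/* drop instead of recursing without bound */

	stored = 1;
	if (nested_calls > 0)
		stored += log_store(ctx, nested_calls - 1);

	log_exit(ctx);
	return stored;
}
```

The point of the design is that a limited amount of recursion is still allowed, so a warning raised from inside the logging code has a chance of being recorded, while runaway recursion is cut off.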
[PATCH printk v4 3/6] printk: remove safe buffers
With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console and console_owner locks is left in place. This is because the console and console_owner locks are needed for the actual printing. Signed-off-by: John Ogness --- arch/powerpc/kernel/traps.c| 1 - arch/powerpc/kernel/watchdog.c | 5 - include/linux/printk.h | 10 - kernel/kexec_core.c| 1 - kernel/panic.c | 3 - kernel/printk/internal.h | 17 -- kernel/printk/printk.c | 120 +--- kernel/printk/printk_safe.c| 335 + lib/nmi_backtrace.c| 6 - 9 files changed, 48 insertions(+), 450 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index b4ab95c9e94a..2522800217d1 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -170,7 +170,6 @@ extern void panic_flush_kmsg_start(void) extern void panic_flush_kmsg_end(void) { - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); bust_spinlocks(0); debug_locks_off(); diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index c9a8f4781a10..dc17d8903d4f 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb) wd_smp_unlock(&flags); - printk_safe_flush(); - /* -* printk_safe_flush() seems to require another print -* before anything actually goes out to console. 
-*/ if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace(); diff --git a/include/linux/printk.h b/include/linux/printk.h index 1790a5521fd9..664612f75dac 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_flush(void); -extern void printk_safe_flush_on_panic(void); #else static inline __printf(1, 0) int vprintk(const char *s, va_list args) @@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char *log_lvl) static inline void dump_stack(void) { } - -static inline void printk_safe_flush(void) -{ -} - -static inline void printk_safe_flush_on_panic(void) -{ -} #endif #ifdef CONFIG_SMP diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index f099baee3578..69c6e9b7761c 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -978,7 +978,6 @@ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ - printk_safe_flush_on_panic(); __crash_kexec(regs); /* diff --git a/kernel/panic.c b/kernel/panic.c index 332736a72a58..1f0df42f8d0c 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -247,7 +247,6 @@ void panic(const char *fmt, ...) * Bypass the panic_cpu check and call __crash_kexec directly. */ if (!_crash_kexec_post_notifiers) { - printk_safe_flush_on_panic(); __crash_kexec(NULL); /* @@ -271,8 +270,6 @@ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); - /* Call flush even twice. 
It tries harder with a single online CPU */ - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index 51615c909b2f..6cc35c5de890 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list args); void __printk_safe_enter(void); void __printk_safe_exit(void); -void printk_safe_init(void); bool printk_percpu_data_ready(void); #define printk_safe_enter_irqsave(flags) \ @@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void); local_irq_restore(flags); \ } while (0) -#define printk_safe_enter_irq()\ - do {\ - local_irq_disable();\ - __print
Re: [PATCH printk v3 3/6] printk: remove safe buffers
On 2021-06-24, Petr Mladek wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1852,7 +1839,7 @@ static int console_trylock_spinning(void)
>>  	if (console_trylock())
>>  		return 1;
>>
>> -	printk_safe_enter_irqsave(flags);
>> +	local_irq_save(flags);
>>
>>  	raw_spin_lock(&console_owner_lock);
>
> This spin_lock is in the printk() path. We must make sure that
> it does not cause deadlock.
>
> printk_safe_enter_irqsave(flags) prevented the recursion because
> it deferred the console handling.
>
> One danger might be a lockdep report triggered by
> raw_spin_lock(&console_owner_lock) itself. But it should be safe.
> lockdep is checked before the lock is actually taken
> and lockdep should disable itself before printing anything.
>
> Another danger might be any printk() called under the lock.
> The code just compares and assigns values to some variables
> (static, on stack) so we should be on the safe side.
>
> Well, I would feel more comfortable if we add printk_safe_enter_irqsave()
> back around the sections guarded by this lock. It can be done
> in a separate patch. The code looks safe at the moment.

You are correct. printk_safe should also be wrapping
@console_owner_lock locking.

>> @@ -2716,19 +2700,22 @@ void console_unlock(void)
>>  		 * were to occur on another CPU, it may wait for this one to
>>  		 * finish. This task can not be preempted if there is a
>>  		 * waiter waiting to take over.
>> +		 *
>> +		 * Interrupts are disabled because the hand over to a waiter
>> +		 * must not be interrupted until the hand over is completed
>> +		 * (@console_waiter is cleared).
>>  		 */
>> +		local_irq_save(flags);
>>  		console_lock_spinning_enable();
>
> Same here. console_lock_spinning_enable() takes console_owner_lock.
> I would feel more comfortable if we added printk_safe_enter_irqsave(flags)
> inside console_lock_spinning_enable() around the locked code. The code
> is safe at the moment but...

Agreed.

>>  		stop_critical_timings();	/* don't trace print latency */
>>  		call_console_drivers(ext_text, ext_len, text, len);
>>  		start_critical_timings();
>>
>> -		if (console_lock_spinning_disable_and_check()) {
>> -			printk_safe_exit_irqrestore(flags);
>> +		handover = console_lock_spinning_disable_and_check();
>
> Same here. Also console_lock_spinning_disable_and_check() takes
> console_owner_lock. It looks safe at the moment but...

Agreed.

>> --- a/kernel/printk/printk_safe.c
>> +++ b/kernel/printk/printk_safe.c
>> @@ -369,7 +70,10 @@ asmlinkage int vprintk(const char *fmt, va_list args)
>>  	 * Use the main logbuf even in NMI. But avoid calling console
>>  	 * drivers that might have their own locks.
>>  	 */
>> -	if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
>> +	if (this_cpu_read(printk_context) &
>> +	    (PRINTK_NMI_DIRECT_CONTEXT_MASK |
>> +	     PRINTK_NMI_CONTEXT_MASK |
>> +	     PRINTK_SAFE_CONTEXT_MASK)) {
>>  		unsigned long flags;
>>  		int len;
>>
>
> There is the following code right below:
>
>		printk_safe_enter_irqsave(flags);
>		len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, fmt, args);
>		printk_safe_exit_irqrestore(flags);
>		defer_console_output();
>		return len;
>
> printk_safe_enter_irqsave(flags) is not needed here. Any nested
> printk() ends here as well.

Ah, I missed that one. Good eye!

> Against this can be done in a separate patch. Well, the commit message
> mentions that the printk_safe context is removed everywhere except
> for the code manipulating console lock. But is it just a detail.

I would prefer a v4 with these fixes:

- wrap @console_owner_lock with printk_safe usage

- remove unnecessary printk_safe usage from printk_safe.c

- update commit message to say that safe context tracking is left in
  place for both the console and console_owner locks

John Ogness
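
As a rough picture of the vprintk() check discussed above: the per-CPU context word carries a "safe" nesting counter plus NMI bits, and any non-zero bit under the combined mask means "store now, print later". The sketch below is a userspace model under assumptions. The mask values, enum print_path, choose_print_path() and the ctx_safe_*() helpers are hypothetical and chosen only for illustration; they are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical bit layout, for illustration only. */
#define CTX_SAFE_MASK		0x00ffu	/* nesting counter for "safe" sections */
#define CTX_NMI_DIRECT_MASK	0x0100u
#define CTX_NMI_MASK		0xfe00u

enum print_path {
	PRINT_DIRECT,	/* store the message and try the consoles now */
	PRINT_DEFERRED,	/* store the message, defer console output */
};

/* Mirrors the shape of the combined mask test in the quoted hunk. */
static enum print_path choose_print_path(unsigned int ctx)
{
	if (ctx & (CTX_NMI_DIRECT_MASK | CTX_NMI_MASK | CTX_SAFE_MASK))
		return PRINT_DEFERRED;
	return PRINT_DIRECT;
}

/* Enter a section in which console output must be deferred. */
static void ctx_safe_enter(unsigned int *ctx)
{
	/* The counter must not overflow into the NMI bits. */
	assert((*ctx & CTX_SAFE_MASK) != CTX_SAFE_MASK);
	(*ctx)++;
}

static void ctx_safe_exit(unsigned int *ctx)
{
	assert((*ctx & CTX_SAFE_MASK) != 0);
	(*ctx)--;
}
```

In this model, wrapping the @console_owner_lock sections means bracketing them with ctx_safe_enter()/ctx_safe_exit(), so that a printk() issued while the lock is held takes the deferred path instead of trying to take the same lock again.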
[PATCH printk v3 3/6] printk: remove safe buffers
With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console lock is left in place. This is because the console lock is needed for the actual printing. Signed-off-by: John Ogness --- arch/powerpc/kernel/traps.c| 1 - arch/powerpc/kernel/watchdog.c | 5 - include/linux/printk.h | 10 - kernel/kexec_core.c| 1 - kernel/panic.c | 3 - kernel/printk/internal.h | 17 -- kernel/printk/printk.c | 126 + kernel/printk/printk_safe.c| 332 + lib/nmi_backtrace.c| 6 - 9 files changed, 51 insertions(+), 450 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index a44a30b0688c..5828c83eaca6 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -171,7 +171,6 @@ extern void panic_flush_kmsg_start(void) extern void panic_flush_kmsg_end(void) { - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); bust_spinlocks(0); debug_locks_off(); diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index c9a8f4781a10..dc17d8903d4f 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb) wd_smp_unlock(&flags); - printk_safe_flush(); - /* -* printk_safe_flush() seems to require another print -* before anything actually goes out to console. 
-*/ if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace(); diff --git a/include/linux/printk.h b/include/linux/printk.h index 1790a5521fd9..664612f75dac 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_flush(void); -extern void printk_safe_flush_on_panic(void); #else static inline __printf(1, 0) int vprintk(const char *s, va_list args) @@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char *log_lvl) static inline void dump_stack(void) { } - -static inline void printk_safe_flush(void) -{ -} - -static inline void printk_safe_flush_on_panic(void) -{ -} #endif #ifdef CONFIG_SMP diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index a0b6780740c8..480d5f77ef4f 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ - printk_safe_flush_on_panic(); __crash_kexec(regs); /* diff --git a/kernel/panic.c b/kernel/panic.c index 332736a72a58..1f0df42f8d0c 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -247,7 +247,6 @@ void panic(const char *fmt, ...) * Bypass the panic_cpu check and call __crash_kexec directly. */ if (!_crash_kexec_post_notifiers) { - printk_safe_flush_on_panic(); __crash_kexec(NULL); /* @@ -271,8 +270,6 @@ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); - /* Call flush even twice. 
It tries harder with a single online CPU */ - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index 51615c909b2f..6cc35c5de890 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list args); void __printk_safe_enter(void); void __printk_safe_exit(void); -void printk_safe_init(void); bool printk_percpu_data_ready(void); #define printk_safe_enter_irqsave(flags) \ @@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void); local_irq_restore(flags); \ } while (0) -#define printk_safe_enter_irq()\ - do {\ - local_irq_disable();\ - __printk_safe_enter(); \ - } while (0
[PATCH printk v3 0/6] printk: remove safe buffers
Hi,

Here is v3 of a series to remove the safe buffers. v2 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
implemented for this series.

The safe buffers also implicitly provided serialization
between multiple CPUs executing in NMI context. This was
particularly necessary for the nmi_backtrace() output. This
serialization is now preserved by using the printk_cpu_lock.

And finally, with the removal of the safe buffers, there is no
need for extra NMI enter/exit tracking. So this is also removed
(which includes removing config option CONFIG_PRINTK_NMI).

Changes since v2:

- Move irq disabling/enabling out of the
  console_lock_spinning_*() functions to simplify the patches
  and keep the function prototypes simple.

- Change printk_enter_irqsave()/printk_exit_irqrestore() to
  macros to allow a more common calling convention for irq
  flags.

- Use the counter pointer from printk_enter_irqsave() in
  printk_exit_irqrestore() rather than fetching it again. This
  avoids any possible race conditions when printk's percpu
  flag is set.

- Use the printk_cpu_lock to serialize banner and regs with
  the stack dump in nmi_cpu_backtrace().

John Ogness

[0] https://lore.kernel.org/lkml/20210330153512.1182-1-john.ogn...@linutronix.de

John Ogness (6):
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c          |   2 -
 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c     |   3 -
 include/linux/hardirq.h        |   2 -
 include/linux/printk.h         |  22 --
 init/Kconfig                   |   5 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |  23 ---
 kernel/printk/printk.c         | 273 +++--
 kernel/printk/printk_safe.c    | 361 +
 kernel/trace/trace.c           |   2 -
 lib/nmi_backtrace.c            |  13 +-
 14 files changed, 176 insertions(+), 540 deletions(-)

base-commit: 48e72544d6f06daedbf1d9b14610be89dba67526
-- 
2.20.1
Re: [PATCH printk v2 2/5] printk: remove safe buffers
On 2021-04-01, Petr Mladek wrote:
>> Caller-id solves this problem and is easy to sort for anyone with
>> `grep'. Yes, it is a shame that `dmesg' does not show it, but
>> directly using any of the printk interfaces does show it (kmsg_dump,
>> /dev/kmsg, syslog, console).
>
> True but frankly, the current situation is _far_ from convenient:
>
>    + consoles do not show it by default
>    + none userspace tool (dmesg, journalctl, crash) is able to show it
>    + grep is a nightmare, especially if you have more than handful of CPUs
>
> Yes, everything is solvable but not easily.
>
>> > I get this with "echo l >/proc/sysrq-trigger" and this patchset:
>>
>> Of course. Without caller-id, it is a mess. But this has nothing to do
>> with NMI. The same problem exists for WARN_ON() on multiple CPUs
>> simultaneously. If the user is not using caller-id, they are
>> lost. Caller-id is the current solution to the interlaced logs.
>
> Sure. But in reality, the risk of mixed WARN_ONs is small. While
> this patch makes backtraces from all CPUs always unusable without
> caller_id and non-trivial effort.

I would prefer we solve the situation for non-NMI as well, not just for
the sysrq "l" case.

>> For the long term, we should introduce a printk-context API that allows
>> callers to perfectly pack their multi-line output into a single
>> entry. We discussed [0][1] this back in August 2020.
>
> We need a "short" term solution. There are currently 3 solutions:
>
> 1. Keep nmi_safe() and all the hacks around.
>
> 2. Serialize nmi_cpu_backtrace() by a spin lock and later by
>    the special lock used also by atomic consoles.
>
> 3. Tell complaining people how to sort the messed logs.

Or we look into the long term solution now. If caller-ids cannot be
used as the solution (because nobody turns it on, nobody knows about it,
and/or distros do not enable it), then we should look at how to make at
least the backtraces contiguous. I have a few ideas here.

John Ogness
[PATCH printk v2 0/5] printk: remove safe buffers
Hi,

Here is v2 of a series to remove the safe buffers. v1 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
also implemented for this series.

And finally, with the removal of the safe buffers, there is no
need for extra NMI enter/exit tracking. So this is also removed
(which includes removing config option CONFIG_PRINTK_NMI).

This series is based on the printk-rework branch of
printk/linux.git:

commit acebb5597ff1 ("kernel/printk.c: Fixed mundane typos")

Changes since v1:

- remove the printk nmi enter/exit tracking

- remove CONFIG_PRINTK_NMI config option

- use in_nmi() to detect NMI context

- remove unused printk_safe_enter/exit macros

- after switching to the dynamic buffer, copy over NMI records
  that may have arrived during the switch window

- use local_irq_*() instead of printk_safe_*() for console
  spinning

- use separate variables rather than arrays for the per-cpu
  recursion tracking

- make @syslog_lock a mutex instead of a spin_lock

- close the wait-read window for SYSLOG_ACTION_READ

- adjust various comments and commit messages as requested

John Ogness

[0] https://lore.kernel.org/lkml/20210316233326.10778-1-john.ogn...@linutronix.de

John Ogness (5):
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c          |   2 -
 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c     |   3 -
 include/linux/hardirq.h        |   2 -
 include/linux/printk.h         |  22 --
 init/Kconfig                   |   5 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |  23 ---
 kernel/printk/printk.c         | 281 +++--
 kernel/printk/printk_safe.c    | 362 +
 kernel/trace/trace.c           |   2 -
 lib/nmi_backtrace.c            |   6 -
 14 files changed, 171 insertions(+), 547 deletions(-)

-- 
2.20.1
Re: [PATCH printk v2 2/5] printk: remove safe buffers
On 2021-04-01, Petr Mladek wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1142,24 +1128,37 @@ void __init setup_log_buf(int early)
>>  		 new_descs, ilog2(new_descs_count),
>>  		 new_infos);
>>
>> -	printk_safe_enter_irqsave(flags);
>> +	local_irq_save(flags);
>
> IMHO, we actually do not have to disable IRQ here. We already copy
> messages that might appear in the small race window in NMI. It would
> work the same way also for IRQ context.

We do not have to, but why open up this window? We are still in early
boot and interrupts have always been disabled here. I am not happy that
this window even exists. I really prefer to keep it NMI-only.

>> --- a/lib/nmi_backtrace.c
>> +++ b/lib/nmi_backtrace.c
>> @@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>>  		touch_softlockup_watchdog();
>>  	}
>>
>> -	/*
>> -	 * Force flush any remote buffers that might be stuck in IRQ context
>> -	 * and therefore could not run their irq_work.
>> -	 */
>> -	printk_safe_flush();
>
> Sigh, this reminds me that the nmi_safe buffers serialized backtraces
> from all CPUs.
>
> I am afraid that we have to put back the spinlock into
> nmi_cpu_backtrace().

Please no. That spinlock is a disaster. It can cause deadlocks with
other cpu-locks (such as in kdb) and it will cause a major problem for
atomic consoles. We need to be very careful about introducing locks
where NMIs are waiting on other CPUs.

> It has been repeatedly added and removed depending
> on whether the backtrace was printed into the main log buffer
> or into the per-CPU buffers. Last time it was removed by
> the commit 03fc7f9c99c1e7ae2925d ("printk/nmi: Prevent deadlock
> when accessing the main log buffer in NMI").
>
> It should be safe because there should not be any other locks in the
> code path. Note that only one backtrace might be triggered at the same
> time, see @backtrace_flag in nmi_trigger_cpumask_backtrace().

It is adding a lock around a lockless ringbuffer. For me that is a step
backwards.

> We _must_ serialize it somehow[*]. The lock in nmi_cpu_backtrace()
> looks less evil than the nmi_safe machinery. nmi_safe() shrinks
> too long backtraces, lose timestamps, needs to be explicitely
> flushed here and there, is a non-trivial code.
>
> [*] Non-serialized bactraces are real mess. Caller-id is visible
>     only on consoles or via syslogd interface. And it is not much
>     convenient.

Caller-id solves this problem and is easy to sort for anyone with
`grep'. Yes, it is a shame that `dmesg' does not show it, but directly
using any of the printk interfaces does show it (kmsg_dump, /dev/kmsg,
syslog, console).

> I get this with "echo l >/proc/sysrq-trigger" and this patchset:

Of course. Without caller-id, it is a mess. But this has nothing to do
with NMI. The same problem exists for WARN_ON() on multiple CPUs
simultaneously. If the user is not using caller-id, they are
lost. Caller-id is the current solution to the interlaced logs.

For the long term, we should introduce a printk-context API that allows
callers to perfectly pack their multi-line output into a single
entry. We discussed [0][1] this back in August 2020.

John Ogness

[0] https://lore.kernel.org/lkml/472f2e553805b52d9834d64e4056db965edee329.ca...@perches.com
[1] offlist message-id: 87d03k9ymz@jogness.linutronix.de
Re: [PATCH printk v2 2/5] printk: remove safe buffers
On 2021-03-30, John Ogness wrote:
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index e971c0a9ec9e..f090d6a1b39e 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1772,16 +1759,21 @@ static struct task_struct *console_owner;
>  static bool console_waiter;
>
>  /**
> - * console_lock_spinning_enable - mark beginning of code where another
> + * console_lock_spinning_enable_irqsave - mark beginning of code where another
>   *	thread might safely busy wait
>   *
>   * This basically converts console_lock into a spinlock. This marks
>   * the section where the console_lock owner can not sleep, because
>   * there may be a waiter spinning (like a spinlock). Also it must be
>   * ready to hand over the lock at the end of the section.
> + *
> + * This disables interrupts because the hand over to a waiter must not be
> + * interrupted until the hand over is completed (@console_waiter is cleared).
>   */
> -static void console_lock_spinning_enable(void)
> +static void console_lock_spinning_enable_irqsave(unsigned long *flags)

I missed the prototype change for the !CONFIG_PRINTK case, resulting in:

linux/kernel/printk/printk.c:2707:3: error: implicit declaration of function ‘console_lock_spinning_enable_irqsave’; did you mean ‘console_lock_spinning_enable’? [-Werror=implicit-function-declaration]
   console_lock_spinning_enable_irqsave(&flags);
   ^~~~
   console_lock_spinning_enable

Will be fixed for v3.

(I have now officially added !CONFIG_PRINTK to my CI tests.)

John Ogness
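
The class of build error above comes from a function that has two definitions selected by a config option, where only one of them was renamed. A minimal sketch of the pattern, assuming a hypothetical switch CONFIG_DEMO_PRINTK and hypothetical demo_*() names (none of this is kernel code):

```c
#include <assert.h>

/*
 * Build with -DCONFIG_DEMO_PRINTK to select the "real" variant.
 * Both variants must carry the same name and prototype, otherwise
 * the shared caller below fails to compile in one configuration.
 */
#ifdef CONFIG_DEMO_PRINTK
static int demo_spinning_sections;

static void demo_spinning_enable_irqsave(unsigned long *flags)
{
	assert(flags);
	*flags = 1;	/* stand-in for the saved interrupt state */
	demo_spinning_sections++;
}
#else
/* Stub for the disabled configuration: same signature, no work. */
static inline void demo_spinning_enable_irqsave(unsigned long *flags)
{
	assert(flags);
	*flags = 0;
}
#endif

/* Shared caller, compiled in both configurations. */
static unsigned long demo_console_unlock(void)
{
	unsigned long flags;

	demo_spinning_enable_irqsave(&flags);
	return flags;
}
```

Compiling such a file once with and once without the switch defined is the kind of check that catches a stub whose name or prototype has fallen out of step.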
[PATCH printk v2 2/5] printk: remove safe buffers
With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console lock is left in place. This is because the console lock is needed for the actual printing. Signed-off-by: John Ogness --- Note: The follow-up patch removes the NMI tracking. arch/powerpc/kernel/traps.c| 1 - arch/powerpc/kernel/watchdog.c | 5 - include/linux/printk.h | 10 - kernel/kexec_core.c| 1 - kernel/panic.c | 3 - kernel/printk/internal.h | 17 -- kernel/printk/printk.c | 137 +- kernel/printk/printk_safe.c| 333 + lib/nmi_backtrace.c| 6 - 9 files changed, 56 insertions(+), 457 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index 3ec7b443fe6b..7d2b339afcb0 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -170,7 +170,6 @@ extern void panic_flush_kmsg_start(void) extern void panic_flush_kmsg_end(void) { - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); bust_spinlocks(0); debug_locks_off(); diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index af3c15a1d41e..8ae46c5945d0 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -181,11 +181,6 @@ static void watchdog_smp_panic(int cpu, u64 tb) wd_smp_unlock(&flags); - printk_safe_flush(); - /* -* printk_safe_flush() seems to require another print -* before anything actually goes out to console. 
-*/ if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace(); diff --git a/include/linux/printk.h b/include/linux/printk.h index fe7eb2351610..2476796c1150 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_flush(void); -extern void printk_safe_flush_on_panic(void); #else static inline __printf(1, 0) int vprintk(const char *s, va_list args) @@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char *log_lvl) static inline void dump_stack(void) { } - -static inline void printk_safe_flush(void) -{ -} - -static inline void printk_safe_flush_on_panic(void) -{ -} #endif extern int kptr_restrict; diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index a0b6780740c8..480d5f77ef4f 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ - printk_safe_flush_on_panic(); __crash_kexec(regs); /* diff --git a/kernel/panic.c b/kernel/panic.c index 332736a72a58..1f0df42f8d0c 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -247,7 +247,6 @@ void panic(const char *fmt, ...) * Bypass the panic_cpu check and call __crash_kexec directly. */ if (!_crash_kexec_post_notifiers) { - printk_safe_flush_on_panic(); __crash_kexec(NULL); /* @@ -271,8 +270,6 @@ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); - /* Call flush even twice. 
It tries harder with a single online CPU */ - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index 51615c909b2f..6cc35c5de890 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list args); void __printk_safe_enter(void); void __printk_safe_exit(void); -void printk_safe_init(void); bool printk_percpu_data_ready(void); #define printk_safe_enter_irqsave(flags) \ @@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void); local_irq_restore(flags); \ } while (0) -#define printk_safe_enter_irq()\ - do {\ - local
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-29, Petr Mladek wrote:
> I wonder if some console drivers rely on the fact that the write()
> callback is called with interrupts disabled.
>
> IMHO, it would be a bug when any write() callback expects that
> callers disabled the interrupts.

Agreed.

> Do you plan to remove the console-spinning stuff after offloading
> consoles to the kthreads?

Yes. Although a similar concept will be introduced to allow the
threaded printers and the atomic consoles to compete.

> Will you call console write() callback with irq enabled from the
> kthread?

No. That defeats the fundamental purpose of this entire rework
exercise. ;-)

> Anyway, we should at least add a comment why the interrupts are
> disabled.

I decided to move the local_irq_save/restore inside the console-spinning
functions and added a comment for v2.

John Ogness
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-29, John Ogness wrote:
>> Will you call console write() callback with irq enabled from the
>> kthread?
>
> No. That defeats the fundamental purpose of this entire rework
> exercise. ;-)

Sorry, I misread your question. The answer is "yes". We want to avoid a
local_irq_save() when calling into console->write().

John Ogness
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-23, Petr Mladek wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1142,8 +1126,6 @@ void __init setup_log_buf(int early)
>>  		 new_descs, ilog2(new_descs_count),
>>  		 new_infos);
>>
>> -	printk_safe_enter_irqsave(flags);
>> -
>>  	log_buf_len = new_log_buf_len;
>>  	log_buf = new_log_buf;
>>  	new_log_buf_len = 0;
>> @@ -1159,8 +1141,6 @@ void __init setup_log_buf(int early)
>>  	 */
>>  	prb = &printk_rb_dynamic;
>>
>> -	printk_safe_exit_irqrestore(flags);
>
> This will allow new messages to be added from the IRQ context while we
> are copying them to the new buffer. They might get lost in
> the small race window.
>
> Also the messages from NMI might get lost because they are no
> longer stored in the per-CPU buffer.
>
> A possible solution might be to do something like this:
>
> 	prb_for_each_record(0, &printk_rb_static, seq, &r)
> 		free -= add_to_rb(&printk_rb_dynamic, &r);
>
> 	prb = &printk_rb_dynamic;
>
> 	/*
> 	 * Copy the remaining messages that might have appeared
> 	 * from IRQ or NMI context after we ended copying and
> 	 * before we switched the buffers. They must be finalized
> 	 * because only one CPU is up at this stage.
> 	 */
> 	prb_for_each_record(seq, &printk_rb_static, seq, &r)
> 		free -= add_to_rb(&printk_rb_dynamic, &r);

OK. I'll probably rework it some and combine it with the "dropped" test
so that we can identify if messages were dropped during the transition
(because of static ringbuffer overrun).

>> -
>>  	if (seq != prb_next_seq(&printk_rb_static)) {
>>  		pr_err("dropped %llu messages\n",
>>  		       prb_next_seq(&printk_rb_static) - seq);
>> @@ -2666,7 +2631,6 @@ void console_unlock(void)
>>  		size_t ext_len = 0;
>>  		size_t len;
>>
>> -		printk_safe_enter_irqsave(flags);
>>  skip:
>>  		if (!prb_read_valid(prb, console_seq, &r))
>>  			break;
>> @@ -2711,6 +2675,8 @@ void console_unlock(void)
>>  				printk_time);
>>  		console_seq++;
>>
>> +		printk_safe_enter_irqsave(flags);
>
> What is the purpose of the printk_safe context here, please?

console_lock_spinning_enable() needs to be called with interrupts
disabled. I should have just used local_irq_save().

I could add local_irq_save() to console_lock_spinning_enable() and
restore them at the end of console_lock_spinning_disable_and_check(),
but then I would need to add a @flags argument to both functions. I
think it is simpler to just do the disable/enable from the caller,
console_unlock().

BTW, I could not find any sane way of disabling interrupts via a
raw_spin_lock_irqsave() of @console_owner_lock because of how it is
used with lockdep. In particular for
console_lock_spinning_disable_and_check().

John Ogness
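The two-pass copy suggested in the quoted review can be pictured with a toy model. This is a hedged userspace sketch only: the sketch_log type and all sketch_* functions are hypothetical, records are plain integers rather than ringbuffer records, and the "late writer" hook merely simulates a message arriving from IRQ or NMI context between the first copy and the buffer switch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RECORDS 16

/* Toy log: an append-only array indexed by sequence number. */
struct sketch_log {
	int      records[SKETCH_MAX_RECORDS];
	uint64_t next_seq;	/* sequence number of the next record */
};

/* Append one record; returns -1 if the toy log is full. */
static int sketch_store(struct sketch_log *log, int value)
{
	if (log->next_seq >= SKETCH_MAX_RECORDS)
		return -1;
	log->records[log->next_seq++] = value;
	return 0;
}

/* Copy all records at or after @seq from @src to @dst; return the new seq. */
static uint64_t sketch_copy_from(struct sketch_log *dst,
				 const struct sketch_log *src, uint64_t seq)
{
	while (seq < src->next_seq) {
		sketch_store(dst, src->records[seq]);
		seq++;
	}
	return seq;
}

static struct sketch_log sketch_static_log;
static struct sketch_log sketch_dynamic_log;
static struct sketch_log *sketch_active = &sketch_static_log;

/* Optional hook simulating a writer that fires inside the race window. */
static void (*sketch_late_writer)(void);

/* Example late writer: stores into whichever log is active right now. */
static void sketch_irq_message(void)
{
	sketch_store(sketch_active, 99);
}

/* Switch from the static to the dynamic log without losing stragglers. */
static uint64_t sketch_switch_buffers(void)
{
	uint64_t seq;

	/* First pass: copy everything stored so far. */
	seq = sketch_copy_from(&sketch_dynamic_log, &sketch_static_log, 0);

	/* Race window: a message may still land in the static log. */
	if (sketch_late_writer)
		sketch_late_writer();

	sketch_active = &sketch_dynamic_log;

	/* Second pass: pick up anything that arrived after the first pass. */
	return sketch_copy_from(&sketch_dynamic_log, &sketch_static_log, seq);
}
```

Comparing the returned sequence number with the static log's next sequence number is one way such a scheme could detect records lost to an overrun, in the spirit of the "dropped" test mentioned in the reply.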
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-22, Petr Mladek wrote:
> On Mon 2021-03-22 12:16:15, John Ogness wrote:
>> On 2021-03-21, Sergey Senozhatsky wrote:
>> >> @@ -369,7 +70,10 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args)
>> >>  	 * Use the main logbuf even in NMI. But avoid calling console
>> >>  	 * drivers that might have their own locks.
>> >>  	 */
>> >> -	if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
>> >> +	if (this_cpu_read(printk_context) &
>> >> +	    (PRINTK_NMI_DIRECT_CONTEXT_MASK |
>> >> +	     PRINTK_NMI_CONTEXT_MASK |
>> >> +	     PRINTK_SAFE_CONTEXT_MASK)) {
>> >
>> > Do we need printk_nmi_direct_enter/exit() and
>> > PRINTK_NMI_DIRECT_CONTEXT_MASK? Seems like all printk_safe() paths
>> > are now DIRECT - we store messages to the prb, but don't call console
>> > drivers.
>>
>> I was planning on waiting until the kthreads are introduced, in which
>> case printk_safe.c is completely removed.
>
> You want to keep printk_safe() context because it prevents calling
> consoles even in normal context. Namely, it prevents deadlock by
> recursively taking, for example, sem->lock in console_lock() or
> console_owner_lock in console_trylock_spinning(). Am I right?

Correct.

>> But I suppose I could switch
>> the 1 printk_nmi_direct_enter() user to printk_nmi_enter() so that
>> PRINTK_NMI_DIRECT_CONTEXT_MASK can be removed now. I would do this in a
>> 4th patch of the series.
>
> Yes, please unify the PRINTK_NMI_CONTEXT. One is enough.

Agreed. (But I'll go even further. See below.)

> I wonder if it would make sense to go even further at this stage.
> There will still be 4 contexts that modify the printk behavior after
> this patchset:
>
>   + printk_count set by printk_enter()/exit()
>       + prevents: infinite recursion
>       + context: any context
>       + action: skips entire printk at 3rd recursion level
>
>   + printk_context set by printk_safe_enter()/exit()
>       + prevents: deadlock caused by recursion into some
>                   console code in any context
>       + context: any
>       + action: skips console call at 1st recursion level

Technically, at this point printk_safe_enter() behavior is identical to
printk_nmi_enter(). Namely, prevent any recursive printk calls from
calling into the console code.

>   + printk_context set by printk_nmi_enter()/exit()
>       + prevents: deadlock caused by any console lock recursion
>       + context: NMI
>       + action: skips console calls at 0th recursion level
>
>   + kdb_trap_printk
>       + redirects printk() to kdb_printk() in kdb context
>
>
> What is possible?
>
> 1. We could get rid of printk_nmi_enter()/exit() and
>    PRINTK_NMI_CONTEXT completely already now. It is enough
>    to check in_nmi() in printk_func().
>
>    printk_nmi_enter() was added by the commit 42a0bb3f71383b457a7db362
>    ("printk/nmi: generic solution for safe printk in NMI"). It was
>    really needed to modify @printk_func pointer.
>
>    We did not remove it later when printk_function became a real
>    function. The idea was to track all printk contexts in a single
>    variable. But we never added kdb context.
>
>    It might make sense to remove it now. Peter Zijlstra would be happy.
>    There already were some churns with tracking printk_context in NMI.
>    For example, see
>    https://lore.kernel.org/r/20200219150744.428764...@infradead.org
>
>    IMHO, it does not make sense to wait until the entire console-stuff
>    rework is done in this case.

Agreed. in_nmi() within vprintk_emit() is enough to detect if the
console code should be skipped:

    if (!in_sched && !in_nmi()) {
        ...
    }

> 2. I thought about unifying printk_safe_enter()/exit() and
>    printk_enter()/exit(). They both count recursion with
>    IRQs disabled, have similar name. But they are used
>    different way.
>
>    But better might be to rename printk_safe_enter()/exit() to
>    console_enter()/exit() or to printk_deferred_enter()/exit().
>    It would make more clear what it does now. And it might help
>    to better distinguish it from the new printk_enter()/exit().
>
>    This patchset actually splits the original printk_safe()
>    functionality into two:
>
>    + printk_count prevents infinite recursion
>    + printk_deferred_enter() defers console handling.
>
>    I am not sure if it is worth it. But it might help people (even me)
>    when digging into the printk history. Different name will help to
>    understand the functionality at the given
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-21, Sergey Senozhatsky wrote:
>> @@ -369,7 +70,10 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args)
>>  	 * Use the main logbuf even in NMI. But avoid calling console
>>  	 * drivers that might have their own locks.
>>  	 */
>> -	if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
>> +	if (this_cpu_read(printk_context) &
>> +	    (PRINTK_NMI_DIRECT_CONTEXT_MASK |
>> +	     PRINTK_NMI_CONTEXT_MASK |
>> +	     PRINTK_SAFE_CONTEXT_MASK)) {
>
> Do we need printk_nmi_direct_enter/exit() and
> PRINTK_NMI_DIRECT_CONTEXT_MASK? Seems like all printk_safe() paths
> are now DIRECT - we store messages to the prb, but don't call console
> drivers.

I was planning on waiting until the kthreads are introduced, in which
case printk_safe.c is completely removed. But I suppose I could switch
the 1 printk_nmi_direct_enter() user to printk_nmi_enter() so that
PRINTK_NMI_DIRECT_CONTEXT_MASK can be removed now. I would do this in a
4th patch of the series.

John Ogness
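The behavior discussed here (always store the message, but skip the direct console call whenever any of the three context bits is set) can be sketched in a few lines of userspace C. The bit values, the ctx_* names, and the counters are assumptions for illustration; the kernel's real masks and its irq_work-based deferral are not reproduced.

```c
/* Hypothetical context bits; the kernel's actual mask values differ. */
#define CTX_SAFE        0x1u
#define CTX_NMI         0x2u
#define CTX_NMI_DIRECT  0x4u
#define CTX_DEFER_MASK  (CTX_SAFE | CTX_NMI | CTX_NMI_DIRECT)

static unsigned int ctx_context;  /* stands in for the per-CPU printk_context */
static int ctx_stored;            /* messages stored to the (imaginary) log */
static int ctx_direct;            /* console calls made directly */
static int ctx_deferred;          /* console work handed off for later */

/* Store the message in every context; only the console step varies. */
static void ctx_vprintk(void)
{
	ctx_stored++;

	if (ctx_context & CTX_DEFER_MASK)
		ctx_deferred++;	/* stands in for queuing deferred console work */
	else
		ctx_direct++;	/* stands in for calling the console drivers */
}
```

Because all three bits lead to the same branch, collapsing them into fewer masks (as proposed in the reply) would not change what this model does.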
Re: [PATCH 3/3] printk: Use %zu to format size_t
On 2021-03-17, Geert Uytterhoeven wrote:
> When compiling for 32-bit:
>
> util_lib/elf_info.c: In function ‘dump_dmesg_lockless’:
> util_lib/elf_info.c:1095:39: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
>  1095 | fprintf(stderr, "Failed to malloc %lu bytes for prb: %s\n",
>       | ~~^
>       | |
>       | long unsigned int
>       | %u
>  1096 | printk_ringbuffer_sz, strerror(errno));
>       |
>       | |
>       | size_t {aka unsigned int}
> util_lib/elf_info.c:1101:49: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=]
>  1101 | fprintf(stderr, "Failed to read prb of size %lu bytes: %s\n",
>       | ~~^
>       | |
>       | long unsigned int
>       | %u
>  1102 | printk_ringbuffer_sz, strerror(errno));
>       |
>       | |
>       | size_t {aka unsigned int}
>
> Indeed, "size_t" is "unsigned int" on 32-bit platforms, and "unsigned
> long" on 64-bit platforms.
>
> Fix this by formatting using "%zu".
>
> Fixes: 4149df9005f2cdd2 ("printk: add support for lockless ringbuffer")
> Signed-off-by: Geert Uytterhoeven

Reviewed-by: John Ogness

> ---
> util_lib/elf_info.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
> index 7c0a2c345379a7ca..676926ca8c5f3766 100644
> --- a/util_lib/elf_info.c
> +++ b/util_lib/elf_info.c
> @@ -1092,13 +1092,13 @@ static void dump_dmesg_lockless(int fd, void (*handler)(char*, unsigned int))
>  	kaddr = read_file_pointer(fd, vaddr_to_offset(prb_vaddr));
>  	m.prb = calloc(1, printk_ringbuffer_sz);
>  	if (!m.prb) {
> -		fprintf(stderr, "Failed to malloc %lu bytes for prb: %s\n",
> +		fprintf(stderr, "Failed to malloc %zu bytes for prb: %s\n",
>  			printk_ringbuffer_sz, strerror(errno));
>  		exit(64);
>  	}
>  	ret = pread(fd, m.prb, printk_ringbuffer_sz, vaddr_to_offset(kaddr));
>  	if (ret != printk_ringbuffer_sz) {
> -		fprintf(stderr, "Failed to read prb of size %lu bytes: %s\n",
> +		fprintf(stderr, "Failed to read prb of size %zu bytes: %s\n",
>  			printk_ringbuffer_sz, strerror(errno));
>  		exit(65);
>  	}
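As a small standalone illustration of the fix in this patch: the C99 "z" length modifier makes the conversion match size_t whatever its underlying type is, so one format string serves both 32-bit and 64-bit builds. The helper name below is hypothetical and the message text is shortened from the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format an allocation-failure message for @nbytes into @buf.
 * %zu is the conversion for size_t, so no cast is needed on any ABI.
 * Returns the number of characters snprintf() wanted to write.
 */
static int sketch_format_alloc_error(char *buf, size_t buflen, size_t nbytes)
{
	return snprintf(buf, buflen, "Failed to malloc %zu bytes for prb",
			nbytes);
}
```

The alternative of casting to unsigned long and keeping %lu also compiles cleanly, but it would truncate on platforms where size_t is wider than long, which is why %zu is the more direct choice.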
Re: [PATCH 2/3] printk: Use ULL suffix for 64-bit constants
On 2021-03-17, Geert Uytterhoeven wrote:
> When compiling for 32-bit:
>
> util_lib/elf_info.c: In function ‘get_desc_state’:
> util_lib/elf_info.c:923:31: warning: left shift count >= width of type [-Wshift-count-overflow]
>   923 | #define DESC_FLAGS_MASK (3UL << DESC_FLAGS_SHIFT)
>       | ^~
> util_lib/elf_info.c:925:25: note: in expansion of macro ‘DESC_FLAGS_MASK’
>   925 | #define DESC_ID_MASK (~DESC_FLAGS_MASK)
>       | ^~~
> util_lib/elf_info.c:926:30: note: in expansion of macro ‘DESC_ID_MASK’
>   926 | #define DESC_ID(sv) ((sv) & DESC_ID_MASK)
>       | ^~~~
> util_lib/elf_info.c:947:12: note: in expansion of macro ‘DESC_ID’
>   947 | if (id != DESC_ID(state_val))
>       | ^~~
> util_lib/elf_info.c: In function ‘id_inc’:
> util_lib/elf_info.c:923:31: warning: left shift count >= width of type [-Wshift-count-overflow]
>   923 | #define DESC_FLAGS_MASK (3UL << DESC_FLAGS_SHIFT)
>       | ^~
> util_lib/elf_info.c:925:25: note: in expansion of macro ‘DESC_FLAGS_MASK’
>   925 | #define DESC_ID_MASK (~DESC_FLAGS_MASK)
>       | ^~~
> util_lib/elf_info.c:981:15: note: in expansion of macro ‘DESC_ID_MASK’
>   981 | return (id & DESC_ID_MASK);
>       | ^~~~
>
> Indeed, "unsigned long" constants are 32-bit on 32-bit platforms, and
> 64-bit on 64-bit platforms.
>
> Fix this by using a "ULL" suffix instead.
>
> Fixes: 4149df9005f2cdd2 ("printk: add support for lockless ringbuffer")
> Signed-off-by: Geert Uytterhoeven

Reviewed-by: John Ogness

> ---
> util_lib/elf_info.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c
> index 2f23a448da08ebdd..7c0a2c345379a7ca 100644
> --- a/util_lib/elf_info.c
> +++ b/util_lib/elf_info.c
> @@ -920,8 +920,8 @@ enum desc_state {
>
>  #define DESC_SV_BITS (sizeof(uint64_t) * 8)
>  #define DESC_FLAGS_SHIFT (DESC_SV_BITS - 2)
> -#define DESC_FLAGS_MASK (3UL << DESC_FLAGS_SHIFT)
> -#define DESC_STATE(sv) (3UL & (sv >> DESC_FLAGS_SHIFT))
> +#define DESC_FLAGS_MASK (3ULL << DESC_FLAGS_SHIFT)
> +#define DESC_STATE(sv) (3ULL & (sv >> DESC_FLAGS_SHIFT))
>  #define DESC_ID_MASK (~DESC_FLAGS_MASK)
>  #define DESC_ID(sv) ((sv) & DESC_ID_MASK)
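To make the width problem concrete: the shift count here is 62, which is only valid when the shifted constant is at least 64 bits wide. The sketch below mirrors the patched macros under SKETCH_* names (hypothetical, chosen to avoid implying this is the file's code) and wraps them in small functions so the results can be checked on any platform.

```c
#include <stdint.h>

#define SKETCH_SV_BITS      (sizeof(uint64_t) * 8)
#define SKETCH_FLAGS_SHIFT  (SKETCH_SV_BITS - 2)

/* 3ULL is at least 64 bits wide, so shifting it by 62 is well defined. */
#define SKETCH_FLAGS_MASK   (3ULL << SKETCH_FLAGS_SHIFT)
#define SKETCH_ID_MASK      (~SKETCH_FLAGS_MASK)

/* The two top bits of a 64-bit state value. */
static uint64_t sketch_flags_mask(void)
{
	return SKETCH_FLAGS_MASK;
}

/* Everything below the two state bits is the descriptor ID. */
static uint64_t sketch_desc_id(uint64_t sv)
{
	return sv & SKETCH_ID_MASK;
}

/* Extract the 2-bit state field from the top of the value. */
static unsigned int sketch_desc_state(uint64_t sv)
{
	return (unsigned int)(3ULL & (sv >> SKETCH_FLAGS_SHIFT));
}
```

With a 3UL constant on a 32-bit target, the same shift exceeds the type width, which is exactly what -Wshift-count-overflow reports in the quoted build log.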
[PATCH next v1 2/3] printk: remove safe buffers
With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console lock is left in place. This is because the console lock is needed for the actual printing. Signed-off-by: John Ogness --- arch/powerpc/kernel/traps.c| 1 - arch/powerpc/kernel/watchdog.c | 5 - include/linux/printk.h | 10 - kernel/kexec_core.c| 1 - kernel/panic.c | 3 - kernel/printk/internal.h | 2 - kernel/printk/printk.c | 81 ++-- kernel/printk/printk_safe.c| 332 + lib/nmi_backtrace.c| 6 - 9 files changed, 18 insertions(+), 423 deletions(-) diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index a44a30b0688c..5828c83eaca6 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -171,7 +171,6 @@ extern void panic_flush_kmsg_start(void) extern void panic_flush_kmsg_end(void) { - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); bust_spinlocks(0); debug_locks_off(); diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index c9a8f4781a10..dc17d8903d4f 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb) wd_smp_unlock(&flags); - printk_safe_flush(); - /* -* printk_safe_flush() seems to require another print -* before anything actually goes out to console. 
-*/ if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace(); diff --git a/include/linux/printk.h b/include/linux/printk.h index fe7eb2351610..2476796c1150 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_flush(void); -extern void printk_safe_flush_on_panic(void); #else static inline __printf(1, 0) int vprintk(const char *s, va_list args) @@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char *log_lvl) static inline void dump_stack(void) { } - -static inline void printk_safe_flush(void) -{ -} - -static inline void printk_safe_flush_on_panic(void) -{ -} #endif extern int kptr_restrict; diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index f04d04d1b855..64bf5d5cdd06 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs) old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu); if (old_cpu == PANIC_CPU_INVALID) { /* This is the 1st CPU which comes here, so go ahead. */ - printk_safe_flush_on_panic(); __crash_kexec(regs); /* diff --git a/kernel/panic.c b/kernel/panic.c index 332736a72a58..1f0df42f8d0c 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -247,7 +247,6 @@ void panic(const char *fmt, ...) * Bypass the panic_cpu check and call __crash_kexec directly. */ if (!_crash_kexec_post_notifiers) { - printk_safe_flush_on_panic(); __crash_kexec(NULL); /* @@ -271,8 +270,6 @@ void panic(const char *fmt, ...) */ atomic_notifier_call_chain(&panic_notifier_list, 0, buf); - /* Call flush even twice. 
It tries harder with a single online CPU */ - printk_safe_flush_on_panic(); kmsg_dump(KMSG_DUMP_PANIC); /* diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index e7acc2888c8e..e108b2ece8c7 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -23,7 +23,6 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args); void __printk_safe_enter(void); void __printk_safe_exit(void); -void printk_safe_init(void); bool printk_percpu_data_ready(void); #define printk_safe_enter_irqsave(flags) \ @@ -67,6 +66,5 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args) { return 0; } #define printk_safe_enter_irq() local_irq_disable() #define printk_safe_exit_irq() local_irq_enable() -static inline void printk_safe_init(void) { } static inline bool printk_percpu_data_ready(void) { return false; } #endif /* CONFIG_PRINTK */ diff --
[PATCH next v1 0/3] printk: remove safe buffers
Hello,

Here is v1 of a series to remove the safe buffers. They are no longer
needed because messages can be stored directly into the log buffer from
any context.

However, the safe buffers also provided a form of recursion protection.
For that reason, explicit recursion protection is also implemented for
this series.

This series falls in line with the printk-rework plan as presented [0]
at Linux Plumbers in Lisbon 2019.

This series is based on next-20210316.

John Ogness

[0] https://linuxplumbersconf.org/event/4/contributions/290/attachments/276/463/lpc2019_jogness_printk.pdf (slide 23)

John Ogness (3):
  printk: track/limit recursion
  printk: remove safe buffers
  printk: convert @syslog_lock to spin_lock

 arch/powerpc/kernel/traps.c    |   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 include/linux/printk.h         |  10 -
 kernel/kexec_core.c            |   1 -
 kernel/panic.c                 |   3 -
 kernel/printk/internal.h       |   2 -
 kernel/printk/printk.c         | 171 +
 kernel/printk/printk_safe.c    | 332 +
 lib/nmi_backtrace.c            |   6 -
 9 files changed, 100 insertions(+), 431 deletions(-)

-- 
2.20.1
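The "explicit recursion protection" mentioned in this cover letter can be pictured as a nesting counter that refuses entry beyond a fixed depth. The following is a hedged userspace sketch, not the series' implementation: the rec_* names, the limit of 3, and the single global counter (a kernel version would need per-CPU or per-context state) are all assumptions.

```c
#include <stdbool.h>

/* Assumed limit: the outermost call plus this many nested calls may run. */
#define REC_MAX_RECURSION 3

static int rec_depth;    /* current nesting depth */
static int rec_stored;   /* messages that got through */
static int rec_dropped;  /* messages refused by the recursion limit */

/* Try to enter the logging path; fails once the nesting is too deep. */
static bool rec_enter(void)
{
	if (rec_depth > REC_MAX_RECURSION)
		return false;
	rec_depth++;
	return true;
}

static void rec_exit(void)
{
	rec_depth--;
}

/*
 * Store one message. @nested simulates the store path itself logging
 * again, which is the situation the limit protects against.
 */
static void rec_log(int nested)
{
	if (!rec_enter()) {
		rec_dropped++;
		return;
	}

	rec_stored++;
	if (nested > 0)
		rec_log(nested - 1);

	rec_exit();
}
```

A request for ten levels of nesting therefore stores four messages, drops one, and returns with the depth counter back at zero, so later top-level calls are unaffected.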
Re: Issue in dmesg time with lockless ring buffer
On 2021-01-22, "J. Avila" wrote:
> When doing some internal testing on a 5.10.4 kernel, we found that the
> time taken for dmesg seemed to increase from the order of milliseconds
> to the order of seconds when the dmesg size approached the ~1.2MB
> limit. After doing some digging, we found that by reverting all of the
> patches in printk/ up to and including
> 896fbe20b4e2333fb55cc9b9b783ebcc49eee7c7 ("use the lockless
> ringbuffer"), we were able to once more see normal dmesg times.
>
> This kernel had no meaningful diffs in the printk/ dir when compared
> to Linus' tree. This behavior was consistently reproducible using the
> following steps:
>
> 1) In one shell, run "time dmesg > /dev/null"
> 2) In another, constantly write to /dev/kmsg
>
> Within ~5 minutes, we saw that dmesg times increased to 1 second, only
> increasing further from there. Is this a known issue?

The last couple days I have tried to reproduce this issue with no
success.

Is your dmesg using /dev/kmsg or syslog() to read the buffer?

Are there any syslog daemons or systemd running? Perhaps you can run
your test within an initrd to see if this effect is still visible?

John Ogness
RE: [PATCH] makedumpfile: printk: add support for lockless ringbuffer
On 2020-11-24, HAGIO KAZUHITO(萩尾 一仁) wrote: >> After looking more closely, I see that your patch is still using the >> old state flags. With the current version, there is now a value-based >> state field. > > Thank you for pointing it out! Could you submit a follow-up patch? I have attached a follow-up patch. It is pretty much the exact same patch as the one I sent for "crash". John Ogness >From 58396867cb3bfd1ca060cf5eb3a910d7f8c192c2 Mon Sep 17 00:00:00 2001 From: John Ogness Date: Wed, 25 Nov 2020 10:10:31 +0106 Subject: [PATCH] printk: use committed/finalized state values The ringbuffer entries use 2 state values (committed and finalized) rather than a single flag to represent being available for reading. Copy the definitions and state lookup function directly from the kernel source and use the new states. Signed-off-by: John Ogness --- printk.c | 48 +--- 1 file changed, 41 insertions(+), 7 deletions(-) diff --git a/printk.c b/printk.c index 8e00901..9cecbd1 100644 --- a/printk.c +++ b/printk.c @@ -1,12 +1,6 @@ #include "makedumpfile.h" #include -#define DESC_SV_BITS (sizeof(unsigned long) * 8) -#define DESC_COMMITTED_MASK (1UL << (DESC_SV_BITS - 1)) -#define DESC_REUSE_MASK (1UL << (DESC_SV_BITS - 2)) -#define DESC_FLAGS_MASK (DESC_COMMITTED_MASK | DESC_REUSE_MASK) -#define DESC_ID_MASK (~DESC_FLAGS_MASK) - /* convenience struct for passing many values to helper functions */ struct prb_map { char *prb; @@ -21,12 +15,51 @@ struct prb_map { char *text_data; }; +/* + * desc_state and DESC_* definitions taken from kernel source: + * + * kernel/printk/printk_ringbuffer.h + */ + +/* The possible responses of a descriptor state-query. 
*/ +enum desc_state { + desc_miss = -1, /* ID mismatch (pseudo state) */ + desc_reserved = 0x0, /* reserved, in use by writer */ + desc_committed = 0x1, /* committed by writer, could get reopened */ + desc_finalized = 0x2, /* committed, no further modification allowed */ + desc_reusable = 0x3, /* free, not yet used by any writer */ +}; + +#define DESC_SV_BITS (sizeof(unsigned long) * 8) +#define DESC_FLAGS_SHIFT (DESC_SV_BITS - 2) +#define DESC_FLAGS_MASK (3UL << DESC_FLAGS_SHIFT) +#define DESC_STATE(sv) (3UL & (sv >> DESC_FLAGS_SHIFT)) +#define DESC_ID_MASK (~DESC_FLAGS_MASK) +#define DESC_ID(sv) ((sv) & DESC_ID_MASK) + +/* + * get_desc_state() taken from kernel source: + * + * kernel/printk/printk_ringbuffer.c + */ + +/* Query the state of a descriptor. */ +static enum desc_state get_desc_state(unsigned long id, + unsigned long state_val) +{ + if (id != DESC_ID(state_val)) + return desc_miss; + + return DESC_STATE(state_val); +} + static void dump_record(struct prb_map *m, unsigned long id) { unsigned long long ts_nsec; unsigned long state_var; unsigned short text_len; + enum desc_state state; unsigned long begin; unsigned long next; char buf[BUFSIZE]; @@ -45,7 +78,8 @@ dump_record(struct prb_map *m, unsigned long id) /* skip non-committed record */ state_var = ULONG(desc + OFFSET(prb_desc.state_var) + OFFSET(atomic_long_t.counter)); - if ((state_var & DESC_FLAGS_MASK) != DESC_COMMITTED_MASK) + state = get_desc_state(id, state_var); + if (state != desc_committed && state != desc_finalized) return; begin = ULONG(desc + OFFSET(prb_desc.text_blk_lpos) + OFFSET(prb_data_blk_lpos.begin)) % -- 2.20.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
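The filter introduced by this patch (print a record only if its descriptor is committed or finalized, and only if the ID still matches) can be exercised in isolation. This is a hedged sketch modelled on the definitions in the patch: names carry a sketch_ prefix to mark them as illustrative, and sketch_make_sv() is a hypothetical helper added only to build test values.

```c
#include <stdbool.h>

/* Descriptor states, mirroring the values listed in the patch. */
enum sketch_desc_state {
	sketch_desc_miss      = -1,	/* ID mismatch (pseudo state) */
	sketch_desc_reserved  = 0x0,
	sketch_desc_committed = 0x1,
	sketch_desc_finalized = 0x2,
	sketch_desc_reusable  = 0x3,
};

#define SKETCH_SV_BITS      (sizeof(unsigned long) * 8)
#define SKETCH_FLAGS_SHIFT  (SKETCH_SV_BITS - 2)
#define SKETCH_FLAGS_MASK   (3UL << SKETCH_FLAGS_SHIFT)
#define SKETCH_ID(sv)       ((sv) & ~SKETCH_FLAGS_MASK)
#define SKETCH_STATE(sv)    (3UL & ((sv) >> SKETCH_FLAGS_SHIFT))

/* Hypothetical helper: pack an ID and a state into one state value. */
static unsigned long sketch_make_sv(unsigned long id,
				    enum sketch_desc_state state)
{
	return SKETCH_ID(id) | ((unsigned long)state << SKETCH_FLAGS_SHIFT);
}

/* Decode the state, reporting a miss if the ID does not match. */
static enum sketch_desc_state sketch_get_state(unsigned long id,
					       unsigned long sv)
{
	if (id != SKETCH_ID(sv))
		return sketch_desc_miss;

	return (enum sketch_desc_state)SKETCH_STATE(sv);
}

/* A record is worth printing only when committed or finalized. */
static bool sketch_record_is_readable(unsigned long id, unsigned long sv)
{
	enum sketch_desc_state state = sketch_get_state(id, sv);

	return state == sketch_desc_committed ||
	       state == sketch_desc_finalized;
}
```

Treating the state as a 2-bit value rather than two independent flags is what lets a single comparison distinguish all four cases, which the old mask-based test could not do.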
[PATCH] printk: add support for lockless ringbuffer
Linux 5.10 moved to a new lockless ringbuffer. The new ringbuffer is structured completely different to the previous iterations. Add support for retrieving the ringbuffer using vmcoreinfo. The new ringbuffer is detected based on the availability of the "prb" symbol. Signed-off-by: John Ogness --- util_lib/elf_info.c | 438 +++- 1 file changed, 437 insertions(+), 1 deletion(-) diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c index 7803a94..2f23a44 100644 --- a/util_lib/elf_info.c +++ b/util_lib/elf_info.c @@ -27,6 +27,32 @@ static int num_pt_loads; static char osrelease[4096]; +/* VMCOREINFO symbols for lockless printk ringbuffer */ +static loff_t prb_vaddr; +static size_t printk_ringbuffer_sz; +static size_t prb_desc_sz; +static size_t printk_info_sz; +static uint64_t printk_ringbuffer_desc_ring_offset; +static uint64_t printk_ringbuffer_text_data_ring_offset; +static uint64_t prb_desc_ring_count_bits_offset; +static uint64_t prb_desc_ring_descs_offset; +static uint64_t prb_desc_ring_infos_offset; +static uint64_t prb_data_ring_size_bits_offset; +static uint64_t prb_data_ring_data_offset; +static uint64_t prb_desc_ring_head_id_offset; +static uint64_t prb_desc_ring_tail_id_offset; +static uint64_t atomic_long_t_counter_offset; +static uint64_t prb_desc_state_var_offset; +static uint64_t prb_desc_info_offset; +static uint64_t prb_desc_text_blk_lpos_offset; +static uint64_t prb_data_blk_lpos_begin_offset; +static uint64_t prb_data_blk_lpos_next_offset; +static uint64_t printk_info_seq_offset; +static uint64_t printk_info_caller_id_offset; +static uint64_t printk_info_ts_nsec_offset; +static uint64_t printk_info_level_offset; +static uint64_t printk_info_text_len_offset; + static loff_t log_buf_vaddr; static loff_t log_end_vaddr; static loff_t log_buf_len_vaddr; @@ -304,6 +330,7 @@ void scan_vmcoreinfo(char *start, size_t size) size_t len; loff_t *vaddr; } symbol[] = { + SYMBOL(prb), SYMBOL(log_buf), SYMBOL(log_end), SYMBOL(log_buf_len), @@ -361,6 +388,119 @@ 
void scan_vmcoreinfo(char *start, size_t size) *symbol[i].vaddr = vaddr; } + str = "SIZE(printk_ringbuffer)="; + if (memcmp(str, pos, strlen(str)) == 0) + printk_ringbuffer_sz = strtoull(pos + strlen(str), + NULL, 10); + + str = "SIZE(prb_desc)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_desc_sz = strtoull(pos + strlen(str), NULL, 10); + + str = "SIZE(printk_info)="; + if (memcmp(str, pos, strlen(str)) == 0) + printk_info_sz = strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(printk_ringbuffer.desc_ring)="; + if (memcmp(str, pos, strlen(str)) == 0) + printk_ringbuffer_desc_ring_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(printk_ringbuffer.text_data_ring)="; + if (memcmp(str, pos, strlen(str)) == 0) + printk_ringbuffer_text_data_ring_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_desc_ring.count_bits)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_desc_ring_count_bits_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_desc_ring.descs)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_desc_ring_descs_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_desc_ring.infos)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_desc_ring_infos_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_data_ring.size_bits)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_data_ring_size_bits_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_data_ring.data)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_data_ring_data_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_desc_ring.head_id)="; + if (memcmp(str, pos, strlen(str)) == 0) + prb_desc_ring_head_id_offset = + strtoull(pos + strlen(str), NULL, 10); + + str = "OFFSET(prb_desc_
RE: [PATCH] makedumpfile: printk: add support for lockless ringbuffer
Hi Kazu,

On 2020-11-20, HAGIO KAZUHITO(萩尾 一仁) wrote:
> Thank you for confirming and testing.
> I will merge this after a few slight fixes and more tests.

After looking more closely, I see that your patch is still using the old
state flags. With the current version, there is now a value-based state
field. Both state values 1 (committed) and 2 (finalized) are valid for
printing.

Should I submit a follow-up patch? Or are these the "slight fixes" you
are referring to?

John Ogness
Re: [PATCH] makedumpfile: printk: add support for lockless ringbuffer
On 2020-11-19, HAGIO KAZUHITO(萩尾 一仁) wrote:
> From: John Ogness
>
> Linux 5.10 introduces a new lockless ringbuffer. The new ringbuffer
> is structured completely different to the previous iterations.
> Add support for retrieving the ringbuffer from debug information
> and/or using vmcoreinfo. The new ringbuffer is detected based on
> the availability of the "prb" symbol.
>
> Signed-off-by: John Ogness
> Signed-off-by: Kazuhito Hagio
> ---
> I've updated John's RFC makedumpfile patch to match 5.10-rc4 kernel.
> Changes from the RFC patch:
> - followed the following kernel commit
>   cfe2790b163a ("printk: move printk_info into separate array")
> - divided members of struct printk_log in offset_table into each structure
>   for readability
> - added some error handlings
> - also dump head record that was missed

I confirm that these changes are correct. Thanks for updating this,
adding the needed error handling, and catching that the head record was
missed!

I tested this by:

1. Boot kernel with: crashkernel=512M

2. Setup and trigger crash:

   kexec -p /boot/bzImage --initrd=/boot/rescue-initrd --append="console=ttyS0,115200"
   echo c > /proc/sysrq-trigger

3. From rescue environment, copy crashed vmcore to external machine:

   cp /proc/vmcore /remote/nfs/mount/

4. From external machine, extract kernel log using vmcoreinfo:

   makedumpfile -g ./vmcoreinfo -x ./vmlinux
   makedumpfile --dump-dmesg -i ./vmcoreinfo ./vmcore dmesg1.txt

5. From external machine, extract kernel log using debug symbols:

   makedumpfile --dump-dmesg -x ./vmlinux ./vmcore dmesg2.txt

6. Compare and inspect the kernel logs:

   diff dmesg1.txt dmesg2.txt
   cat dmesg1.txt

John Ogness
Re: [PATCH printk v5 6/6] printk: reimplement log_cont using record extension
On 2020-09-25, Marek Szyprowski wrote:
> This patch landed recently in linux-next as commit f5f022e53b87
> ("printk: reimplement log_cont using record extension"). I've noticed
> that it causes a regression on my test system (ARM 32bit Samsung Exynos
> 4412-based Trats2 board). The messages are printed correctly on the
> serial console during boot, but then when I run 'dmesg' command, the log
> is truncated.
>
> Here is are the last lines of the dmesg log after this patch:
>
> [ 6.649018] Waiting 2 sec before mounting root device...
> [ 6.766423] dwc2 1248.hsotg: new device is high-speed
> [ 6.845290] dwc2 1248.hsotg: new device is high-speed
> [ 6.914217] dwc2 1248.hsotg: new address 51
> [ 8.710351] RAMDISK: squashfs filesystem found at block 0
>
> The corresponding dmesg lines before applying this patch:
>
> [ 8.864320] RAMDISK: squashfs filesystem found at block 0
> [ 8.868410] RAMDISK: Loading 37692KiB [1 disk] into ram disk... /
> [ 9.071670] /
> [ 9.262498] /
> [ 9.540711] /
> [ 9.818031] done.

Ah. One of the more creative printk users...

init/do_mounts_rd.c:rd_load_image()

This is a set of LOG_CONT messages that try to display a rotating line,
complete with '\b' control characters. The code is totally broken, but
that is no excuse for printk to break.

It should be easy to reproduce on any architecture. I will investigate
it further.

Thanks for reporting.

John Ogness
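For anyone who has not looked at that code, the following userspace sketch shows the kind of byte stream a rotating-line printer produces. It is not the code from init/do_mounts_rd.c. The frame table, the order of frame and backspace, and the function name are assumptions chosen only to illustrate why such output is awkward for a log buffer: every update is a tiny continuation fragment that carries a control character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed frame set for the rotating line. */
static const char rotator[4] = { '|', '/', '-', '\\' };

/*
 * Hypothetical helper: append up to @steps spinner updates to @buf.
 * Each update is one frame character followed by '\b', so a terminal
 * overwrites the previous frame, while a log buffer stores every byte.
 * The result is always NUL-terminated. Returns the number of bytes
 * written, not counting the terminator.
 */
static size_t spin_frames(char *buf, size_t size, unsigned int steps)
{
    size_t len = 0;
    unsigned int i;

    assert(size > 0);

    /* Stop early if the next frame plus terminator would not fit. */
    for (i = 0; i < steps && len + 2 < size; i++) {
        buf[len++] = rotator[i & 0x3];
        buf[len++] = '\b';
    }
    buf[len] = '\0';

    return len;
}
```

On a serial console three updates look like a single character changing in place, but the stored text is six bytes long and none of them is a newline.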
[PATCH printk v4 2/3] printk: move dictionary keys to dev_printk_info
Dictionaries are only used for SUBSYSTEM and DEVICE properties. The current implementation stores the property names each time they are used. This requires more space than otherwise necessary. Also, because the dictionary entries are currently considered optional, it cannot be relied upon that they are always available, even if the writer wanted to store them. These issues will increase should new dictionary properties be introduced. Rather than storing the subsystem and device properties in the dict ring, introduce a struct dev_printk_info with separate fields to store only the property values. Embed this struct within the struct printk_info to provide guaranteed availability. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- Sorry. v3 did not include Petr's fixup correctly. @size was wrong. Now it is correct. Documentation/admin-guide/kdump/gdbmacros.txt | 73 drivers/base/core.c | 46 ++--- include/linux/dev_printk.h| 8 + include/linux/printk.h| 6 +- kernel/printk/internal.h | 4 +- kernel/printk/printk.c| 166 +- kernel/printk/printk_ringbuffer.h | 3 + kernel/printk/printk_safe.c | 2 +- scripts/gdb/linux/dmesg.py| 16 +- 9 files changed, 164 insertions(+), 160 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 94fabb165abf..82aecdcae8a6 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -172,13 +172,13 @@ end define dump_record set var $desc = $arg0 - if ($argc > 1) - set var $prev_flags = $arg1 + set var $info = $arg1 + if ($argc > 2) + set var $prev_flags = $arg2 else set var $prev_flags = 0 end - set var $info = &$desc->info set var $prefix = 1 set var $newline = 1 @@ -237,44 +237,36 @@ define dump_record # handle dictionary data - set var $begin = $desc->dict_blk_lpos.begin % (1U << prb->dict_data_ring.size_bits) - set var $next = $desc->dict_blk_lpos.next % (1U << prb->dict_data_ring.size_bits) - - # handle data-less 
record - if ($begin & 1) - set var $dict_len = 0 - set var $dict = "" - else - # handle wrapping data block - if ($begin > $next) - set var $begin = 0 - end - - # skip over descriptor id - set var $begin = $begin + sizeof(long) - - # handle truncated message - if ($next - $begin < $info->dict_len) - set var $dict_len = $next - $begin - else - set var $dict_len = $info->dict_len + set var $dict = &$info->dev_info.subsystem[0] + set var $dict_len = sizeof($info->dev_info.subsystem) + if ($dict[0] != '\0') + printf " SUBSYSTEM=" + set var $idx = 0 + while ($idx < $dict_len) + set var $c = $dict[$idx] + if ($c == '\0') + loop_break + else + if ($c < ' ' || $c >= 127 || $c == '\\') + printf "\\x%02x", $c + else + printf "%c", $c + end + end + set var $idx = $idx + 1 end - - set var $dict = &prb->dict_data_ring.data[$begin] + printf "\n" end - if ($dict_len > 0) + set var $dict = &$info->dev_info.device[0] + set var $dict_len = sizeof($info->dev_info.device) + if ($dict[0] != '\0') + printf " DEVICE=" set var $idx = 0 - set var $line = 1 while ($idx < $dict_len) - if ($line) - printf " " - set var $line = 0 - end set var $c = $dict[$idx] if ($c == '\0') - printf "\n" - set var $line = 1 + loop_break else if ($c < ' ' || $c >= 127 || $c == '\\') printf "\\x%02x", $c
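The character handling in the gdb macro hunk above can be restated in C for readers who find the macro syntax hard to follow. This is a sketch under assumptions: the function name and the output-buffer handling are invented for illustration, and only the escaping rule itself (stop at the first NUL or at the end of the fixed-size field, print control characters, bytes of 127 and above, and the backslash as \xNN) is taken from the macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy a fixed-size property field into @out,
 * escaping anything that is not plain printable ASCII. The field need
 * not be NUL-terminated, so @field_size bounds the scan. The result is
 * always NUL-terminated. Returns the length of the escaped string.
 */
static size_t escape_field(char *out, size_t out_size,
                           const char *field, size_t field_size)
{
    size_t len = 0;
    size_t i;

    assert(out_size > 0);

    for (i = 0; i < field_size && field[i] != '\0'; i++) {
        unsigned char c = (unsigned char)field[i];

        if (c < ' ' || c >= 127 || c == '\\') {
            /* "\xNN" needs 4 bytes plus the terminator. */
            if (len + 4 >= out_size)
                break;
            len += (size_t)snprintf(out + len, out_size - len,
                                    "\\x%02x", (unsigned int)c);
        } else {
            if (len + 1 >= out_size)
                break;
            out[len++] = (char)c;
        }
    }
    out[len] = '\0';

    return len;
}
```

Bounding the loop by the field size rather than trusting termination matches the macro, which iterates up to sizeof() of the field and breaks on '\0'.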
[PATCH printk v3 2/3] printk: move dictionary keys to dev_printk_info
Dictionaries are only used for SUBSYSTEM and DEVICE properties. The current implementation stores the property names each time they are used. This requires more space than otherwise necessary. Also, because the dictionary entries are currently considered optional, it cannot be relied upon that they are always available, even if the writer wanted to store them. These issues will increase should new dictionary properties be introduced. Rather than storing the subsystem and device properties in the dict ring, introduce a struct dev_printk_info with separate fields to store only the property values. Embed this struct within the struct printk_info to provide guaranteed availability. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- Added Petr's fixup for msg_add_dict_text() to include the prefix whitespace for dictionary properties. Thanks! Documentation/admin-guide/kdump/gdbmacros.txt | 73 drivers/base/core.c | 46 ++--- include/linux/dev_printk.h| 8 + include/linux/printk.h| 6 +- kernel/printk/internal.h | 4 +- kernel/printk/printk.c| 166 +- kernel/printk/printk_ringbuffer.h | 3 + kernel/printk/printk_safe.c | 2 +- scripts/gdb/linux/dmesg.py| 16 +- 9 files changed, 164 insertions(+), 160 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 94fabb165abf..82aecdcae8a6 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -172,13 +172,13 @@ end define dump_record set var $desc = $arg0 - if ($argc > 1) - set var $prev_flags = $arg1 + set var $info = $arg1 + if ($argc > 2) + set var $prev_flags = $arg2 else set var $prev_flags = 0 end - set var $info = &$desc->info set var $prefix = 1 set var $newline = 1 @@ -237,44 +237,36 @@ define dump_record # handle dictionary data - set var $begin = $desc->dict_blk_lpos.begin % (1U << prb->dict_data_ring.size_bits) - set var $next = $desc->dict_blk_lpos.next % (1U << prb->dict_data_ring.size_bits) 
- - # handle data-less record - if ($begin & 1) - set var $dict_len = 0 - set var $dict = "" - else - # handle wrapping data block - if ($begin > $next) - set var $begin = 0 - end - - # skip over descriptor id - set var $begin = $begin + sizeof(long) - - # handle truncated message - if ($next - $begin < $info->dict_len) - set var $dict_len = $next - $begin - else - set var $dict_len = $info->dict_len + set var $dict = &$info->dev_info.subsystem[0] + set var $dict_len = sizeof($info->dev_info.subsystem) + if ($dict[0] != '\0') + printf " SUBSYSTEM=" + set var $idx = 0 + while ($idx < $dict_len) + set var $c = $dict[$idx] + if ($c == '\0') + loop_break + else + if ($c < ' ' || $c >= 127 || $c == '\\') + printf "\\x%02x", $c + else + printf "%c", $c + end + end + set var $idx = $idx + 1 end - - set var $dict = &prb->dict_data_ring.data[$begin] + printf "\n" end - if ($dict_len > 0) + set var $dict = &$info->dev_info.device[0] + set var $dict_len = sizeof($info->dev_info.device) + if ($dict[0] != '\0') + printf " DEVICE=" set var $idx = 0 - set var $line = 1 while ($idx < $dict_len) - if ($line) - printf " " - set var $line = 0 - end set var $c = $dict[$idx] if ($c == '\0') - printf "\n" - set var $line = 1 + loop_break else if ($c < ' ' || $c >= 127 || $c == '\\') printf "\\x%02
[PATCH printk v2 1/3] printk: move printk_info into separate array
The majority of the size of a descriptor is taken up by meta data, which is often not of interest to the ringbuffer (for example, when performing state checks). Since descriptors are often temporarily stored on the stack, keeping their size minimal will help reduce stack pressure. Rather than embedding the printk_info into the descriptor, create a separate printk_info array. The index of a descriptor in the descriptor array corresponds to the printk_info with the same index in the printk_info array. The rules for validity of a printk_info match the existing rules for the data blocks: the descriptor must be in a consistent state. Signed-off-by: John Ogness --- kernel/printk/printk.c| 30 +-- kernel/printk/printk_ringbuffer.c | 145 +++--- kernel/printk/printk_ringbuffer.h | 29 +++--- 3 files changed, 133 insertions(+), 71 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 9a2e23191576..25cfe4fe48af 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -959,11 +959,11 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(prb_desc_ring); VMCOREINFO_OFFSET(prb_desc_ring, count_bits); VMCOREINFO_OFFSET(prb_desc_ring, descs); + VMCOREINFO_OFFSET(prb_desc_ring, infos); VMCOREINFO_OFFSET(prb_desc_ring, head_id); VMCOREINFO_OFFSET(prb_desc_ring, tail_id); VMCOREINFO_STRUCT_SIZE(prb_desc); - VMCOREINFO_OFFSET(prb_desc, info); VMCOREINFO_OFFSET(prb_desc, state_var); VMCOREINFO_OFFSET(prb_desc, text_blk_lpos); VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos); @@ -1097,11 +1097,13 @@ static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata; void __init setup_log_buf(int early) { + struct printk_info *new_infos; unsigned int new_descs_count; struct prb_desc *new_descs; struct printk_info info; struct printk_record r; size_t new_descs_size; + size_t new_infos_size; unsigned long flags; char *new_dict_buf; char *new_log_buf; @@ -1142,8 +1144,7 @@ void __init setup_log_buf(int early) if (unlikely(!new_dict_buf)) { pr_err("log_buf_len: 
%lu dict bytes not available\n", new_log_buf_len); - memblock_free(__pa(new_log_buf), new_log_buf_len); - return; + goto err_free_log_buf; } new_descs_size = new_descs_count * sizeof(struct prb_desc); @@ -1151,9 +1152,15 @@ void __init setup_log_buf(int early) if (unlikely(!new_descs)) { pr_err("log_buf_len: %zu desc bytes not available\n", new_descs_size); - memblock_free(__pa(new_dict_buf), new_log_buf_len); - memblock_free(__pa(new_log_buf), new_log_buf_len); - return; + goto err_free_dict_buf; + } + + new_infos_size = new_descs_count * sizeof(struct printk_info); + new_infos = memblock_alloc(new_infos_size, LOG_ALIGN); + if (unlikely(!new_infos)) { + pr_err("log_buf_len: %zu info bytes not available\n", + new_infos_size); + goto err_free_descs; } prb_rec_init_rd(&r, &info, @@ -1163,7 +1170,8 @@ void __init setup_log_buf(int early) prb_init(&printk_rb_dynamic, new_log_buf, ilog2(new_log_buf_len), new_dict_buf, ilog2(new_log_buf_len), -new_descs, ilog2(new_descs_count)); +new_descs, ilog2(new_descs_count), +new_infos); logbuf_lock_irqsave(flags); @@ -1192,6 +1200,14 @@ void __init setup_log_buf(int early) pr_info("log_buf_len: %u bytes\n", log_buf_len); pr_info("early log buf free: %u(%u%%)\n", free, (free * 100) / __LOG_BUF_LEN); + return; + +err_free_descs: + memblock_free(__pa(new_descs), new_descs_size); +err_free_dict_buf: + memblock_free(__pa(new_dict_buf), new_log_buf_len); +err_free_log_buf: + memblock_free(__pa(new_log_buf), new_log_buf_len); } static bool __read_mostly ignore_loglevel; diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index f4e2e9890e0f..de4b10a98623 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -15,10 +15,10 @@ * The printk_ringbuffer is made up of 3 internal ringbuffers: * * desc_ring - * A ring of descriptors. A descriptor contains all record meta data - * (sequence number, timestamp, loglevel, etc.) 
as well as internal state - * information about the record and logical positions specifying where in - * the other ringbuffers the text and dictionary strings are located. + * A ring of descriptors and their meta data (such as sequence number, + * timestamp, loglevel, etc.) as w
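The layout described in the commit message of this patch (a small descriptor array plus a parallel printk_info array that shares the same index) can be sketched in userspace as follows. The struct members, sizes and helper names here are illustrative assumptions and are not copied from printk_ringbuffer.h. The only property taken from the patch description is that a descriptor and its meta data are found at the same index of two separate arrays.

```c
#include <assert.h>
#include <stdint.h>

#define COUNT_BITS 3
#define DESC_COUNT (1u << COUNT_BITS)

/* Small descriptor: only what the ringbuffer itself needs (assumed members). */
struct rec_desc {
    uint64_t state_var;
    uint64_t text_begin;
    uint64_t text_next;
};

/* Record meta data, kept out of the descriptor (assumed members). */
struct rec_info {
    uint64_t seq;
    uint64_t ts_nsec;
    uint16_t text_len;
    uint8_t level;
};

struct desc_ring {
    unsigned int count_bits;
    struct rec_desc *descs;
    struct rec_info *infos;
};

/* Map an ever-increasing id onto an array index. */
static unsigned int ring_index(const struct desc_ring *ring, uint64_t id)
{
    return (unsigned int)(id & ((1u << ring->count_bits) - 1));
}

static struct rec_desc *to_desc(const struct desc_ring *ring, uint64_t id)
{
    return &ring->descs[ring_index(ring, id)];
}

/* Same index as to_desc(), but into the parallel meta data array. */
static struct rec_info *to_info(const struct desc_ring *ring, uint64_t id)
{
    return &ring->infos[ring_index(ring, id)];
}
```

Code that only performs state checks can copy a struct rec_desc to the stack without dragging the meta data along, which is the stack-pressure argument made in the commit message.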
[PATCH printk v2 3/3] printk: remove dict ring
Since there is no code that will ever store anything into the dict ring, remove it. If any future dictionary properties are to be added, these should be added to the struct printk_info. Signed-off-by: John Ogness --- kernel/printk/printk.c| 46 +++-- kernel/printk/printk_ringbuffer.c | 155 +++--- kernel/printk/printk_ringbuffer.h | 63 +++- 3 files changed, 64 insertions(+), 200 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 269f0abd1ddf..77660354a7c5 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -427,7 +427,6 @@ static u32 log_buf_len = __LOG_BUF_LEN; * Define the average message size. This only affects the number of * descriptors that will be available. Underestimating is better than * overestimating (too many available descriptors is better than not enough). - * The dictionary buffer will be the same size as the text buffer. */ #define PRB_AVGBITS 5 /* 32 character average length */ @@ -435,7 +434,7 @@ static u32 log_buf_len = __LOG_BUF_LEN; #error CONFIG_LOG_BUF_SHIFT value too small. 
#endif _DEFINE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS, -PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]); +PRB_AVGBITS, &__log_buf[0]); static struct printk_ringbuffer printk_rb_dynamic; @@ -502,12 +501,12 @@ static int log_store(u32 caller_id, int facility, int level, struct printk_record r; u16 trunc_msg_len = 0; - prb_rec_init_wr(&r, text_len, 0); + prb_rec_init_wr(&r, text_len); if (!prb_reserve(&e, prb, &r)) { /* truncate the message if it is too long for empty buffer */ truncate_msg(&text_len, &trunc_msg_len); - prb_rec_init_wr(&r, text_len + trunc_msg_len, 0); + prb_rec_init_wr(&r, text_len + trunc_msg_len); /* survive when the log buffer is too small for trunc_msg */ if (!prb_reserve(&e, prb, &r)) return 0; @@ -897,8 +896,7 @@ static int devkmsg_open(struct inode *inode, struct file *file) mutex_init(&user->lock); prb_rec_init_rd(&user->record, &user->info, - &user->text_buf[0], sizeof(user->text_buf), - NULL, 0); + &user->text_buf[0], sizeof(user->text_buf)); logbuf_lock_irq(); user->seq = prb_first_valid_seq(prb); @@ -956,7 +954,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(printk_ringbuffer); VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring); VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring); - VMCOREINFO_OFFSET(printk_ringbuffer, dict_data_ring); VMCOREINFO_OFFSET(printk_ringbuffer, fail); VMCOREINFO_STRUCT_SIZE(prb_desc_ring); @@ -969,7 +966,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(prb_desc); VMCOREINFO_OFFSET(prb_desc, state_var); VMCOREINFO_OFFSET(prb_desc, text_blk_lpos); - VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos); VMCOREINFO_STRUCT_SIZE(prb_data_blk_lpos); VMCOREINFO_OFFSET(prb_data_blk_lpos, begin); @@ -979,7 +975,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_OFFSET(printk_info, seq); VMCOREINFO_OFFSET(printk_info, ts_nsec); VMCOREINFO_OFFSET(printk_info, text_len); - VMCOREINFO_OFFSET(printk_info, dict_len); VMCOREINFO_OFFSET(printk_info, caller_id); VMCOREINFO_OFFSET(printk_info, 
dev_info); @@ -1080,7 +1075,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, struct prb_reserved_entry e; struct printk_record dest_r; - prb_rec_init_wr(&dest_r, r->info->text_len, 0); + prb_rec_init_wr(&dest_r, r->info->text_len); if (!prb_reserve(&e, rb, &dest_r)) return 0; @@ -,7 +1106,6 @@ void __init setup_log_buf(int early) size_t new_descs_size; size_t new_infos_size; unsigned long flags; - char *new_dict_buf; char *new_log_buf; unsigned int free; u64 seq; @@ -1146,19 +1140,12 @@ void __init setup_log_buf(int early) return; } - new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN); - if (unlikely(!new_dict_buf)) { - pr_err("log_buf_len: %lu dict bytes not available\n", - new_log_buf_len); - goto err_free_log_buf; - } - new_descs_size = new_descs_count * sizeof(struct prb_desc); new_descs = memblock_alloc(new_descs_size, LOG_ALIGN); if (unlikely(!new_descs)) { pr_err("log_buf_len: %zu desc bytes not available\n", new_descs_size); - goto err_free_dict_buf; +
[PATCH printk v2 2/3] printk: move dictionary keys to dev_printk_info
Dictionaries are only used for SUBSYSTEM and DEVICE properties. The current implementation stores the property names each time they are used. This requires more space than otherwise necessary. Also, because the dictionary entries are currently considered optional, it cannot be relied upon that they are always available, even if the writer wanted to store them. These issues will increase should new dictionary properties be introduced. Rather than storing the subsystem and device properties in the dict ring, introduce a struct dev_printk_info with separate fields to store only the property values. Embed this struct within the struct printk_info to provide guaranteed availability. Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 73 drivers/base/core.c | 46 ++--- include/linux/dev_printk.h| 8 + include/linux/printk.h| 6 +- kernel/printk/internal.h | 4 +- kernel/printk/printk.c| 165 +- kernel/printk/printk_ringbuffer.h | 3 + kernel/printk/printk_safe.c | 2 +- scripts/gdb/linux/dmesg.py| 16 +- 9 files changed, 163 insertions(+), 160 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 94fabb165abf..82aecdcae8a6 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -172,13 +172,13 @@ end define dump_record set var $desc = $arg0 - if ($argc > 1) - set var $prev_flags = $arg1 + set var $info = $arg1 + if ($argc > 2) + set var $prev_flags = $arg2 else set var $prev_flags = 0 end - set var $info = &$desc->info set var $prefix = 1 set var $newline = 1 @@ -237,44 +237,36 @@ define dump_record # handle dictionary data - set var $begin = $desc->dict_blk_lpos.begin % (1U << prb->dict_data_ring.size_bits) - set var $next = $desc->dict_blk_lpos.next % (1U << prb->dict_data_ring.size_bits) - - # handle data-less record - if ($begin & 1) - set var $dict_len = 0 - set var $dict = "" - else - # handle wrapping data block - if 
($begin > $next) - set var $begin = 0 - end - - # skip over descriptor id - set var $begin = $begin + sizeof(long) - - # handle truncated message - if ($next - $begin < $info->dict_len) - set var $dict_len = $next - $begin - else - set var $dict_len = $info->dict_len + set var $dict = &$info->dev_info.subsystem[0] + set var $dict_len = sizeof($info->dev_info.subsystem) + if ($dict[0] != '\0') + printf " SUBSYSTEM=" + set var $idx = 0 + while ($idx < $dict_len) + set var $c = $dict[$idx] + if ($c == '\0') + loop_break + else + if ($c < ' ' || $c >= 127 || $c == '\\') + printf "\\x%02x", $c + else + printf "%c", $c + end + end + set var $idx = $idx + 1 end - - set var $dict = &prb->dict_data_ring.data[$begin] + printf "\n" end - if ($dict_len > 0) + set var $dict = &$info->dev_info.device[0] + set var $dict_len = sizeof($info->dev_info.device) + if ($dict[0] != '\0') + printf " DEVICE=" set var $idx = 0 - set var $line = 1 while ($idx < $dict_len) - if ($line) - printf " " - set var $line = 0 - end set var $c = $dict[$idx] if ($c == '\0') - printf "\n" - set var $line = 1 + loop_break else if ($c < ' ' || $c >= 127 || $c == '\\') printf "\\x%02x", $c @@ -288,10 +280,10 @@ define dump_record end end document dump_record - Dump a single record. The first parameter is t
[PATCH printk v2 0/3] printk: move dictionaries to meta data
Hello,

Here is v2 for a series to move all existing dictionary properties
(SUBSYSTEM and DEVICE) into the meta data of a record, thus eliminating
the need for the dict ring. This change affects how the dictionaries
are stored, but does not affect how they are presented to userspace.
(v1 is here [0]).

The main purpose of the change is to address concerns [1] about the
reliability of dictionary properties as well as to allow the type and
amount of available meta data to be expanded efficiently [2].

This series is based heavily on the proof of concept [3] from Petr
Mladek. (Petr, feel free to add Co-developed-by tags.)

The series is based on the printk-rework branch of the printk git tree:

f5f022e53b87 ("printk: reimplement log_cont using record extension")

The list of changes since v1:

drivers/base/core.c
===================

- set_dev_info(): use strscpy() instead of snprintf() (thank you
  Rasmus Villemoes)

kernel/printk/printk.c
======================

- setup_log_buf(): fix cleanup in error handling

- log_buf_vmcoreinfo_setup(): add VMCOREINFO for struct dev_printk_info
  array sizes so that crash tools do not need to rely on property value
  termination

John Ogness

[0] https://lkml.kernel.org/r/20200917131644.25838-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20200904151336.GC20558@alley
[2] https://lkml.kernel.org/r/008801d684f9$43e1c140$cba543c0$@samsung.com
[3] https://lkml.kernel.org/r/20200911095035.GI3864@alley

John Ogness (3):
  printk: move printk_info into separate array
  printk: move dictionary keys to dev_printk_info
  printk: remove dict ring

 Documentation/admin-guide/kdump/gdbmacros.txt |  73 ++---
 drivers/base/core.c                           |  46 +--
 include/linux/dev_printk.h                    |   8 +
 include/linux/printk.h                        |   6 +-
 kernel/printk/internal.h                      |   4 +-
 kernel/printk/printk.c                        | 221 ++---
 kernel/printk/printk_ringbuffer.c             | 292 --
 kernel/printk/printk_ringbuffer.h             |  95 ++
 kernel/printk/printk_safe.c                   |   2 +-
 scripts/gdb/linux/dmesg.py                    |  16 +-
 10 files changed, 346 insertions(+), 417 deletions(-)

--
2.20.1
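As a rough illustration of the two changelog items in this cover letter (bounded copies into the property fields, and fixed array sizes that tools can rely on instead of termination), here is a userspace sketch. strscpy() is a kernel function and is not available in userspace, so bounded_copy() below is a hypothetical stand-in that only imitates the relevant behaviour: always terminate, and report truncation with a negative value. The field lengths 16 and 48 and the struct name are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SUBSYSTEM_LEN 16 /* assumed size */
#define DEVICE_LEN    48 /* assumed size */

/* Illustrative stand-in for the fixed-size property fields. */
struct dev_info_sketch {
    char subsystem[SUBSYSTEM_LEN];
    char device[DEVICE_LEN];
};

/*
 * Hypothetical stand-in for a strscpy()-style copy: the destination is
 * always NUL-terminated when @size is non-zero. Returns the number of
 * characters copied, or -1 if the source had to be truncated.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
    size_t len;

    if (size == 0)
        return -1;

    len = strlen(src);
    if (len >= size) {
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
        return -1;
    }

    memcpy(dst, src, len + 1);
    return (long)len;
}
```

Because the field sizes are known, a reader of a crash dump can bound its scan by the array size, which is the reason the cover letter gives for exporting those sizes through VMCOREINFO.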
Re: [PATCH printk 1/3] printk: move printk_info into separate array
On 2020-09-18, Petr Mladek wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1097,6 +1097,7 @@ static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata;
>>
>> void __init setup_log_buf(int early)
>> {
>> +        struct printk_info *new_infos;
>>         unsigned int new_descs_count;
>>         struct prb_desc *new_descs;
>>         struct printk_info info;
>> @@ -1156,6 +1157,17 @@ void __init setup_log_buf(int early)
>>                 return;
>>         }
>>
>> +        new_descs_size = new_descs_count * sizeof(struct printk_info);
>
> Must be stored into new variable, e.g. new_infos_size.

Ack.

>> +        new_infos = memblock_alloc(new_descs_size, LOG_ALIGN);
>> +        if (unlikely(!new_infos)) {
>> +                pr_err("log_buf_len: %zu info bytes not available\n",
>> +                       new_descs_size);
>> +                memblock_free(__pa(new_descs), new_log_buf_len);
>> +                memblock_free(__pa(new_dict_buf), new_log_buf_len);
>
> The above two calls have wrong size.
>
> The same problem is there also in the error path when new_descs
> allocation fail. It might be better to handle this using some
> goto err_* tagrets.
>
> Please, fix the old problem in a separate patch.

The "old problem" didn't exist. The problem is introduced with this
series. I will fix it with appropriate goto err_* targets for v2.

>> --- a/kernel/printk/printk_ringbuffer.c
>> +++ b/kernel/printk/printk_ringbuffer.c
>> @@ -1726,12 +1762,12 @@ static bool copy_data(struct prb_data_ring *data_ring,
>>         /*
>>          * Actual cannot be less than expected. It can be more than expected
>>          * because of the trailing alignment padding.
>> +         *
>> +         * Note that invalid @len values can occur because the caller loads
>> +         * the value during an allowed data race.
>
> I hope that this will not bite us in the future. The fact is that
> copying the entire struct printk_info in get_desc() is ugly and
> copy_data() has to be careful anyway.

It isn't an issue because the state is verified again at the end of
prb_read(). I added the comment because if all you are looking at is
copy_data(), you may not know that @len was read on a data-race. Whereas
inside of prb_read(), it is obvious that the memcpy() is a data-race.

John Ogness
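For reference, the goto err_* structure discussed above looks roughly like the following in a userspace setting. This is only a sketch of the pattern, not the setup_log_buf() code: malloc()/free() stand in for memblock, and the counters and the try_alloc()/release() wrappers are hypothetical helpers added so that a failure can be forced and leaks can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct bufs {
    char *log_buf;
    void *descs;
    void *infos;
};

/* Hypothetical test hooks: force the Nth allocation to fail, count live buffers. */
static int fail_at = -1;
static int alloc_calls;
static int live_allocs;

static void *try_alloc(size_t size)
{
    void *p;

    if (alloc_calls++ == fail_at)
        return NULL;

    p = malloc(size);
    if (p)
        live_allocs++;
    return p;
}

static void release(void *p)
{
    free(p);
    live_allocs--;
}

/* Allocate all buffers or none. Each label undoes one earlier step. */
static int setup_bufs(struct bufs *b, size_t log_len,
                      size_t descs_size, size_t infos_size)
{
    assert(b != NULL);

    b->log_buf = try_alloc(log_len);
    if (!b->log_buf)
        goto err;

    b->descs = try_alloc(descs_size);
    if (!b->descs)
        goto err_free_log_buf;

    b->infos = try_alloc(infos_size);
    if (!b->infos)
        goto err_free_descs;

    return 0;

err_free_descs:
    release(b->descs);
err_free_log_buf:
    release(b->log_buf);
err:
    b->log_buf = NULL;
    b->descs = NULL;
    b->infos = NULL;
    return -1;
}
```

The benefit over repeating the free calls in every error branch is that each buffer is released in exactly one place, so a size or pointer mix-up of the kind pointed out in the review has only one spot where it could occur.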
[PATCH printk 3/3] printk: remove dict ring
Since there is no code that will ever store anything into the dict ring, remove it. If any future dictionary properties are to be added, these should be added to the struct printk_info. Signed-off-by: John Ogness --- kernel/printk/printk.c| 45 +++-- kernel/printk/printk_ringbuffer.c | 155 +++--- kernel/printk/printk_ringbuffer.h | 63 +++- 3 files changed, 63 insertions(+), 200 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index b2e2bdd37028..107c09744026 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -427,7 +427,6 @@ static u32 log_buf_len = __LOG_BUF_LEN; * Define the average message size. This only affects the number of * descriptors that will be available. Underestimating is better than * overestimating (too many available descriptors is better than not enough). - * The dictionary buffer will be the same size as the text buffer. */ #define PRB_AVGBITS 5 /* 32 character average length */ @@ -435,7 +434,7 @@ static u32 log_buf_len = __LOG_BUF_LEN; #error CONFIG_LOG_BUF_SHIFT value too small. 
#endif _DEFINE_PRINTKRB(printk_rb_static, CONFIG_LOG_BUF_SHIFT - PRB_AVGBITS, -PRB_AVGBITS, PRB_AVGBITS, &__log_buf[0]); +PRB_AVGBITS, &__log_buf[0]); static struct printk_ringbuffer printk_rb_dynamic; @@ -502,12 +501,12 @@ static int log_store(u32 caller_id, int facility, int level, struct printk_record r; u16 trunc_msg_len = 0; - prb_rec_init_wr(&r, text_len, 0); + prb_rec_init_wr(&r, text_len); if (!prb_reserve(&e, prb, &r)) { /* truncate the message if it is too long for empty buffer */ truncate_msg(&text_len, &trunc_msg_len); - prb_rec_init_wr(&r, text_len + trunc_msg_len, 0); + prb_rec_init_wr(&r, text_len + trunc_msg_len); /* survive when the log buffer is too small for trunc_msg */ if (!prb_reserve(&e, prb, &r)) return 0; @@ -897,8 +896,7 @@ static int devkmsg_open(struct inode *inode, struct file *file) mutex_init(&user->lock); prb_rec_init_rd(&user->record, &user->info, - &user->text_buf[0], sizeof(user->text_buf), - NULL, 0); + &user->text_buf[0], sizeof(user->text_buf)); logbuf_lock_irq(); user->seq = prb_first_valid_seq(prb); @@ -954,7 +952,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(printk_ringbuffer); VMCOREINFO_OFFSET(printk_ringbuffer, desc_ring); VMCOREINFO_OFFSET(printk_ringbuffer, text_data_ring); - VMCOREINFO_OFFSET(printk_ringbuffer, dict_data_ring); VMCOREINFO_OFFSET(printk_ringbuffer, fail); VMCOREINFO_STRUCT_SIZE(prb_desc_ring); @@ -967,7 +964,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(prb_desc); VMCOREINFO_OFFSET(prb_desc, state_var); VMCOREINFO_OFFSET(prb_desc, text_blk_lpos); - VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos); VMCOREINFO_STRUCT_SIZE(prb_data_blk_lpos); VMCOREINFO_OFFSET(prb_data_blk_lpos, begin); @@ -977,7 +973,6 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_OFFSET(printk_info, seq); VMCOREINFO_OFFSET(printk_info, ts_nsec); VMCOREINFO_OFFSET(printk_info, text_len); - VMCOREINFO_OFFSET(printk_info, dict_len); VMCOREINFO_OFFSET(printk_info, caller_id); VMCOREINFO_OFFSET(printk_info, 
dev_info); @@ -1076,7 +1071,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, struct prb_reserved_entry e; struct printk_record dest_r; - prb_rec_init_wr(&dest_r, r->info->text_len, 0); + prb_rec_init_wr(&dest_r, r->info->text_len); if (!prb_reserve(&e, rb, &dest_r)) return 0; @@ -1106,7 +1101,6 @@ void __init setup_log_buf(int early) struct printk_record r; size_t new_descs_size; unsigned long flags; - char *new_dict_buf; char *new_log_buf; unsigned int free; u64 seq; @@ -1141,20 +1135,11 @@ void __init setup_log_buf(int early) return; } - new_dict_buf = memblock_alloc(new_log_buf_len, LOG_ALIGN); - if (unlikely(!new_dict_buf)) { - pr_err("log_buf_len: %lu dict bytes not available\n", - new_log_buf_len); - memblock_free(__pa(new_log_buf), new_log_buf_len); - return; - } - new_descs_size = new_descs_count * sizeof(struct prb_desc); new_descs = memblock_alloc(new_descs_size, LOG_ALIGN); if (unlikely(!new_descs)) { pr_err("log_buf_len: %zu desc bytes not available\n",
[PATCH printk 2/3] printk: move dictionary keys to dev_printk_info
Dictionaries are only used for SUBSYSTEM and DEVICE properties. The current implementation stores the property names each time they are used. This requires more space than otherwise necessary. Also, because the dictionary entries are currently considered optional, it cannot be relied upon that they are always available, even if the writer wanted to store them. These issues will increase should new dictionary properties be introduced. Rather than storing the subsystem and device properties in the dict ring, introduce a struct dev_printk_info with separate fields to store only the property values. Embed this struct within the struct printk_info to provide guaranteed availability. Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 73 drivers/base/core.c | 46 ++--- include/linux/dev_printk.h| 8 + include/linux/printk.h| 6 +- kernel/printk/internal.h | 4 +- kernel/printk/printk.c| 161 +- kernel/printk/printk_ringbuffer.h | 3 + kernel/printk/printk_safe.c | 2 +- scripts/gdb/linux/dmesg.py| 16 +- 9 files changed, 159 insertions(+), 160 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 94fabb165abf..82aecdcae8a6 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -172,13 +172,13 @@ end define dump_record set var $desc = $arg0 - if ($argc > 1) - set var $prev_flags = $arg1 + set var $info = $arg1 + if ($argc > 2) + set var $prev_flags = $arg2 else set var $prev_flags = 0 end - set var $info = &$desc->info set var $prefix = 1 set var $newline = 1 @@ -237,44 +237,36 @@ define dump_record # handle dictionary data - set var $begin = $desc->dict_blk_lpos.begin % (1U << prb->dict_data_ring.size_bits) - set var $next = $desc->dict_blk_lpos.next % (1U << prb->dict_data_ring.size_bits) - - # handle data-less record - if ($begin & 1) - set var $dict_len = 0 - set var $dict = "" - else - # handle wrapping data block - if 
($begin > $next) - set var $begin = 0 - end - - # skip over descriptor id - set var $begin = $begin + sizeof(long) - - # handle truncated message - if ($next - $begin < $info->dict_len) - set var $dict_len = $next - $begin - else - set var $dict_len = $info->dict_len + set var $dict = &$info->dev_info.subsystem[0] + set var $dict_len = sizeof($info->dev_info.subsystem) + if ($dict[0] != '\0') + printf " SUBSYSTEM=" + set var $idx = 0 + while ($idx < $dict_len) + set var $c = $dict[$idx] + if ($c == '\0') + loop_break + else + if ($c < ' ' || $c >= 127 || $c == '\\') + printf "\\x%02x", $c + else + printf "%c", $c + end + end + set var $idx = $idx + 1 end - - set var $dict = &prb->dict_data_ring.data[$begin] + printf "\n" end - if ($dict_len > 0) + set var $dict = &$info->dev_info.device[0] + set var $dict_len = sizeof($info->dev_info.device) + if ($dict[0] != '\0') + printf " DEVICE=" set var $idx = 0 - set var $line = 1 while ($idx < $dict_len) - if ($line) - printf " " - set var $line = 0 - end set var $c = $dict[$idx] if ($c == '\0') - printf "\n" - set var $line = 1 + loop_break else if ($c < ' ' || $c >= 127 || $c == '\\') printf "\\x%02x", $c @@ -288,10 +280,10 @@ define dump_record end end document dump_record - Dump a single record. The first parameter is t
[PATCH printk 0/3] printk: move dictionaries to meta data
Hello,

Here is a series to move dictionary properties (currently only
SUBSYSTEM and DEVICE exist) into the meta data of a record, thus
eliminating the need for the dict ring. This change affects how the
dictionaries are stored, but does not affect how they are presented to
userspace.

The main purpose of the change is to address concerns [0] about the
reliability of dictionary properties as well as to allow the type and
number of available properties to be expanded efficiently [1].

This series is based heavily on the proof of concept [2] from Petr
Mladek. (Petr, feel free to add Co-developed-by tags.)

The series is based on the printk-rework branch of the printk git tree:

f5f022e53b87 ("printk: reimplement log_cont using record extension")

John Ogness

[0] https://lkml.kernel.org/r/20200904151336.GC20558@alley
[1] https://lkml.kernel.org/r/008801d684f9$43e1c140$cba543c0$@samsung.com
[2] https://lkml.kernel.org/r/20200911095035.GI3864@alley

John Ogness (3):
  printk: move printk_info into separate array
  printk: move dictionary keys to dev_printk_info
  printk: remove dict ring

 Documentation/admin-guide/kdump/gdbmacros.txt |  73 ++---
 drivers/base/core.c                           |  46 +--
 include/linux/dev_printk.h                    |   8 +
 include/linux/printk.h                        |   6 +-
 kernel/printk/internal.h                      |   4 +-
 kernel/printk/printk.c                        | 209 ++---
 kernel/printk/printk_ringbuffer.c             | 292 --
 kernel/printk/printk_ringbuffer.h             |  95 ++
 kernel/printk/printk_safe.c                   |   2 +-
 scripts/gdb/linux/dmesg.py                    |  16 +-
 10 files changed, 336 insertions(+), 415 deletions(-)

--
2.20.1
[PATCH printk 1/3] printk: move printk_info into separate array
The majority of the size of a descriptor is taken up by meta data, which is often not of interest to the ringbuffer (for example, when performing state checks). Since descriptors are often temporarily stored on the stack, keeping their size minimal will help reduce stack pressure. Rather than embedding the printk_info into the descriptor, create a separate printk_info array. The index of a descriptor in the descriptor array corresponds to the printk_info with the same index in the printk_info array. The rules for validity of a printk_info match the existing rules for the data blocks: the descriptor must be in a consistent state. Signed-off-by: John Ogness --- kernel/printk/printk.c| 17 +++- kernel/printk/printk_ringbuffer.c | 145 +++--- kernel/printk/printk_ringbuffer.h | 29 +++--- 3 files changed, 125 insertions(+), 66 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 9a2e23191576..7ad45d897277 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -959,11 +959,11 @@ void log_buf_vmcoreinfo_setup(void) VMCOREINFO_STRUCT_SIZE(prb_desc_ring); VMCOREINFO_OFFSET(prb_desc_ring, count_bits); VMCOREINFO_OFFSET(prb_desc_ring, descs); + VMCOREINFO_OFFSET(prb_desc_ring, infos); VMCOREINFO_OFFSET(prb_desc_ring, head_id); VMCOREINFO_OFFSET(prb_desc_ring, tail_id); VMCOREINFO_STRUCT_SIZE(prb_desc); - VMCOREINFO_OFFSET(prb_desc, info); VMCOREINFO_OFFSET(prb_desc, state_var); VMCOREINFO_OFFSET(prb_desc, text_blk_lpos); VMCOREINFO_OFFSET(prb_desc, dict_blk_lpos); @@ -1097,6 +1097,7 @@ static char setup_dict_buf[CONSOLE_EXT_LOG_MAX] __initdata; void __init setup_log_buf(int early) { + struct printk_info *new_infos; unsigned int new_descs_count; struct prb_desc *new_descs; struct printk_info info; @@ -1156,6 +1157,17 @@ void __init setup_log_buf(int early) return; } + new_descs_size = new_descs_count * sizeof(struct printk_info); + new_infos = memblock_alloc(new_descs_size, LOG_ALIGN); + if (unlikely(!new_infos)) { + pr_err("log_buf_len: 
%zu info bytes not available\n", + new_descs_size); + memblock_free(__pa(new_descs), new_log_buf_len); + memblock_free(__pa(new_dict_buf), new_log_buf_len); + memblock_free(__pa(new_log_buf), new_log_buf_len); + return; + } + prb_rec_init_rd(&r, &info, &setup_text_buf[0], sizeof(setup_text_buf), &setup_dict_buf[0], sizeof(setup_dict_buf)); @@ -1163,7 +1175,8 @@ void __init setup_log_buf(int early) prb_init(&printk_rb_dynamic, new_log_buf, ilog2(new_log_buf_len), new_dict_buf, ilog2(new_log_buf_len), -new_descs, ilog2(new_descs_count)); +new_descs, ilog2(new_descs_count), +new_infos); logbuf_lock_irqsave(flags); diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index f4e2e9890e0f..de4b10a98623 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -15,10 +15,10 @@ * The printk_ringbuffer is made up of 3 internal ringbuffers: * * desc_ring - * A ring of descriptors. A descriptor contains all record meta data - * (sequence number, timestamp, loglevel, etc.) as well as internal state - * information about the record and logical positions specifying where in - * the other ringbuffers the text and dictionary strings are located. + * A ring of descriptors and their meta data (such as sequence number, + * timestamp, loglevel, etc.) as well as internal state information about + * the record and logical positions specifying where in the other + * ringbuffers the text and dictionary strings are located. * * text_data_ring * A ring of data blocks. A data block consists of an unsigned long @@ -38,13 +38,14 @@ * * Descriptor Ring * ~~~ - * The descriptor ring is an array of descriptors. A descriptor contains all - * the meta data of a printk record as well as blk_lpos structs pointing to - * associated text and dictionary data blocks (see "Data Rings" below). Each - * descriptor is assigned an ID that maps directly to index values of the - * descriptor array and has a state. 
The ID and the state are bitwise combined - * into a single descriptor field named @state_var, allowing ID and state to - * be synchronously and atomically updated. + * The descriptor ring is an array of descriptors. A descriptor contains + * essential meta data to track the data of a printk record using + * blk_lpos structs pointing to associated text and dictionary data blocks + * (see "Data Rings" below). Each descriptor is assigned a
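The parallel-array layout described in this patch (a descriptor and its printk_info share the same index) can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel code: all demo_* names and the field selection are made up for illustration, and the ring size is fixed at 8 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_COUNT_BITS	3
#define DEMO_COUNT	(1UL << DEMO_COUNT_BITS)

/* Small: only what the ringbuffer needs for state checks. */
struct demo_desc {
	unsigned long state_var;
};

/* Larger: the record meta data, kept out of the descriptor. */
struct demo_info {
	uint64_t seq;
	uint64_t ts_nsec;
	uint16_t text_len;
	uint32_t caller_id;
};

struct demo_desc_ring {
	struct demo_desc descs[DEMO_COUNT];
	struct demo_info infos[DEMO_COUNT];	/* same index as descs[] */
};

/* Map an ID to its slot; IDs simply wrap around the array. */
static unsigned long demo_index(unsigned long id)
{
	return id & (DEMO_COUNT - 1);
}

static struct demo_desc *demo_to_desc(struct demo_desc_ring *ring,
				      unsigned long id)
{
	assert(ring != NULL);
	return &ring->descs[demo_index(id)];
}

static struct demo_info *demo_to_info(struct demo_desc_ring *ring,
				      unsigned long id)
{
	assert(ring != NULL);
	return &ring->infos[demo_index(id)];
}
```

With this split, code that only needs to copy a descriptor onto the stack for a state check copies sizeof(struct demo_desc) bytes rather than the descriptor plus all of its meta data.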
[PATCH printk v5 0/6] printk: reimplement LOG_CONT handling
Hello,

Here is v5 for the second series to rework the printk subsystem.
(The v4 is here [0].) This series implements a new ringbuffer
feature that allows the last record to be extended. Petr Mladek
provided the initial proof of concept [1] for this.

Using the record extension feature, LOG_CONT is re-implemented in a
way that exactly preserves its behavior, but avoids the need for an
extra buffer. In particular, it avoids the need for any
synchronization that such a buffer requires.

This series deviates from the agreements [2] made at the meeting
during LPC2019 in Lisbon. The test results of the v1 series, which
implemented LOG_CONT as agreed upon, showed that the effects on
existing userspace tools using /dev/kmsg (journalctl, dmesg) were
not acceptable [3].

Patch 5 introduces *four* new memory barrier pairs. Two of them are
insignificant additions (data_realloc:A/desc_read:D and
data_realloc:A/data_push_tail:B) because they are alternate path
memory barriers that exactly match the purpose and context of the
two existing memory barrier pairs they provide an alternate path
for. The other two new memory barrier pairs are significant
additions:

    desc_reopen_last:A / _prb_commit:B - When reopening a descriptor,
        ensure the state transitions back to desc_reserved before
        fully trusting the descriptor data.

    _prb_commit:B / desc_reserve:D - When committing a descriptor,
        ensure the state transitions to desc_committed before
        checking the head ID to see if the descriptor needs to be
        finalized.

The test module used to test the ringbuffer is available here [4].

The series is based on the printk-rework branch of the printk git
tree:

    e60768311af8 ("scripts/gdb: update for lockless printk ringbuffer")

The list of changes since v4:

printk_ringbuffer
=================

- desc_read(): revert setting @state_var when inconsistent (a
  separate series [5] is addressing this bug)

- desc_reserve(): use DESC_SV() when setting reserved

- data_realloc(): also do nothing if the size is the same

- prb_reserve_in_last(): adjust dataless checks/warnings to match
  the non-dataless case

- prb_reserve_in_last(): fix length modifier in warnings

- change comments about "state flags" to just talk about "states"

John Ogness

[0] https://lkml.kernel.org/r/20200908202859.2736-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20200812163908.GH12903@alley
[2] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
[3] https://lkml.kernel.org/r/20200811160551.GC12903@alley
[4] https://github.com/Linutronix/prb-test.git
[5] https://lkml.kernel.org/r/20200914094803.27365-1-john.ogn...@linutronix.de

John Ogness (6):
  printk: ringbuffer: relocate get_data()
  printk: ringbuffer: add BLK_DATALESS() macro
  printk: ringbuffer: clear initial reserved fields
  printk: ringbuffer: change representation of states
  printk: ringbuffer: add finalization/extension support
  printk: reimplement log_cont using record extension

 Documentation/admin-guide/kdump/gdbmacros.txt |  13 +-
 kernel/printk/printk.c                        | 110 +--
 kernel/printk/printk_ringbuffer.c             | 683 ++
 kernel/printk/printk_ringbuffer.h             |  35 +-
 scripts/gdb/linux/dmesg.py                    |  12 +-
 5 files changed, 615 insertions(+), 238 deletions(-)

-- 
2.20.1
[PATCH printk v5 4/6] printk: ringbuffer: change representation of states
Rather than deriving the state by evaluating bits within the flags area of the state variable, assign the states explicit values and set those values in the flags area. Introduce macros to make it simple to read and write state values for the state variable. Although the functionality is preserved, the binary representation for the states is changed. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- Documentation/admin-guide/kdump/gdbmacros.txt | 12 --- kernel/printk/printk_ringbuffer.c | 28 + kernel/printk/printk_ringbuffer.h | 31 --- scripts/gdb/linux/dmesg.py| 11 --- 4 files changed, 41 insertions(+), 41 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 7adece30237e..8f533b751c46 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -295,9 +295,12 @@ document dump_record end define dmesg - set var $desc_committed = 1UL << ((sizeof(long) * 8) - 1) - set var $flags_mask = 3UL << ((sizeof(long) * 8) - 2) - set var $id_mask = ~$flags_mask + # definitions from kernel/printk/printk_ringbuffer.h + set var $desc_committed = 1 + set var $desc_sv_bits = sizeof(long) * 8 + set var $desc_flags_shift = $desc_sv_bits - 2 + set var $desc_flags_mask = 3 << $desc_flags_shift + set var $id_mask = ~$desc_flags_mask set var $desc_count = 1U << prb->desc_ring.count_bits set var $prev_flags = 0 @@ -309,7 +312,8 @@ define dmesg set var $desc = &prb->desc_ring.descs[$id % $desc_count] # skip non-committed record - if (($desc->state_var.counter & $flags_mask) == $desc_committed) + set var $state = 3 & ($desc->state_var.counter >> $desc_flags_shift) + if ($state == $desc_committed) dump_record $desc $prev_flags set var $prev_flags = $desc->info.flags end diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 82347abb22a5..911fbe150e9a 100644 --- a/kernel/printk/printk_ringbuffer.c +++ 
b/kernel/printk/printk_ringbuffer.c @@ -348,14 +348,6 @@ static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size) return true; } -/* The possible responses of a descriptor state-query. */ -enum desc_state { - desc_miss, /* ID mismatch */ - desc_reserved, /* reserved, in use by writer */ - desc_committed, /* committed, writer is done */ - desc_reusable, /* free, not yet used by any writer */ -}; - /* Query the state of a descriptor. */ static enum desc_state get_desc_state(unsigned long id, unsigned long state_val) @@ -363,13 +355,7 @@ static enum desc_state get_desc_state(unsigned long id, if (id != DESC_ID(state_val)) return desc_miss; - if (state_val & DESC_REUSE_MASK) - return desc_reusable; - - if (state_val & DESC_COMMITTED_MASK) - return desc_committed; - - return desc_reserved; + return DESC_STATE(state_val); } /* @@ -467,8 +453,8 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring, static void desc_make_reusable(struct prb_desc_ring *desc_ring, unsigned long id) { - unsigned long val_committed = id | DESC_COMMITTED_MASK; - unsigned long val_reusable = val_committed | DESC_REUSE_MASK; + unsigned long val_committed = DESC_SV(id, desc_committed); + unsigned long val_reusable = DESC_SV(id, desc_reusable); struct prb_desc *desc = to_desc(desc_ring, id); atomic_long_t *state_var = &desc->state_var; @@ -904,7 +890,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) */ prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */ if (prev_state_val && - prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) { + get_desc_state(id_prev_wrap, prev_state_val) != desc_reusable) { WARN_ON_ONCE(1); return false; } @@ -918,7 +904,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) * This pairs with desc_read:D. 
*/ if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val, -id | 0)) { /* LMM(desc_reserve:F) */ + DESC_SV(id, desc_reserved))) { /* LMM(desc_reserve:F) */ WARN_ON_ONCE(1); return false; } @@ -1237,7 +1223,7 @@ void prb_commit(struct prb_reserved_entry *e) { struct p
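To make the new state encoding easier to picture, here is a small userspace model of packing an ID and an explicit state value into one word, following the layout visible in the gdb macro changes above (state in the top two bits, committed == 1). It is a sketch under assumptions: the DEMO_*/demo_* names are hypothetical, and the values used for the miss and reusable states are assumed for illustration because they do not appear in the quoted hunks.

```c
#include <assert.h>

#define DEMO_SV_BITS		(sizeof(unsigned long) * 8)
#define DEMO_FLAGS_SHIFT	(DEMO_SV_BITS - 2)
#define DEMO_FLAGS_MASK		(3UL << DEMO_FLAGS_SHIFT)
#define DEMO_ID_MASK		(~DEMO_FLAGS_MASK)

/* Extract the ID (low bits) and the state (top two bits). */
#define DEMO_ID(sv)		((sv) & DEMO_ID_MASK)
#define DEMO_STATE(sv)		(3UL & ((sv) >> DEMO_FLAGS_SHIFT))

/* Build a state variable from an ID and an explicit state value. */
#define DEMO_SV(id, state) \
	(((unsigned long)(state) << DEMO_FLAGS_SHIFT) | (id))

enum demo_state {
	demo_miss	= -1,	/* ID mismatch (pseudo state), assumed value */
	demo_reserved	= 0x0,	/* reserved, in use by writer */
	demo_committed	= 0x1,	/* committed, writer is done */
	demo_reusable	= 0x3,	/* free for reuse, assumed value */
};

/* Query the state for a given ID, as get_desc_state() does conceptually. */
static enum demo_state demo_get_state(unsigned long id,
				      unsigned long state_val)
{
	/* An ID must never spill into the state bits. */
	assert((id & DEMO_FLAGS_MASK) == 0);

	if (id != DEMO_ID(state_val))
		return demo_miss;

	return (enum demo_state)DEMO_STATE(state_val);
}
```

Because reserved is encoded as 0, DEMO_SV(id, demo_reserved) is numerically just the ID, which matches the old "id | 0" expression that the patch replaces with DESC_SV(id, desc_reserved).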
[PATCH printk v5 2/6] printk: ringbuffer: add BLK_DATALESS() macro
Rather than continually needing to explicitly check @begin and @next to
identify a dataless block, introduce and use a BLK_DATALESS() macro.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk_ringbuffer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index aa6e31a27601..6ee5ebce1450 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -266,6 +266,8 @@
 
 /* Determine if a logical position refers to a data-less block. */
 #define LPOS_DATALESS(lpos)		((lpos) & 1UL)
+#define BLK_DATALESS(blk)		(LPOS_DATALESS((blk)->begin) && \
+					 LPOS_DATALESS((blk)->next))
 
 /* Get the logical position at index 0 of the current wrap. */
 #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
@@ -1021,7 +1023,7 @@ static unsigned int space_used(struct prb_data_ring *data_ring,
 			       struct prb_data_blk_lpos *blk_lpos)
 {
 	/* Data-less blocks take no space. */
-	if (LPOS_DATALESS(blk_lpos->begin))
+	if (BLK_DATALESS(blk_lpos))
 		return 0;
 
 	if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
@@ -1054,7 +1056,7 @@ static const char *get_data(struct prb_data_ring *data_ring,
 	struct prb_data_block *db;
 
 	/* Data-less data block description. */
-	if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) {
+	if (BLK_DATALESS(blk_lpos)) {
 		if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
 			*data_size = 0;
 			return "";
-- 
2.20.1
[PATCH printk v5 1/6] printk: ringbuffer: relocate get_data()
Move the internal get_data() function as-is above prb_reserve() so that a later change can make use of the static function. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk_ringbuffer.c | 116 +++--- 1 file changed, 58 insertions(+), 58 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 0659b50872b5..aa6e31a27601 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -1038,6 +1038,64 @@ static unsigned int space_used(struct prb_data_ring *data_ring, DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin)); } +/* + * Given @blk_lpos, return a pointer to the writer data from the data block + * and calculate the size of the data part. A NULL pointer is returned if + * @blk_lpos specifies values that could never be legal. + * + * This function (used by readers) performs strict validation on the lpos + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is + * triggered if an internal error is detected. + */ +static const char *get_data(struct prb_data_ring *data_ring, + struct prb_data_blk_lpos *blk_lpos, + unsigned int *data_size) +{ + struct prb_data_block *db; + + /* Data-less data block description. */ + if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { + if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { + *data_size = 0; + return ""; + } + return NULL; + } + + /* Regular data block: @begin less than @next and in same wrap. */ + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && + blk_lpos->begin < blk_lpos->next) { + db = to_block(data_ring, blk_lpos->begin); + *data_size = blk_lpos->next - blk_lpos->begin; + + /* Wrapping data block: @begin is one wrap behind @next. 
*/ + } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == + DATA_WRAPS(data_ring, blk_lpos->next)) { + db = to_block(data_ring, 0); + *data_size = DATA_INDEX(data_ring, blk_lpos->next); + + /* Illegal block description. */ + } else { + WARN_ON_ONCE(1); + return NULL; + } + + /* A valid data block will always be aligned to the ID size. */ + if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) || + WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id)))) { + return NULL; + } + + /* A valid data block will always have at least an ID. */ + if (WARN_ON_ONCE(*data_size < sizeof(db->id))) + return NULL; + + /* Subtract block ID space from size to reflect data size. */ + *data_size -= sizeof(db->id); + + return &db->data[0]; +} + /** * prb_reserve() - Reserve space in the ringbuffer. * @@ -1192,64 +1250,6 @@ void prb_commit(struct prb_reserved_entry *e) local_irq_restore(e->irqflags); } -/* - * Given @blk_lpos, return a pointer to the writer data from the data block - * and calculate the size of the data part. A NULL pointer is returned if - * @blk_lpos specifies values that could never be legal. - * - * This function (used by readers) performs strict validation on the lpos - * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is - * triggered if an internal error is detected. - */ -static const char *get_data(struct prb_data_ring *data_ring, - struct prb_data_blk_lpos *blk_lpos, - unsigned int *data_size) -{ - struct prb_data_block *db; - - /* Data-less data block description. */ - if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { - if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { - *data_size = 0; - return ""; - } - return NULL; - } - - /* Regular data block: @begin less than @next and in same wrap. 
*/ - if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && - blk_lpos->begin < blk_lpos->next) { - db = to_block(data_ring, blk_lpos->begin); - *data_size = blk_lpos->next - blk_lpos->begin; - - /* Wrapping data block: @begin is one wrap behind @next. */ - } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == - DATA_WRAPS(data_ring, blk_lpos->next)) { - db = to_block(data_rin
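The size calculation performed by get_data() (regular block versus wrapping block) may be easier to follow with concrete numbers. Below is a reduced userspace sketch of just that arithmetic for a 256-byte data ring. The DEMO_* macros and demo_data_size() are hypothetical stand-ins; the data-less case and the WARN_ON_ONCE() reporting are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SIZE_BITS		8
#define DEMO_SIZE		(1UL << DEMO_SIZE_BITS)
#define DEMO_INDEX(lpos)	((lpos) & (DEMO_SIZE - 1))
#define DEMO_WRAPS(lpos)	((lpos) >> DEMO_SIZE_BITS)

/* Every data block starts with an ID of this size. */
#define DEMO_ID_SIZE		sizeof(unsigned long)

/*
 * Compute the writer-visible data size of the block [begin, next).
 * Returns 0 on success and -1 if the positions could never be legal.
 */
static int demo_data_size(unsigned long begin, unsigned long next,
			  unsigned long *data_size)
{
	assert(data_size != NULL);

	if (DEMO_WRAPS(begin) == DEMO_WRAPS(next) && begin < next) {
		/* Regular block: both positions are in the same wrap. */
		*data_size = next - begin;
	} else if (DEMO_WRAPS(begin + DEMO_SIZE) == DEMO_WRAPS(next)) {
		/* Wrapping block: the data was placed at index 0. */
		*data_size = DEMO_INDEX(next);
	} else {
		/* Illegal block description. */
		return -1;
	}

	/* A valid block is always aligned to the ID size. */
	if (begin % DEMO_ID_SIZE || next % DEMO_ID_SIZE)
		return -1;

	/* A valid block always contains at least an ID. */
	if (*data_size < DEMO_ID_SIZE)
		return -1;

	/* Subtract the ID to get the size of the data part. */
	*data_size -= DEMO_ID_SIZE;
	return 0;
}
```

For example, begin = 240 and next = 288 describes a block that did not fit at the end of the first wrap: its data lives at indexes 0..31 of the ring, so the block spans 32 bytes including the ID.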
[PATCH printk v5 6/6] printk: reimplement log_cont using record extension
Use the record extending feature of the ringbuffer to implement continuous messages. This preserves the existing continuous message behavior. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk.c | 98 +- 1 file changed, 20 insertions(+), 78 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 964b5701688f..9a2e23191576 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -535,7 +535,10 @@ static int log_store(u32 caller_id, int facility, int level, r.info->caller_id = caller_id; /* insert message */ - prb_commit(&e); + if ((flags & LOG_CONT) || !(flags & LOG_NEWLINE)) + prb_commit(&e); + else + prb_final_commit(&e); return (text_len + trunc_msg_len); } @@ -1084,7 +1087,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, dest_r.info->ts_nsec = r->info->ts_nsec; dest_r.info->caller_id = r->info->caller_id; - prb_commit(&e); + prb_final_commit(&e); return prb_record_text_space(&e); } @@ -1884,87 +1887,26 @@ static inline u32 printk_caller_id(void) 0x8000 + raw_smp_processor_id(); } -/* - * Continuation lines are buffered, and not committed to the record buffer - * until the line is complete, or a race forces it. The line fragments - * though, are printed immediately to the consoles to ensure everything has - * reached the console in case of a kernel crash. 
- */ -static struct cont { - char buf[LOG_LINE_MAX]; - size_t len; /* length == 0 means unused buffer */ - u32 caller_id; /* printk_caller_id() of first print */ - u64 ts_nsec;/* time of first print */ - u8 level; /* log level of first message */ - u8 facility;/* log facility of first message */ - enum log_flags flags; /* prefix, newline flags */ -} cont; - -static void cont_flush(void) -{ - if (cont.len == 0) - return; - - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; -} - -static bool cont_add(u32 caller_id, int facility, int level, -enum log_flags flags, const char *text, size_t len) -{ - /* If the line gets too long, split it up in separate records. */ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; - } - - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; - - // The original flags come from the first line, - // but later continuations can add a newline. - if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); - } - - return true; -} - static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) { const u32 caller_id = printk_caller_id(); - /* -* If an earlier line was buffered, and we're a continuation -* write from the same context, try to add it to the buffer. 
-*/ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) + if (lflags & LOG_CONT) { + struct prb_reserved_entry e; + struct printk_record r; + + prb_rec_init_wr(&r, text_len, 0); + if (prb_reserve_in_last(&e, prb, &r, caller_id)) { + memcpy(&r.text_buf[r.info->text_len], text, text_len); + r.info->text_len += text_len; + if (lflags & LOG_NEWLINE) { + r.info->flags |= LOG_NEWLINE; + prb_final_commit(&e); +
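The idea behind the new LOG_CONT path (append to the last record if it belongs to the same caller and is still open, finalize once a newline arrives) can be illustrated with a toy model. This is not the ringbuffer API: struct demo_record and demo_extend_last() are invented for the example, there is no locking or atomics, and the buffer is a fixed 64 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TEXT_MAX	64

/* Toy stand-in for the newest record in the ringbuffer. */
struct demo_record {
	char		text[DEMO_TEXT_MAX];
	size_t		text_len;
	unsigned int	caller_id;
	bool		finalized;	/* no further extension allowed */
};

/*
 * Try to append a continuation fragment to the last record.
 * Returns false if the caller must store a new record instead.
 */
static bool demo_extend_last(struct demo_record *last, unsigned int caller_id,
			     const char *text, size_t text_len, bool newline)
{
	assert(last != NULL && text != NULL);

	/* Only the same context may extend, and only an open record. */
	if (last->finalized || last->caller_id != caller_id)
		return false;

	/* Too long: the fragment has to go into a separate record. */
	if (text_len > sizeof(last->text) - last->text_len)
		return false;

	/* Append behind the existing text and account for the new length. */
	memcpy(&last->text[last->text_len], text, text_len);
	last->text_len += text_len;

	/* A newline completes the line; the record can no longer grow. */
	if (newline)
		last->finalized = true;

	return true;
}
```

When demo_extend_last() returns false, the fragment would simply be stored as a record of its own, which is the role log_store() plays in the patch.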
[PATCH printk v5 5/6] printk: ringbuffer: add finalization/extension support
Add support for extending the newest data block. For this, introduce
a new finalization state (desc_finalized) denoting a committed
descriptor that cannot be extended.

Until a record is finalized, a writer can reopen that record to
append new data. Reopening a record means transitioning from the
desc_committed state back to the desc_reserved state.

A writer can explicitly finalize a record if there is no intention
of extending it. Also, records are automatically finalized when a
new record is reserved. This relieves writers of needing to
explicitly finalize while also making such records available to
readers sooner. (Readers can only traverse finalized records.)

Four new memory barrier pairs are introduced. Two of them are
insignificant additions (data_realloc:A/desc_read:D and
data_realloc:A/data_push_tail:B) because they are alternate path
memory barriers that exactly match the purpose, pairing, and
context of the two existing memory barrier pairs they provide an
alternate path for. The other two new memory barrier pairs are
significant additions:

    desc_reopen_last:A / _prb_commit:B - When reopening a descriptor,
        ensure the state transitions back to desc_reserved before
        fully trusting the descriptor data.

    _prb_commit:B / desc_reserve:D - When committing a descriptor,
        ensure the state transitions to desc_committed before
        checking the head ID to see if the descriptor needs to be
        finalized.
Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 3 +- kernel/printk/printk_ringbuffer.c | 525 -- kernel/printk/printk_ringbuffer.h | 6 +- scripts/gdb/linux/dmesg.py| 3 +- 4 files changed, 480 insertions(+), 57 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 8f533b751c46..94fabb165abf 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -297,6 +297,7 @@ end define dmesg # definitions from kernel/printk/printk_ringbuffer.h set var $desc_committed = 1 + set var $desc_finalized = 2 set var $desc_sv_bits = sizeof(long) * 8 set var $desc_flags_shift = $desc_sv_bits - 2 set var $desc_flags_mask = 3 << $desc_flags_shift @@ -313,7 +314,7 @@ define dmesg # skip non-committed record set var $state = 3 & ($desc->state_var.counter >> $desc_flags_shift) - if ($state == $desc_committed) + if ($state == $desc_committed || $state == $desc_finalized) dump_record $desc $prev_flags set var $prev_flags = $desc->info.flags end diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 911fbe150e9a..4e526c79f89c 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -46,20 +46,26 @@ * into a single descriptor field named @state_var, allowing ID and state to * be synchronously and atomically updated. * - * Descriptors have three states: + * Descriptors have four states: * * reserved * A writer is modifying the record. * * committed - * The record and all its data are complete and available for reading. + * The record and all its data are written. A writer can reopen the + * descriptor (transitioning it back to reserved), but in the committed + * state the data is consistent. + * + * finalized + * The record and all its data are complete and available for reading. A + * writer cannot reopen the descriptor. 
* * reusable * The record exists, but its text and/or dictionary data may no longer * be available. * * Querying the @state_var of a record requires providing the ID of the - * descriptor to query. This can yield a possible fourth (pseudo) state: + * descriptor to query. This can yield a possible fifth (pseudo) state: * * miss * The descriptor being queried has an unexpected ID. @@ -79,6 +85,28 @@ * committed or reusable queried state. This makes it possible that a valid * sequence number of the tail is always available. * + * Descriptor Finalization + * ~~~ + * When a writer calls the commit function prb_commit(), record data is + * fully stored and is consistent within the ringbuffer. However, a writer can + * reopen that record, claiming exclusive access (as with prb_reserve()), and + * modify that record. When finished, the writer must again commit the record. + * + * In order for a record to be made available to readers (and also become + * recyclable for writers), it must be finalized. A finalized record cannot be + * reopened and can never become "unfinalized". Record finalization can occur + * in three different scenarios: + * + * 1) A writer can simultaneously commit and finalize its record by c
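The state transitions described above (reserved to committed, committed back to reserved on reopen, committed to finalized) can be sketched with C11 atomics. This is a heavily reduced model and not the ringbuffer implementation: the demo_* names are hypothetical, the state is kept in a plain atomic_int without an ID, and none of the memory barrier pairing discussed in the commit message is modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_state {
	demo_reserved,	/* a writer is modifying the record */
	demo_committed,	/* data is consistent, record may be reopened */
	demo_finalized,	/* complete, readable, cannot be reopened */
	demo_reusable,	/* record exists, data may be gone */
};

/* Atomically move from one state to another; fails if the state differs. */
static bool demo_transition(atomic_int *state, int from, int to)
{
	int expected = from;

	assert(state != NULL);
	return atomic_compare_exchange_strong(state, &expected, to);
}

static bool demo_commit(atomic_int *state)
{
	return demo_transition(state, demo_reserved, demo_committed);
}

/* Reopen for extension: only possible while merely committed. */
static bool demo_reopen(atomic_int *state)
{
	return demo_transition(state, demo_committed, demo_reserved);
}

static bool demo_finalize(atomic_int *state)
{
	return demo_transition(state, demo_committed, demo_finalized);
}

/* Readers only traverse finalized records. */
static bool demo_readable(atomic_int *state)
{
	assert(state != NULL);
	return atomic_load(state) == demo_finalized;
}
```

The compare-and-exchange is what makes reopen and finalize mutually exclusive in this model: whichever of the two succeeds first changes the state, so the other one fails.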
[PATCH printk v5 3/6] printk: ringbuffer: clear initial reserved fields
prb_reserve() will set some meta data values and leave others uninitialized (or rather, containing the values of the previous wrap). Simplify the API by always clearing out all the fields. Only the sequence number is filled in. The caller is now responsible for filling in the rest of the meta data fields. In particular, for correctly filling in text and dict lengths. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk.c| 12 kernel/printk/printk_ringbuffer.c | 30 ++ 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index fec71229169e..964b5701688f 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -520,8 +520,11 @@ static int log_store(u32 caller_id, int facility, int level, memcpy(&r.text_buf[0], text, text_len); if (trunc_msg_len) memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len); - if (r.dict_buf) + r.info->text_len = text_len + trunc_msg_len; + if (r.dict_buf) { memcpy(&r.dict_buf[0], dict, dict_len); + r.info->dict_len = dict_len; + } r.info->facility = facility; r.info->level = level & 7; r.info->flags = flags & 0x1f; @@ -1069,10 +1072,11 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, if (!prb_reserve(&e, rb, &dest_r)) return 0; - memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size); + memcpy(&dest_r.text_buf[0], &r->text_buf[0], r->info->text_len); + dest_r.info->text_len = r->info->text_len; if (dest_r.dict_buf) { - memcpy(&dest_r.dict_buf[0], &r->dict_buf[0], - dest_r.dict_buf_size); + memcpy(&dest_r.dict_buf[0], &r->dict_buf[0], r->info->dict_len); + dest_r.info->dict_len = r->info->dict_len; } dest_r.info->facility = r->info->facility; dest_r.info->level = r->info->level; diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 6ee5ebce1450..82347abb22a5 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -146,10 +146,13 @@ * * if 
(prb_reserve(&e, &test_rb, &r)) { * snprintf(r.text_buf, r.text_buf_size, "%s", textstr); + * r.info->text_len = strlen(textstr); * * // dictionary allocation may have failed - * if (r.dict_buf) + * if (r.dict_buf) { * snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr); + * r.info->dict_len = strlen(dictstr); + * } * * r.info->ts_nsec = local_clock(); * @@ -1125,9 +1128,9 @@ static const char *get_data(struct prb_data_ring *data_ring, * @dict_buf_size is set to 0. Writers must check this before writing to * dictionary space. * - * @info->text_len and @info->dict_len will already be set to @text_buf_size - * and @dict_buf_size, respectively. If dictionary space reservation fails, - * @info->dict_len is set to 0. + * Important: @info->text_len and @info->dict_len need to be set correctly by + *the writer in order for data to be readable and/or extended. + *Their values are initialized to 0. */ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, struct printk_record *r) @@ -1135,6 +1138,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, struct prb_desc_ring *desc_ring = &rb->desc_ring; struct prb_desc *d; unsigned long id; + u64 seq; if (!data_check_size(&rb->text_data_ring, r->text_buf_size)) goto fail; @@ -1159,6 +1163,14 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, d = to_desc(desc_ring, id); + /* +* All @info fields (except @seq) are cleared and must be filled in +* by the writer. Save @seq before clearing because it is used to +* determine the new sequence number. +*/ + seq = d->info.seq; + memset(&d->info, 0, sizeof(d->info)); + /* * Set the @e fields here so that prb_commit() can be used if * text data allocation fails. @@ -1177,17 +1189,15 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, * See the "Bootstrap" comment block in printk_ringbuffer.h for * details about how the initializer bootstraps the descriptors. 
*/ - if (d->info.seq == 0 && DESC_INDEX(desc_ring, id) != 0) + if (seq == 0 && DESC_INDE
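The changed contract of prb_reserve() (all meta data cleared, the previous sequence number saved first, lengths filled in by the writer) can be shown with a small standalone sketch. The struct layout and the demo_* helpers below are assumptions made for illustration only; how the new sequence number is derived from the saved one is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced, hypothetical stand-in for the record meta data. */
struct demo_info {
	uint64_t seq;
	uint64_t ts_nsec;
	uint16_t text_len;
	uint16_t dict_len;
	uint8_t  level;
};

/*
 * Clear all meta data of a recycled slot. The old sequence number is
 * returned because the caller still needs it to compute the new one.
 */
static uint64_t demo_clear_info(struct demo_info *info)
{
	uint64_t prev_seq;

	assert(info != NULL);

	prev_seq = info->seq;
	memset(info, 0, sizeof(*info));
	return prev_seq;
}

/*
 * Writer side: copy the text and record how much was really stored.
 * After the change the length is no longer pre-set to the buffer size.
 */
static void demo_store_text(struct demo_info *info, char *buf,
			    size_t buf_size, const char *text)
{
	size_t len = strlen(text);

	/* Never claim more than what fits into the reserved buffer. */
	if (len > buf_size)
		len = buf_size;

	memcpy(buf, text, len);
	info->text_len = (uint16_t)len;
}
```

A reader (or a later extension of the record) can then rely on text_len describing the bytes actually written rather than the size that was reserved.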
[PATCH printk v4 2/6] printk: ringbuffer: add BLK_DATALESS() macro
Rather than continually needing to explicitly check @begin and @next to
identify a dataless block, introduce and use a BLK_DATALESS() macro.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk_ringbuffer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index aa6e31a27601..6ee5ebce1450 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -266,6 +266,8 @@
 
 /* Determine if a logical position refers to a data-less block. */
 #define LPOS_DATALESS(lpos)		((lpos) & 1UL)
+#define BLK_DATALESS(blk)		(LPOS_DATALESS((blk)->begin) && \
+					 LPOS_DATALESS((blk)->next))
 
 /* Get the logical position at index 0 of the current wrap. */
 #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
@@ -1021,7 +1023,7 @@ static unsigned int space_used(struct prb_data_ring *data_ring,
 			       struct prb_data_blk_lpos *blk_lpos)
 {
 	/* Data-less blocks take no space. */
-	if (LPOS_DATALESS(blk_lpos->begin))
+	if (BLK_DATALESS(blk_lpos))
 		return 0;
 
 	if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
@@ -1054,7 +1056,7 @@ static const char *get_data(struct prb_data_ring *data_ring,
 	struct prb_data_block *db;
 
 	/* Data-less data block description. */
-	if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) {
+	if (BLK_DATALESS(blk_lpos)) {
 		if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
 			*data_size = 0;
 			return "";
-- 
2.20.1
[PATCH printk v4 6/6] printk: reimplement log_cont using record extension
Use the record extending feature of the ringbuffer to implement continuous messages. This preserves the existing continuous message behavior. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk.c | 98 +- 1 file changed, 20 insertions(+), 78 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 964b5701688f..9a2e23191576 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -535,7 +535,10 @@ static int log_store(u32 caller_id, int facility, int level, r.info->caller_id = caller_id; /* insert message */ - prb_commit(&e); + if ((flags & LOG_CONT) || !(flags & LOG_NEWLINE)) + prb_commit(&e); + else + prb_final_commit(&e); return (text_len + trunc_msg_len); } @@ -1084,7 +1087,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, dest_r.info->ts_nsec = r->info->ts_nsec; dest_r.info->caller_id = r->info->caller_id; - prb_commit(&e); + prb_final_commit(&e); return prb_record_text_space(&e); } @@ -1884,87 +1887,26 @@ static inline u32 printk_caller_id(void) 0x8000 + raw_smp_processor_id(); } -/* - * Continuation lines are buffered, and not committed to the record buffer - * until the line is complete, or a race forces it. The line fragments - * though, are printed immediately to the consoles to ensure everything has - * reached the console in case of a kernel crash. 
- */ -static struct cont { - char buf[LOG_LINE_MAX]; - size_t len; /* length == 0 means unused buffer */ - u32 caller_id; /* printk_caller_id() of first print */ - u64 ts_nsec;/* time of first print */ - u8 level; /* log level of first message */ - u8 facility;/* log facility of first message */ - enum log_flags flags; /* prefix, newline flags */ -} cont; - -static void cont_flush(void) -{ - if (cont.len == 0) - return; - - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; -} - -static bool cont_add(u32 caller_id, int facility, int level, -enum log_flags flags, const char *text, size_t len) -{ - /* If the line gets too long, split it up in separate records. */ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; - } - - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; - - // The original flags come from the first line, - // but later continuations can add a newline. - if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); - } - - return true; -} - static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) { const u32 caller_id = printk_caller_id(); - /* -* If an earlier line was buffered, and we're a continuation -* write from the same context, try to add it to the buffer. 
-*/ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) + if (lflags & LOG_CONT) { + struct prb_reserved_entry e; + struct printk_record r; + + prb_rec_init_wr(&r, text_len, 0); + if (prb_reserve_in_last(&e, prb, &r, caller_id)) { + memcpy(&r.text_buf[r.info->text_len], text, text_len); + r.info->text_len += text_len; + if (lflags & LOG_NEWLINE) { + r.info->flags |= LOG_NEWLINE; + prb_final_commit(&e); +
[PATCH printk v4 1/6] printk: ringbuffer: relocate get_data()
Move the internal get_data() function as-is above prb_reserve() so that a later change can make use of the static function. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk_ringbuffer.c | 116 +++--- 1 file changed, 58 insertions(+), 58 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 0659b50872b5..aa6e31a27601 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -1038,6 +1038,64 @@ static unsigned int space_used(struct prb_data_ring *data_ring, DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin)); } +/* + * Given @blk_lpos, return a pointer to the writer data from the data block + * and calculate the size of the data part. A NULL pointer is returned if + * @blk_lpos specifies values that could never be legal. + * + * This function (used by readers) performs strict validation on the lpos + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is + * triggered if an internal error is detected. + */ +static const char *get_data(struct prb_data_ring *data_ring, + struct prb_data_blk_lpos *blk_lpos, + unsigned int *data_size) +{ + struct prb_data_block *db; + + /* Data-less data block description. */ + if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { + if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { + *data_size = 0; + return ""; + } + return NULL; + } + + /* Regular data block: @begin less than @next and in same wrap. */ + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && + blk_lpos->begin < blk_lpos->next) { + db = to_block(data_ring, blk_lpos->begin); + *data_size = blk_lpos->next - blk_lpos->begin; + + /* Wrapping data block: @begin is one wrap behind @next. 
*/ + } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == + DATA_WRAPS(data_ring, blk_lpos->next)) { + db = to_block(data_ring, 0); + *data_size = DATA_INDEX(data_ring, blk_lpos->next); + + /* Illegal block description. */ + } else { + WARN_ON_ONCE(1); + return NULL; + } + + /* A valid data block will always be aligned to the ID size. */ + if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) || + WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id)))) { + return NULL; + } + + /* A valid data block will always have at least an ID. */ + if (WARN_ON_ONCE(*data_size < sizeof(db->id))) + return NULL; + + /* Subtract block ID space from size to reflect data size. */ + *data_size -= sizeof(db->id); + + return &db->data[0]; +} + /** * prb_reserve() - Reserve space in the ringbuffer. * @@ -1192,64 +1250,6 @@ void prb_commit(struct prb_reserved_entry *e) local_irq_restore(e->irqflags); } -/* - * Given @blk_lpos, return a pointer to the writer data from the data block - * and calculate the size of the data part. A NULL pointer is returned if - * @blk_lpos specifies values that could never be legal. - * - * This function (used by readers) performs strict validation on the lpos - * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is - * triggered if an internal error is detected. - */ -static const char *get_data(struct prb_data_ring *data_ring, - struct prb_data_blk_lpos *blk_lpos, - unsigned int *data_size) -{ - struct prb_data_block *db; - - /* Data-less data block description. */ - if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { - if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { - *data_size = 0; - return ""; - } - return NULL; - } - - /* Regular data block: @begin less than @next and in same wrap. 
*/ - if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && - blk_lpos->begin < blk_lpos->next) { - db = to_block(data_ring, blk_lpos->begin); - *data_size = blk_lpos->next - blk_lpos->begin; - - /* Wrapping data block: @begin is one wrap behind @next. */ - } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == - DATA_WRAPS(data_ring, blk_lpos->next)) { - db = to_block(data_rin
[PATCH printk v4 5/6] printk: ringbuffer: add finalization/extension support
Add support for extending the newest data block. For this, introduce a new finalization state (desc_finalized) denoting a committed descriptor that cannot be extended. Until a record is finalized, a writer can reopen that record to append new data. Reopening a record means transitioning from the desc_committed state back to the desc_reserved state. A writer can explicitly finalize a record if there is no intention of extending it. Also, records are automatically finalized when a new record is reserved. This relieves writers of needing to explicitly finalize while also making such records available to readers sooner. (Readers can only traverse finalized records.) Four new memory barrier pairs are introduced. Two of them are insignificant additions (data_realloc:A/desc_read:D and data_realloc:A/data_push_tail:B) because they are alternate path memory barriers that exactly match the purpose, pairing, and context of the two existing memory barrier pairs they provide an alternate path for. The other two new memory barrier pairs are significant additions: desc_reopen_last:A / _prb_commit:B - When reopening a descriptor, ensure the state transitions back to desc_reserved before fully trusting the descriptor data. _prb_commit:B / desc_reserve:D - When committing a descriptor, ensure the state transitions to desc_committed before checking the head ID to see if the descriptor needs to be finalized. 
Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 3 +- kernel/printk/printk_ringbuffer.c | 541 -- kernel/printk/printk_ringbuffer.h | 6 +- scripts/gdb/linux/dmesg.py| 3 +- 4 files changed, 491 insertions(+), 62 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 8f533b751c46..94fabb165abf 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -297,6 +297,7 @@ end define dmesg # definitions from kernel/printk/printk_ringbuffer.h set var $desc_committed = 1 + set var $desc_finalized = 2 set var $desc_sv_bits = sizeof(long) * 8 set var $desc_flags_shift = $desc_sv_bits - 2 set var $desc_flags_mask = 3 << $desc_flags_shift @@ -313,7 +314,7 @@ define dmesg # skip non-committed record set var $state = 3 & ($desc->state_var.counter >> $desc_flags_shift) - if ($state == $desc_committed) + if ($state == $desc_committed || $state == $desc_finalized) dump_record $desc $prev_flags set var $prev_flags = $desc->info.flags end diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 911fbe150e9a..f1fab8c82819 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -46,20 +46,26 @@ * into a single descriptor field named @state_var, allowing ID and state to * be synchronously and atomically updated. * - * Descriptors have three states: + * Descriptors have four states: * * reserved * A writer is modifying the record. * * committed - * The record and all its data are complete and available for reading. + * The record and all its data are written. A writer can reopen the + * descriptor (transitioning it back to reserved), but in the committed + * state the data is consistent. + * + * finalized + * The record and all its data are complete and available for reading. A + * writer cannot reopen the descriptor. 
* * reusable * The record exists, but its text and/or dictionary data may no longer * be available. * * Querying the @state_var of a record requires providing the ID of the - * descriptor to query. This can yield a possible fourth (pseudo) state: + * descriptor to query. This can yield a possible fifth (pseudo) state: * * miss * The descriptor being queried has an unexpected ID. @@ -79,6 +85,28 @@ * committed or reusable queried state. This makes it possible that a valid * sequence number of the tail is always available. * + * Descriptor Finalization + * ~~~ + * When a writer calls the commit function prb_commit(), record data is + * fully stored and is consistent within the ringbuffer. However, a writer can + * reopen that record, claiming exclusive access (as with prb_reserve()), and + * modify that record. When finished, the writer must again commit the record. + * + * In order for a record to be made available to readers (and also become + * recyclable for writers), it must be finalized. A finalized record cannot be + * reopened and can never become "unfinalized". Record finalization can occur + * in three different scenarios: + * + * 1) A writer can simultaneously commit and finalize its record by c
[PATCH printk v4 4/6] printk: ringbuffer: change representation of states
Rather than deriving the state by evaluating bits within the flags area of the state variable, assign the states explicit values and set those values in the flags area. Introduce macros to make it simple to read and write state values for the state variable. Although the functionality is preserved, the binary representation for the states is changed. Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 12 --- kernel/printk/printk_ringbuffer.c | 28 + kernel/printk/printk_ringbuffer.h | 31 --- scripts/gdb/linux/dmesg.py| 11 --- 4 files changed, 41 insertions(+), 41 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 7adece30237e..8f533b751c46 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -295,9 +295,12 @@ document dump_record end define dmesg - set var $desc_committed = 1UL << ((sizeof(long) * 8) - 1) - set var $flags_mask = 3UL << ((sizeof(long) * 8) - 2) - set var $id_mask = ~$flags_mask + # definitions from kernel/printk/printk_ringbuffer.h + set var $desc_committed = 1 + set var $desc_sv_bits = sizeof(long) * 8 + set var $desc_flags_shift = $desc_sv_bits - 2 + set var $desc_flags_mask = 3 << $desc_flags_shift + set var $id_mask = ~$desc_flags_mask set var $desc_count = 1U << prb->desc_ring.count_bits set var $prev_flags = 0 @@ -309,7 +312,8 @@ define dmesg set var $desc = &prb->desc_ring.descs[$id % $desc_count] # skip non-committed record - if (($desc->state_var.counter & $flags_mask) == $desc_committed) + set var $state = 3 & ($desc->state_var.counter >> $desc_flags_shift) + if ($state == $desc_committed) dump_record $desc $prev_flags set var $prev_flags = $desc->info.flags end diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 82347abb22a5..911fbe150e9a 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -348,14 +348,6 
@@ static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size) return true; } -/* The possible responses of a descriptor state-query. */ -enum desc_state { - desc_miss, /* ID mismatch */ - desc_reserved, /* reserved, in use by writer */ - desc_committed, /* committed, writer is done */ - desc_reusable, /* free, not yet used by any writer */ -}; - /* Query the state of a descriptor. */ static enum desc_state get_desc_state(unsigned long id, unsigned long state_val) @@ -363,13 +355,7 @@ static enum desc_state get_desc_state(unsigned long id, if (id != DESC_ID(state_val)) return desc_miss; - if (state_val & DESC_REUSE_MASK) - return desc_reusable; - - if (state_val & DESC_COMMITTED_MASK) - return desc_committed; - - return desc_reserved; + return DESC_STATE(state_val); } /* @@ -467,8 +453,8 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring, static void desc_make_reusable(struct prb_desc_ring *desc_ring, unsigned long id) { - unsigned long val_committed = id | DESC_COMMITTED_MASK; - unsigned long val_reusable = val_committed | DESC_REUSE_MASK; + unsigned long val_committed = DESC_SV(id, desc_committed); + unsigned long val_reusable = DESC_SV(id, desc_reusable); struct prb_desc *desc = to_desc(desc_ring, id); atomic_long_t *state_var = &desc->state_var; @@ -904,7 +890,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) */ prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */ if (prev_state_val && - prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) { + get_desc_state(id_prev_wrap, prev_state_val) != desc_reusable) { WARN_ON_ONCE(1); return false; } @@ -918,7 +904,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) * This pairs with desc_read:D. 
*/ if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val, -id | 0)) { /* LMM(desc_reserve:F) */ + DESC_SV(id, desc_reserved))) { /* LMM(desc_reserve:F) */ WARN_ON_ONCE(1); return false; } @@ -1237,7 +1223,7 @@ void prb_commit(struct prb_reserved_entry *e) { struct prb_desc_ring *desc_ring
[PATCH printk v4 3/6] printk: ringbuffer: clear initial reserved fields
prb_reserve() will set some meta data values and leave others uninitialized (or rather, containing the values of the previous wrap). Simplify the API by always clearing out all the fields. Only the sequence number is filled in. The caller is now responsible for filling in the rest of the meta data fields. In particular, for correctly filling in text and dict lengths. Signed-off-by: John Ogness --- kernel/printk/printk.c| 12 kernel/printk/printk_ringbuffer.c | 30 ++ 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index fec71229169e..964b5701688f 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -520,8 +520,11 @@ static int log_store(u32 caller_id, int facility, int level, memcpy(&r.text_buf[0], text, text_len); if (trunc_msg_len) memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len); - if (r.dict_buf) + r.info->text_len = text_len + trunc_msg_len; + if (r.dict_buf) { memcpy(&r.dict_buf[0], dict, dict_len); + r.info->dict_len = dict_len; + } r.info->facility = facility; r.info->level = level & 7; r.info->flags = flags & 0x1f; @@ -1069,10 +1072,11 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, if (!prb_reserve(&e, rb, &dest_r)) return 0; - memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size); + memcpy(&dest_r.text_buf[0], &r->text_buf[0], r->info->text_len); + dest_r.info->text_len = r->info->text_len; if (dest_r.dict_buf) { - memcpy(&dest_r.dict_buf[0], &r->dict_buf[0], - dest_r.dict_buf_size); + memcpy(&dest_r.dict_buf[0], &r->dict_buf[0], r->info->dict_len); + dest_r.info->dict_len = r->info->dict_len; } dest_r.info->facility = r->info->facility; dest_r.info->level = r->info->level; diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 6ee5ebce1450..82347abb22a5 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -146,10 +146,13 @@ * * if (prb_reserve(&e, &test_rb, &r)) { * 
snprintf(r.text_buf, r.text_buf_size, "%s", textstr); + * r.info->text_len = strlen(textstr); * * // dictionary allocation may have failed - * if (r.dict_buf) + * if (r.dict_buf) { * snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr); + * r.info->dict_len = strlen(dictstr); + * } * * r.info->ts_nsec = local_clock(); * @@ -1125,9 +1128,9 @@ static const char *get_data(struct prb_data_ring *data_ring, * @dict_buf_size is set to 0. Writers must check this before writing to * dictionary space. * - * @info->text_len and @info->dict_len will already be set to @text_buf_size - * and @dict_buf_size, respectively. If dictionary space reservation fails, - * @info->dict_len is set to 0. + * Important: @info->text_len and @info->dict_len need to be set correctly by + *the writer in order for data to be readable and/or extended. + *Their values are initialized to 0. */ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, struct printk_record *r) @@ -1135,6 +1138,7 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, struct prb_desc_ring *desc_ring = &rb->desc_ring; struct prb_desc *d; unsigned long id; + u64 seq; if (!data_check_size(&rb->text_data_ring, r->text_buf_size)) goto fail; @@ -1159,6 +1163,14 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, d = to_desc(desc_ring, id); + /* +* All @info fields (except @seq) are cleared and must be filled in +* by the writer. Save @seq before clearing because it is used to +* determine the new sequence number. +*/ + seq = d->info.seq; + memset(&d->info, 0, sizeof(d->info)); + /* * Set the @e fields here so that prb_commit() can be used if * text data allocation fails. @@ -1177,17 +1189,15 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, * See the "Bootstrap" comment block in printk_ringbuffer.h for * details about how the initializer bootstraps the descriptors. 
*/ - if (d->info.seq == 0 && DESC_INDEX(desc_ring, id) != 0) + if (seq == 0 && DESC_INDEX(desc_ring, id) !=
[PATCH printk v4 0/6] printk: reimplement LOG_CONT handling
Hello,

Here is v4 for the second series to rework the printk subsystem. (The v3 is here [0].) This series implements a new ringbuffer feature that allows the last record to be extended. Petr Mladek provided the initial proof of concept [1] for this.

Using the record extension feature, LOG_CONT is re-implemented in a way that exactly preserves its behavior, but avoids the need for an extra buffer. In particular, it avoids the need for any synchronization that such a buffer requires.

This series deviates from the agreements [2] made at the meeting during LPC2019 in Lisbon. The test results of the v1 series, which implemented LOG_CONT as agreed upon, showed that the effects on existing userspace tools using /dev/kmsg (journalctl, dmesg) were not acceptable [3].

Patch 5 introduces *four* new memory barrier pairs. Two of them are insignificant additions (data_realloc:A/desc_read:D and data_realloc:A/data_push_tail:B) because they are alternate path memory barriers that exactly match the purpose and context of the two existing memory barrier pairs they provide an alternate path for. The other two new memory barrier pairs are significant additions:

desc_reopen_last:A / _prb_commit:B - When reopening a descriptor, ensure the state transitions back to desc_reserved before fully trusting the descriptor data.

_prb_commit:B / desc_reserve:D - When committing a descriptor, ensure the state transitions to desc_committed before checking the head ID to see if the descriptor needs to be finalized.

The test module used to test the ringbuffer is available here [4].

The series is based on the printk-rework branch of the printk git tree:

e60768311af8 ("scripts/gdb: update for lockless printk ringbuffer")

The list of changes since v3:

printk_ringbuffer
=================

- move enum desc_state definition to printk_ringbuffer.h
- change enum desc_state to define the exact state values used in the state variable
- add DESC_STATE() macro to retrieve the state from the state variable
- add DESC_SV() macro to build a state variable value given an ID and state
- get_desc_state(): simply return the state value rather than processing state flags
- desc_finalized is now a queried state instead of a state flag
- desc_read(): always return a set @state_var, even if the descriptor is in an inconsistent state (desc_reopen_last() relies on this)
- change state logic that tested for desc_committed to now test for desc_finalized, since this is the new state directly preceding desc_reusable
- data_realloc(): add a check if the data block should shrink (and in that case, do not modify the data block, i.e. data blocks will never shrink)
- prb_reserve_in_last(): add WARN_ON for unexpected @text_len value
- prb_reserve(): save a copy of @seq and use memset() to clear @info
- desc_read_committed_seq(): rename function to desc_read_finalized_seq() since desc_finalized is the desired state for readers
- documentation: update state and finalization descriptions

printk.c
========

- use @text_len and @dict_len for memcpy() size

gdb scripts
===========

- update to use new state representation

John Ogness

[0] https://lkml.kernel.org/r/20200831011058.6286-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20200812163908.GH12903@alley
[2] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
[3] https://lkml.kernel.org/r/20200811160551.GC12903@alley
[4] https://github.com/Linutronix/prb-test.git

John Ogness (6):
  printk: ringbuffer: relocate get_data()
  printk: ringbuffer: add BLK_DATALESS() macro
  printk: ringbuffer: clear initial reserved fields
  printk: ringbuffer: change representation of states
  printk: ringbuffer: add finalization/extension support
  printk: reimplement log_cont using record extension

 Documentation/admin-guide/kdump/gdbmacros.txt |  13 +-
 kernel/printk/printk.c                        | 110 +--
 kernel/printk/printk_ringbuffer.c             | 695 ++
 kernel/printk/printk_ringbuffer.h             |  35 +-
 scripts/gdb/linux/dmesg.py                    |  12 +-
 5 files changed, 624 insertions(+), 241 deletions(-)

-- 
2.20.1
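
As a reading aid, the descriptor life cycle that patch 5 describes (reserved, committed, finalized, reusable) can be written down as a small transition check. This is only a sketch of my reading of the commit messages in this series, not code from the ringbuffer: desc_transition_ok() and desc_lifecycle_demo() are hypothetical names, and the real code performs these transitions with cmpxchg on @state_var rather than through a helper like this.

```c
#include <assert.h>
#include <stdbool.h>

/* The four stored states; the "miss" pseudo state is left out here. */
enum desc_state {
	desc_reserved,
	desc_committed,
	desc_finalized,
	desc_reusable,
};

/* Hypothetical helper: is @from -> @to a transition the series describes? */
static bool desc_transition_ok(enum desc_state from, enum desc_state to)
{
	switch (from) {
	case desc_reserved:
		/* prb_commit() or prb_final_commit() */
		return to == desc_committed || to == desc_finalized;
	case desc_committed:
		/* reopened by the writer, or finalized */
		return to == desc_reserved || to == desc_finalized;
	case desc_finalized:
		/* a finalized record can never become "unfinalized" */
		return to == desc_reusable;
	case desc_reusable:
		/* recycled by a writer reserving a new record */
		return to == desc_reserved;
	}
	return false;
}

static void desc_lifecycle_demo(void)
{
	/* LOG_CONT style: commit, reopen to extend, then finalize. */
	assert(desc_transition_ok(desc_reserved, desc_committed));
	assert(desc_transition_ok(desc_committed, desc_reserved));
	assert(desc_transition_ok(desc_reserved, desc_finalized));

	/* Finalized records cannot be reopened. */
	assert(!desc_transition_ok(desc_finalized, desc_reserved));
	assert(!desc_transition_ok(desc_finalized, desc_committed));

	/* Only finalized records become reusable. */
	assert(desc_transition_ok(desc_finalized, desc_reusable));
	assert(!desc_transition_ok(desc_committed, desc_reusable));
}
```

The last assertion mirrors the changelog item above: logic that used to test for desc_committed now tests for desc_finalized, since that is the state directly preceding desc_reusable.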
Re: state names: was: Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
On 2020-09-02, Petr Mladek wrote:
>> +static struct prb_desc *desc_reopen_last(struct prb_desc_ring *desc_ring,
>> +					 u32 caller_id, unsigned long *id_out)
>> +{
>> +	unsigned long prev_state_val;
>> +	enum desc_state d_state;
>> +	struct prb_desc desc;
>> +	struct prb_desc *d;
>> +	unsigned long id;
>> +
>> +	id = atomic_long_read(&desc_ring->head_id);
>> +
>> +	/*
>> +	 * To minimize unnecessarily reopening a descriptor, first check the
>> +	 * descriptor is in the correct state and has a matching caller ID.
>> +	 */
>> +	d_state = desc_read(desc_ring, id, &desc);
>> +	if (d_state != desc_reserved ||
>> +	    !(atomic_long_read(&desc.state_var) & DESC_COMMIT_MASK) ||
>
> This looks like a hack. And similar extra check of the bit is needed
> also in desc_read(), see
> https://lore.kernel.org/r/878sdvq8kd@jogness.linutronix.de

Agreed.

> I has been actually getting less and less happy with the inconsistency
> between names of the bits and states.
>
> ...
>
> First, define 5 desc_states, something like:
>
> enum desc_state {
> 	desc_miss = -1,		/* ID mismatch */
> 	desc_modified = 0x0,	/* reserved, being modified by writer */

I prefer the "desc_reserved" name. It may or may not have been modified yet.

> 	desc_committed = 0x1,	/* committed by writer, could get reopened */
> 	desc_finalized = 0x2,	/* committed, could not longer get modified */
> 	desc_reusable = 0x3,	/* free, not yet used by any writer */
> };
>
> Second, only 4 variants of the 3 state bits are currently used.
> It means that two bits are enough and they might use exactly
> the above names:
>
> I mean to do something like:
>
> #define DESC_SV_BITS		(sizeof(unsigned long) * 8)
> #define DESC_SV(desc_state)	((unsigned long)desc_state << (DESC_SV_BITS - 2))
> #define DESC_ST(state_val)	((unsigned long)state_val >> (DESC_SV_BITS - 2))

This makes sense and will get us back the bit we lost because of finalization.

> I am sorry that I did not came up with this earlier. I know how
> painful it is to rework bigger patchsets. But it affects format
> of the ring buffer, so we should do it early.

Agreed. I am wondering if VMCOREINFO should include a DESC_FLAGS_MASK so that crash tools could at least successfully iterate the IDs, even if they didn't know what all the flag values mean (in the case that more bits are added later).

> PS: I am still middle of review. It looks good so far. I wanted to
> send this early and separately because it is a bigger change.

Thanks for the heads up.

John Ogness
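
To make the proposal above concrete, here is a minimal userspace sketch of packing a 2-bit state into the top of the state variable next to the ID, and of reading it back. It follows the shape of the macros quoted in this thread and the gdb macro changes in the v4 series, but it is an illustration under assumptions, not the kernel header: DESC_FLAGS_SHIFT and DESC_ID_MASK are names chosen here, and desc_state_demo() is a hypothetical helper.

```c
#include <assert.h>

enum desc_state {
	desc_miss      = -1,	/* ID mismatch (pseudo state, never stored) */
	desc_reserved  = 0x0,	/* reserved, in use by writer */
	desc_committed = 0x1,	/* committed by writer, could get reopened */
	desc_finalized = 0x2,	/* committed, can no longer be modified */
	desc_reusable  = 0x3,	/* free, not yet used by any writer */
};

/* The state lives in the two most significant bits of the state value. */
#define DESC_SV_BITS		(sizeof(unsigned long) * 8)
#define DESC_FLAGS_SHIFT	(DESC_SV_BITS - 2)
#define DESC_FLAGS_MASK		(3UL << DESC_FLAGS_SHIFT)
#define DESC_ID_MASK		(~DESC_FLAGS_MASK)

#define DESC_ID(sv)		((sv) & DESC_ID_MASK)
#define DESC_STATE(sv)		(3UL & ((sv) >> DESC_FLAGS_SHIFT))
#define DESC_SV(id, state)	(((unsigned long)(state) << DESC_FLAGS_SHIFT) | (id))

/* Query the state of a descriptor, given the ID the caller expects. */
static enum desc_state get_desc_state(unsigned long id, unsigned long state_val)
{
	if (id != DESC_ID(state_val))
		return desc_miss;

	return (enum desc_state)DESC_STATE(state_val);
}

static void desc_state_demo(void)
{
	unsigned long id = 42;
	unsigned long sv = DESC_SV(id, desc_finalized);

	assert(DESC_ID(sv) == id);
	assert(get_desc_state(id, sv) == desc_finalized);
	assert(get_desc_state(id + 1, sv) == desc_miss);
	assert(get_desc_state(id, DESC_SV(id, desc_reserved)) == desc_reserved);

	/* Knowing only the flags mask is enough to recover the ID. */
	assert((sv & ~DESC_FLAGS_MASK) == id);
}
```

The final assertion is the point behind the VMCOREINFO question: a crash tool that knows only the flags mask can still iterate IDs without understanding every state value.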
Re: [PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
This critical piece was missing from patch 6...

From 0b745d507f0c38e6d1612ed9468aa52845ca025b Mon Sep 17 00:00:00 2001
From: John Ogness
Date: Mon, 31 Aug 2020 14:45:40 +0206
Subject: [PATCH] printk: ringbuffer: allow reading consistent descriptors

desc_read() will fail to read if a descriptor is in the desc_reserved
queried state because such data would be inconsistent. However, since
("printk: ringbuffer: add finalization/extension support") the
desc_reserved state can have the DESC_COMMIT_MASK flag set, in which
case it _is_ consistent. And indeed, desc_reopen_last() is expecting a
read in this case.

Allow desc_read() to read desc_reserved descriptors if the
DESC_COMMIT_MASK flag is set.

Signed-off-by: John Ogness
Reported-by: Andy Lavr
Fixes: ("printk: ringbuffer: add finalization/extension support")
---
 kernel/printk/printk_ringbuffer.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 0731d5e2..6ba7d3fc96f1 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -446,8 +446,10 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 	/* Check the descriptor state. */
 	state_val = atomic_long_read(state_var); /* LMM(desc_read:A) */
 	d_state = get_desc_state(id, state_val);
-	if (d_state != desc_committed && d_state != desc_reusable)
+	if (d_state == desc_miss ||
+	    (d_state == desc_reserved && !(state_val & DESC_COMMIT_MASK))) {
 		return d_state;
+	}
 
 	/*
 	 * Guarantee the state is loaded before copying the descriptor
-- 
2.20.1
[PATCH next v3 6/8] printk: ringbuffer: add finalization/extension support
Add support for extending the newest data block. For this, introduce a new finalization state flag (DESC_FINAL_MASK) that denotes when a descriptor may not be extended, i.e. is finalized. The DESC_COMMIT_MASK is still set when the record data is in a consistent state, i.e. the writer is no longer modifying the record. However, the record remains in the desc_reserved queried state until it is finalized, in which case it transitions to the desc_committed queried state. Until a record is finalized, a writer can reopen that record to append new data. Reopening a record means clearing the DESC_COMMIT_MASK flag. A writer can explicitly finalize a record if there is no intention of extending it. Also, records are automatically finalized when a new record is reserved. This relieves writers of needing to explicitly finalize while also making such records available to readers sooner. (Readers can only traverse finalized records.) Three new memory barrier pairs are introduced. Two of them are not significant because they are alternate path memory barriers that exactly correspond to existing memory barriers. But the third (_prb_commit:B / desc_reserve:D) is new and guarantees that descriptors will always be finalized, either because a descriptor setting DESC_COMMIT_MASK sees that there is a newer descriptor and so finalizes itself or because a new descriptor being reserved sees that the previous descriptor has DESC_COMMIT_MASK set and finalizes that descriptor. Signed-off-by: John Ogness --- kernel/printk/printk_ringbuffer.c | 467 -- kernel/printk/printk_ringbuffer.h | 8 +- 2 files changed, 443 insertions(+), 32 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index da54d4fadf96..0731d5e2 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -49,14 +49,16 @@ * Descriptors have three states: * * reserved - * A writer is modifying the record. + * A writer is modifying the record. 
Internally represented as either "0" + * or "DESC_COMMIT_MASK". * * committed * The record and all its data are complete and available for reading. + * Internally represented as "DESC_COMMIT_MASK | DESC_FINAL_MASK". * * reusable * The record exists, but its text and/or dictionary data may no longer - * be available. + * be available. Internally represented as "DESC_REUSE_MASK". * * Querying the @state_var of a record requires providing the ID of the * descriptor to query. This can yield a possible fourth (pseudo) state: @@ -79,6 +81,25 @@ * committed or reusable queried state. This makes it possible that a valid * sequence number of the tail is always available. * + * Descriptor Finalization + * ~~~ + * When a writer calls the commit function prb_commit(), the record may still + * continue to be in the reserved queried state. In order for that record to + * enter into the committed queried state, that record also must be finalized. + * A record can be finalized by three different scenarios: + * + * 1) A writer can finalize its record immediately by calling + * prb_final_commit() instead of prb_commit(). + * + * 2) When a new record is reserved and the previous record has been + * committed via prb_commit(), that previous record is finalized. + * + * 3) When a record is committed via prb_commit() and a newer record + * already exists, the record being committed is finalized. + * + * Until a record is finalized (represented by "DESC_FINAL_MASK"), a writer + * may "reopen" that record and extend it with more data. + * * Data Rings * ~~ * The two data rings (text and dictionary) function identically. They exist @@ -156,9 +177,38 @@ * * r.info->ts_nsec = local_clock(); * + * prb_final_commit(&e); + * } + * + * Note that additional writer functions are available to extend a record + * after it has been committed but not yet finalized. This can be done as + * long as no new records have been reserved and the caller is the same. 
+ * + * Sample writer code (record extending):: + * + * // alternate rest of previous example + * r.info->ts_nsec = local_clock(); + * r.info->text_len = strlen(textstr); + * r.info->caller_id = printk_caller_id(); + * + * // commit the record (but do not finalize yet) * prb_commit(&e); * } * + * ... + * + * // specify additional 5 bytes text space to extend + * prb_rec_init_wr(&r, 5, 0); + * + * if (prb_reserve_in_last(&e, &test_rb, &r, printk_caller_id())) { + * snprintf(&r.text_buf[r.info->text_len], + * r.text_buf_size - r.info->text_len, "hello"); + * + * r.info->text_len +=
[PATCH next v3 7/8] printk: reimplement log_cont using record extension
Use the record extending feature of the ringbuffer to implement continuous messages. This preserves the existing continuous message behavior. Signed-off-by: John Ogness --- kernel/printk/printk.c | 98 +- 1 file changed, 20 insertions(+), 78 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 7e7d596c8878..d0b2bea1fd81 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -535,7 +535,10 @@ static int log_store(u32 caller_id, int facility, int level, r.info->caller_id = caller_id; /* insert message */ - prb_commit(&e); + if ((flags & LOG_CONT) || !(flags & LOG_NEWLINE)) + prb_commit(&e); + else + prb_final_commit(&e); return (text_len + trunc_msg_len); } @@ -1093,7 +1096,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, dest_r.info->ts_nsec = r->info->ts_nsec; dest_r.info->caller_id = r->info->caller_id; - prb_commit(&e); + prb_final_commit(&e); return prb_record_text_space(&e); } @@ -1893,87 +1896,26 @@ static inline u32 printk_caller_id(void) 0x8000 + raw_smp_processor_id(); } -/* - * Continuation lines are buffered, and not committed to the record buffer - * until the line is complete, or a race forces it. The line fragments - * though, are printed immediately to the consoles to ensure everything has - * reached the console in case of a kernel crash. 
- */ -static struct cont { - char buf[LOG_LINE_MAX]; - size_t len; /* length == 0 means unused buffer */ - u32 caller_id; /* printk_caller_id() of first print */ - u64 ts_nsec;/* time of first print */ - u8 level; /* log level of first message */ - u8 facility;/* log facility of first message */ - enum log_flags flags; /* prefix, newline flags */ -} cont; - -static void cont_flush(void) -{ - if (cont.len == 0) - return; - - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; -} - -static bool cont_add(u32 caller_id, int facility, int level, -enum log_flags flags, const char *text, size_t len) -{ - /* If the line gets too long, split it up in separate records. */ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; - } - - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; - - // The original flags come from the first line, - // but later continuations can add a newline. - if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); - } - - return true; -} - static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) { const u32 caller_id = printk_caller_id(); - /* -* If an earlier line was buffered, and we're a continuation -* write from the same context, try to add it to the buffer. 
-*/ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) + if (lflags & LOG_CONT) { + struct prb_reserved_entry e; + struct printk_record r; + + prb_rec_init_wr(&r, text_len, 0); + if (prb_reserve_in_last(&e, prb, &r, caller_id)) { + memcpy(&r.text_buf[r.info->text_len], text, text_len); + r.info->text_len += text_len; + if (lflags & LOG_NEWLINE) { + r.info->flags |= LOG_NEWLINE; + prb_final_commit(&e); + } else { +
[PATCH next v3 3/8] printk: ringbuffer: relocate get_data()
Move the internal get_data() function as-is above prb_reserve() so that a later change can make use of the static function. Signed-off-by: John Ogness Reviewed-by: Petr Mladek --- kernel/printk/printk_ringbuffer.c | 116 +++--- 1 file changed, 58 insertions(+), 58 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index d339ff7647da..86af38c2cf77 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -1038,6 +1038,64 @@ static unsigned int space_used(struct prb_data_ring *data_ring, DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin)); } +/* + * Given @blk_lpos, return a pointer to the writer data from the data block + * and calculate the size of the data part. A NULL pointer is returned if + * @blk_lpos specifies values that could never be legal. + * + * This function (used by readers) performs strict validation on the lpos + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is + * triggered if an internal error is detected. + */ +static const char *get_data(struct prb_data_ring *data_ring, + struct prb_data_blk_lpos *blk_lpos, + unsigned int *data_size) +{ + struct prb_data_block *db; + + /* Data-less data block description. */ + if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { + if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { + *data_size = 0; + return ""; + } + return NULL; + } + + /* Regular data block: @begin less than @next and in same wrap. */ + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && + blk_lpos->begin < blk_lpos->next) { + db = to_block(data_ring, blk_lpos->begin); + *data_size = blk_lpos->next - blk_lpos->begin; + + /* Wrapping data block: @begin is one wrap behind @next. 
*/ + } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == + DATA_WRAPS(data_ring, blk_lpos->next)) { + db = to_block(data_ring, 0); + *data_size = DATA_INDEX(data_ring, blk_lpos->next); + + /* Illegal block description. */ + } else { + WARN_ON_ONCE(1); + return NULL; + } + + /* A valid data block will always be aligned to the ID size. */ + if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) || + WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id { + return NULL; + } + + /* A valid data block will always have at least an ID. */ + if (WARN_ON_ONCE(*data_size < sizeof(db->id))) + return NULL; + + /* Subtract block ID space from size to reflect data size. */ + *data_size -= sizeof(db->id); + + return &db->data[0]; +} + /** * prb_reserve() - Reserve space in the ringbuffer. * @@ -1192,64 +1250,6 @@ void prb_commit(struct prb_reserved_entry *e) local_irq_restore(e->irqflags); } -/* - * Given @blk_lpos, return a pointer to the writer data from the data block - * and calculate the size of the data part. A NULL pointer is returned if - * @blk_lpos specifies values that could never be legal. - * - * This function (used by readers) performs strict validation on the lpos - * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is - * triggered if an internal error is detected. - */ -static const char *get_data(struct prb_data_ring *data_ring, - struct prb_data_blk_lpos *blk_lpos, - unsigned int *data_size) -{ - struct prb_data_block *db; - - /* Data-less data block description. */ - if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { - if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { - *data_size = 0; - return ""; - } - return NULL; - } - - /* Regular data block: @begin less than @next and in same wrap. 
*/ - if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && - blk_lpos->begin < blk_lpos->next) { - db = to_block(data_ring, blk_lpos->begin); - *data_size = blk_lpos->next - blk_lpos->begin; - - /* Wrapping data block: @begin is one wrap behind @next. */ - } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == - DATA_WRAPS(data_ring, blk_lpos->next)) { - db = to_block(data_rin
[PATCH next v3 0/8] printk: reimplement LOG_CONT handling
Hello,

Here is v3 for the second series to rework the printk subsystem. (The
v2 is here [0].) This series implements a new ringbuffer feature that
allows the last record to be extended. Petr Mladek provided the initial
proof of concept [1] for this.

Using the record extension feature, LOG_CONT is re-implemented in a way
that exactly preserves its behavior, but avoids the need for an extra
buffer. In particular, it avoids the need for any synchronization that
such a buffer requires.

This series deviates from the agreements [2] made at the meeting during
LPC2019 in Lisbon. The test results of the v1 series, which implemented
LOG_CONT as agreed upon, showed that the effects on existing userspace
tools using /dev/kmsg (journalctl, dmesg) were not acceptable [3].

The main difference to v2 is the implementation of the new descriptor
finalization. For v3 the implementation closely follows the example [4]
from Petr Mladek.

Patch 6 introduces *four* new memory barrier pairs. Two of them are
insignificant additions (data_realloc:A/desc_read:D and
data_realloc:A/data_push_tail:B) because they are alternate path memory
barriers that exactly match the purpose and context of the two existing
memory barrier pairs they provide an alternate path for. The other two
new memory barrier pairs are significant additions:

desc_reopen_last:A / _prb_commit:B - When reopening a descriptor, ensure
    the commit flag is removed before fully trusting the descriptor
    data.

_prb_commit:B / desc_reserve:D - When committing a descriptor, ensure
    the commit flag is set before checking the head ID to see if the
    finalize flag should be set.

Patch 8 assumes the gdb script series [5] for the new printk ringbuffer
has been applied.

The test module used to test the ringbuffer is available here [6].

The series is based on next-20200828.

The list of changes since v2:

printk_ringbuffer
=================

- prb_commit(): finalize self if no longer the head

- prb_reserve(): clear @info fields on success

- prb_reserve(): do not finalize the -1 placeholder descriptor

- desc_make_final(): renamed from desc_finalize()

- desc_make_final(): remove loop, change to single shot attempt

- prb_reserve_in_last(): renamed from prb_reserve_last()

- prb_reserve_in_last(): add new fail goto target

- prb_reserve_in_last(): fix logic for calculating @text_buf_size and
  add size check

- desc_reopen_last(): add extra caller ID check before reopening

- desc_reopen_last(): change cmpxchg() to full memory barrier

- get_desc_state(): remove unneeded @is_final argument

- documentation: update finalization, sample code, and memory barrier
  list

printk.c
========

- set @text_len and @dict_len as required by prb_reserve() change

John Ogness

[0] https://lkml.kernel.org/r/20200824103538.31446-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20200812163908.GH12903@alley
[2] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
[3] https://lkml.kernel.org/r/20200811160551.GC12903@alley
[4] https://lkml.kernel.org/r/20200827151710.GB4928@alley
[5] https://lkml.kernel.org/r/CAHk-=wj_b6Bh=d-wwh0xyqoqbhhkyeexhszkpxdra6gjtvk...@mail.gmail.com
[6] https://lkml.kernel.org/r/20200814212525.6118-1-john.ogn...@linutronix.de
[7] https://github.com/Linutronix/prb-test.git

John Ogness (8):
  printk: ringbuffer: rename DESC_COMMITTED_MASK flag
  printk: ringbuffer: change representation of reusable
  printk: ringbuffer: relocate get_data()
  printk: ringbuffer: add BLK_DATALESS() macro
  printk: ringbuffer: clear initial reserved fields
  printk: ringbuffer: add finalization/extension support
  printk: reimplement log_cont using record extension
  scripts/gdb: support printk finalized records

 Documentation/admin-guide/kdump/gdbmacros.txt |  10 +-
 kernel/printk/printk.c                        | 105 +--
 kernel/printk/printk_ringbuffer.c             | 604 +++---
 kernel/printk/printk_ringbuffer.h             |  12 +-
 scripts/gdb/linux/dmesg.py                    |  10 +-
 5 files changed, 558 insertions(+), 183 deletions(-)

--
2.20.1

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
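The _prb_commit:B / desc_reserve:D pairing described in the cover letter can be illustrated with a toy userspace model. This is only a sketch under stated assumptions, not the kernel implementation: it uses C11 sequentially consistent atomics instead of the kernel's cmpxchg and explicit barriers, it never wraps the descriptor array, and every toy_* name is hypothetical. It shows the two cooperating paths: a reserver finalizing its predecessor, and a committer finalizing itself once it sees that the head has moved.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SV_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define SV_COMMIT	(1UL << (SV_BITS - 1))
#define SV_FINAL	(1UL << (SV_BITS - 2))
#define TOY_DESCS	8UL	/* toy only: no wrapping, usable IDs are 1..7 */

struct toy_ring {
	atomic_ulong head_id;
	atomic_ulong state[TOY_DESCS];
};

void toy_init(struct toy_ring *r)
{
	unsigned long i;

	atomic_init(&r->head_id, 0UL);
	/* ID 0 stands in for an already finalized placeholder. */
	atomic_init(&r->state[0], 0UL | SV_COMMIT | SV_FINAL);
	for (i = 1; i < TOY_DESCS; i++)
		atomic_init(&r->state[i], 0UL);
}

/* Single-shot attempt: only a committed, not yet final descriptor changes. */
bool toy_make_final(struct toy_ring *r, unsigned long id)
{
	unsigned long expected = id | SV_COMMIT;

	return atomic_compare_exchange_strong(&r->state[id], &expected,
					      id | SV_COMMIT | SV_FINAL);
}

bool toy_reserve(struct toy_ring *r, unsigned long *id_out)
{
	unsigned long head = atomic_load(&r->head_id);

	do {
		if (head + 1 >= TOY_DESCS)
			return false;
	} while (!atomic_compare_exchange_weak(&r->head_id, &head, head + 1));

	atomic_store(&r->state[head + 1], head + 1);	/* reserved: id | 0 */
	toy_make_final(r, head);	/* finalize predecessor if committed */
	*id_out = head + 1;
	return true;
}

bool toy_commit(struct toy_ring *r, unsigned long id)
{
	unsigned long expected = id;	/* reserved: id | 0 */

	if (!atomic_compare_exchange_strong(&r->state[id], &expected,
					    id | SV_COMMIT))
		return false;

	/* Commit flag is set first; only then is the head checked. */
	if (atomic_load(&r->head_id) != id)
		toy_make_final(r, id);
	return true;
}

bool toy_is_final(struct toy_ring *r, unsigned long id)
{
	return (atomic_load(&r->state[id]) & SV_FINAL) != 0;
}

void toy_demo(void)
{
	struct toy_ring r;
	unsigned long a, b, c;

	toy_init(&r);
	assert(toy_reserve(&r, &a) && toy_commit(&r, a));
	assert(!toy_is_final(&r, a));	/* still the head: may be reopened */
	assert(toy_reserve(&r, &b));	/* reserver finalizes @a */
	assert(toy_is_final(&r, a));
	assert(toy_reserve(&r, &c));	/* @b not committed: left alone */
	assert(!toy_is_final(&r, b));
	assert(toy_commit(&r, b));	/* @b sees newer head, finalizes itself */
	assert(toy_is_final(&r, b));
}
```

Whichever side runs second observes the other side's write, so a committed record that is no longer the head ends up finalized by one of the two paths. The toy gets this from the default sequentially consistent ordering; how the real code achieves it is described by the memory barrier pairs listed above.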
[PATCH next v3 1/8] printk: ringbuffer: rename DESC_COMMITTED_MASK flag
Upcoming ringbuffer support for continuous lines will allow records
with DESC_COMMITTED_MASK set to be reopened. As a result, the flag will
no longer describe the final committed state. Rename it to
DESC_COMMIT_MASK as a preparation step.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk_ringbuffer.c | 8
 kernel/printk/printk_ringbuffer.h | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 0659b50872b5..76248c82d557 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -361,7 +361,7 @@ static enum desc_state get_desc_state(unsigned long id,
 	if (state_val & DESC_REUSE_MASK)
 		return desc_reusable;
 
-	if (state_val & DESC_COMMITTED_MASK)
+	if (state_val & DESC_COMMIT_MASK)
 		return desc_committed;
 
 	return desc_reserved;
@@ -462,7 +462,7 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring,
 static void desc_make_reusable(struct prb_desc_ring *desc_ring,
 			       unsigned long id)
 {
-	unsigned long val_committed = id | DESC_COMMITTED_MASK;
+	unsigned long val_committed = id | DESC_COMMIT_MASK;
 	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
 	struct prb_desc *desc = to_desc(desc_ring, id);
 	atomic_long_t *state_var = &desc->state_var;
@@ -899,7 +899,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
 	 */
 	prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
 	if (prev_state_val &&
-	    prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) {
+	    prev_state_val != (id_prev_wrap | DESC_COMMIT_MASK | DESC_REUSE_MASK)) {
 		WARN_ON_ONCE(1);
 		return false;
 	}
@@ -1184,7 +1184,7 @@ void prb_commit(struct prb_reserved_entry *e)
 	 * this. This pairs with desc_read:B.
 	 */
 	if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val,
-				     e->id | DESC_COMMITTED_MASK)) { /* LMM(prb_commit:B) */
+				     e->id | DESC_COMMIT_MASK)) { /* LMM(prb_commit:B) */
 		WARN_ON_ONCE(1);
 	}
 
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index e6302da041f9..dcda5e9b4676 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -115,9 +115,9 @@ struct prb_reserved_entry {
 #define _DATA_SIZE(sz_bits)	(1UL << (sz_bits))
 #define _DESCS_COUNT(ct_bits)	(1U << (ct_bits))
 #define DESC_SV_BITS		(sizeof(unsigned long) * 8)
-#define DESC_COMMITTED_MASK	(1UL << (DESC_SV_BITS - 1))
+#define DESC_COMMIT_MASK	(1UL << (DESC_SV_BITS - 1))
 #define DESC_REUSE_MASK		(1UL << (DESC_SV_BITS - 2))
-#define DESC_FLAGS_MASK		(DESC_COMMITTED_MASK | DESC_REUSE_MASK)
+#define DESC_FLAGS_MASK		(DESC_COMMIT_MASK | DESC_REUSE_MASK)
 #define DESC_ID_MASK		(~DESC_FLAGS_MASK)
 #define DESC_ID(sv)		((sv) & DESC_ID_MASK)
 #define FAILED_LPOS		0x1
@@ -213,7 +213,7 @@ struct prb_reserved_entry {
  */
 #define BLK0_LPOS(sz_bits)	(-(_DATA_SIZE(sz_bits)))
 #define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
-#define DESC0_SV(ct_bits)	(DESC_COMMITTED_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
+#define DESC0_SV(ct_bits)	(DESC_COMMIT_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
 
 /*
  * Define a ringbuffer with an external text data buffer. The same as
--
2.20.1
[PATCH next v3 8/8] scripts/gdb: support printk finalized records
With commit ("printk: ringbuffer: add finalization/extension support")
a new state bit for finalized records was added. This not only changed
the bit representation of committed records, but also reduced the size
for record IDs.

Update the gdb scripts to correctly interpret the state variable.

Signed-off-by: John Ogness
---
 Documentation/admin-guide/kdump/gdbmacros.txt | 10 +++---
 scripts/gdb/linux/dmesg.py                    | 10 ++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
index 7adece30237e..bcb78368b381 100644
--- a/Documentation/admin-guide/kdump/gdbmacros.txt
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -295,8 +295,11 @@ document dump_record
 end
 
 define dmesg
-	set var $desc_committed = 1UL << ((sizeof(long) * 8) - 1)
-	set var $flags_mask = 3UL << ((sizeof(long) * 8) - 2)
+	# definitions from kernel/printk/printk_ringbuffer.h
+	set var $desc_commit = 1UL << ((sizeof(long) * 8) - 1)
+	set var $desc_final = 1UL << ((sizeof(long) * 8) - 2)
+	set var $desc_reuse = 1UL << ((sizeof(long) * 8) - 3)
+	set var $flags_mask = $desc_commit | $desc_final | $desc_reuse
 	set var $id_mask = ~$flags_mask
 
 	set var $desc_count = 1U << prb->desc_ring.count_bits
@@ -309,7 +312,8 @@ define dmesg
 		set var $desc = &prb->desc_ring.descs[$id % $desc_count]
 
 		# skip non-committed record
-		if (($desc->state_var.counter & $flags_mask) == $desc_committed)
+		# (note that commit+!final records will be displayed)
+		if (($desc->state_var.counter & $desc_commit) == $desc_commit)
 			dump_record $desc $prev_flags
 			set var $prev_flags = $desc->info.flags
 		end
diff --git a/scripts/gdb/linux/dmesg.py b/scripts/gdb/linux/dmesg.py
index 6c6022012ea8..367523c5c270 100644
--- a/scripts/gdb/linux/dmesg.py
+++ b/scripts/gdb/linux/dmesg.py
@@ -79,9 +79,10 @@ class LxDmesg(gdb.Command):
 
         # definitions from kernel/printk/printk_ringbuffer.h
         desc_sv_bits = utils.get_long_type().sizeof * 8
-        desc_committed_mask = 1 << (desc_sv_bits - 1)
-        desc_reuse_mask = 1 << (desc_sv_bits - 2)
-        desc_flags_mask = desc_committed_mask | desc_reuse_mask
+        desc_commit_mask = 1 << (desc_sv_bits - 1)
+        desc_final_mask = 1 << (desc_sv_bits - 2)
+        desc_reuse_mask = 1 << (desc_sv_bits - 3)
+        desc_flags_mask = desc_commit_mask | desc_final_mask | desc_reuse_mask
         desc_id_mask = ~desc_flags_mask
 
         # read in tail and head descriptor ids
@@ -96,8 +97,9 @@ class LxDmesg(gdb.Command):
             desc_off = desc_sz * ind
 
             # skip non-committed record
+            # (note that commit+!final records will be displayed)
             state = utils.read_u64(descs, desc_off + sv_off + counter_off) & desc_flags_mask
-            if state != desc_committed_mask:
+            if state & desc_commit_mask != desc_commit_mask:
                 if did == head_id:
                     break
                 did = (did + 1) & desc_id_mask
--
2.20.1
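The effect of the third flag bit on the scripts can be sketched in plain C. This is a userspace illustration that only assumes the bit layout given in the patch above (commit, final, reuse in the top three bits); the SV_* names and the helper functions are made up for this sketch and exist in neither the kernel nor the gdb scripts. It shows the display rule (commit bit set, final bit ignored) and why the old two-bit flags mask would now misread record IDs.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Bit layout as stated in the patch above (userspace copy). */
#define SV_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define SV_COMMIT	(1UL << (SV_BITS - 1))
#define SV_FINAL	(1UL << (SV_BITS - 2))
#define SV_REUSE	(1UL << (SV_BITS - 3))
#define SV_FLAGS	(SV_COMMIT | SV_FINAL | SV_REUSE)
#define SV_ID_MASK	(~SV_FLAGS)

/* Hypothetical helper: same test as the updated scripts. */
bool sv_is_displayed(unsigned long sv)
{
	return (sv & SV_COMMIT) == SV_COMMIT;
}

/* Hypothetical helper: record ID with all three flag bits stripped. */
unsigned long sv_id(unsigned long sv)
{
	return sv & SV_ID_MASK;
}

void sv_demo(void)
{
	/* The mask the scripts used before this patch (two flag bits). */
	unsigned long old_id_mask = ~(3UL << (SV_BITS - 2));
	unsigned long reusable = 5UL | SV_REUSE;

	/* New mask recovers the ID; the old one leaks the reuse bit. */
	assert(sv_id(reusable) == 5UL);
	assert((reusable & old_id_mask) != 5UL);

	/* Committed records are shown whether or not they are final. */
	assert(sv_is_displayed(5UL | SV_COMMIT));
	assert(sv_is_displayed(5UL | SV_COMMIT | SV_FINAL));
	assert(!sv_is_displayed(reusable));
	assert(!sv_is_displayed(5UL));
}
```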
[PATCH next v3 2/8] printk: ringbuffer: change representation of reusable
The reusable queried state is represented by the combined flags:

    DESC_COMMIT_MASK | DESC_REUSE_MASK

There is no reason for the DESC_COMMIT_MASK to be part of that
representation. In particular, this will add confusion when more state
flags are available.

Change the representation of the reusable queried state to just the
DESC_REUSE_MASK flag.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk_ringbuffer.c | 4 ++--
 kernel/printk/printk_ringbuffer.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 76248c82d557..d339ff7647da 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -463,7 +463,7 @@ static void desc_make_reusable(struct prb_desc_ring *desc_ring,
 			       unsigned long id)
 {
 	unsigned long val_committed = id | DESC_COMMIT_MASK;
-	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
+	unsigned long val_reusable = id | DESC_REUSE_MASK;
 	struct prb_desc *desc = to_desc(desc_ring, id);
 	atomic_long_t *state_var = &desc->state_var;
 
@@ -899,7 +899,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
 	 */
 	prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
 	if (prev_state_val &&
-	    prev_state_val != (id_prev_wrap | DESC_COMMIT_MASK | DESC_REUSE_MASK)) {
+	    get_desc_state(id_prev_wrap, prev_state_val) != desc_reusable) {
 		WARN_ON_ONCE(1);
 		return false;
 	}
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index dcda5e9b4676..96ef997d7bd6 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -213,7 +213,7 @@ struct prb_reserved_entry {
  */
 #define BLK0_LPOS(sz_bits)	(-(_DATA_SIZE(sz_bits)))
 #define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
-#define DESC0_SV(ct_bits)	(DESC_COMMIT_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
+#define DESC0_SV(ct_bits)	(DESC_REUSE_MASK | DESC0_ID(ct_bits))
 
 /*
  * Define a ringbuffer with an external text data buffer. The same as
--
2.20.1
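A small userspace sketch may help show why dropping DESC_COMMIT_MASK from the reusable representation does not disturb state classification. It assumes the check order visible in the get_desc_state() hunk of patch 1/8 (reuse flag first, then commit flag). The ID-mismatch check, the toy_miss name and the toy_ prefixed identifiers are assumptions of this sketch, not code taken from the series.

```c
#include <assert.h>
#include <limits.h>

/* Two-flag layout from printk_ringbuffer.h at this point in the series. */
#define DESC_SV_BITS		(sizeof(unsigned long) * CHAR_BIT)
#define DESC_COMMIT_MASK	(1UL << (DESC_SV_BITS - 1))
#define DESC_REUSE_MASK		(1UL << (DESC_SV_BITS - 2))
#define DESC_FLAGS_MASK		(DESC_COMMIT_MASK | DESC_REUSE_MASK)
#define DESC_ID_MASK		(~DESC_FLAGS_MASK)
#define DESC_ID(sv)		((sv) & DESC_ID_MASK)

enum toy_desc_state { toy_miss, toy_reserved, toy_committed, toy_reusable };

/* Toy classification: the reuse flag is tested before the commit flag. */
enum toy_desc_state toy_get_desc_state(unsigned long id,
				       unsigned long state_val)
{
	if (id != DESC_ID(state_val))
		return toy_miss;	/* assumed pseudo state for a wrong ID */
	if (state_val & DESC_REUSE_MASK)
		return toy_reusable;
	if (state_val & DESC_COMMIT_MASK)
		return toy_committed;
	return toy_reserved;
}

void toy_state_demo(void)
{
	unsigned long id = 7UL;

	/* Old and new reusable representations classify identically. */
	assert(toy_get_desc_state(id, id | DESC_COMMIT_MASK | DESC_REUSE_MASK)
	       == toy_reusable);
	assert(toy_get_desc_state(id, id | DESC_REUSE_MASK) == toy_reusable);

	assert(toy_get_desc_state(id, id | DESC_COMMIT_MASK) == toy_committed);
	assert(toy_get_desc_state(id, id) == toy_reserved);
	assert(toy_get_desc_state(id + 1, id | DESC_COMMIT_MASK) == toy_miss);
}
```

Because the reuse flag is tested first, comparing the queried state against reusable (as the desc_reserve() hunk above now does) is independent of whether the commit flag is also present.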
[PATCH next v3 5/8] printk: ringbuffer: clear initial reserved fields
prb_reserve() will set some meta data values and leave others uninitialized (or rather, containing the values of the previous wrap). Simplify the API by always clearing out all the fields. Only the sequence number is filled in. The caller is now responsible for filling in the rest of the meta data fields. In particular, for correctly filling in text and dict lengths. Signed-off-by: John Ogness --- kernel/printk/printk.c| 7 ++- kernel/printk/printk_ringbuffer.c | 29 +++-- 2 files changed, 25 insertions(+), 11 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index ad8d1dfe5fbe..7e7d596c8878 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -520,8 +520,11 @@ static int log_store(u32 caller_id, int facility, int level, memcpy(&r.text_buf[0], text, text_len); if (trunc_msg_len) memcpy(&r.text_buf[text_len], trunc_msg, trunc_msg_len); - if (r.dict_buf) + r.info->text_len = text_len + trunc_msg_len; + if (r.dict_buf) { memcpy(&r.dict_buf[0], dict, dict_len); + r.info->dict_len = dict_len; + } r.info->facility = facility; r.info->level = level & 7; r.info->flags = flags & 0x1f; @@ -1078,9 +1081,11 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, return 0; memcpy(&dest_r.text_buf[0], &r->text_buf[0], dest_r.text_buf_size); + dest_r.info->text_len = r->info->text_len; if (dest_r.dict_buf) { memcpy(&dest_r.dict_buf[0], &r->dict_buf[0], dest_r.dict_buf_size); + dest_r.info->dict_len = r->info->dict_len; } dest_r.info->facility = r->info->facility; dest_r.info->level = r->info->level; diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index d66718e74aae..da54d4fadf96 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -146,10 +146,13 @@ * * if (prb_reserve(&e, &test_rb, &r)) { * snprintf(r.text_buf, r.text_buf_size, "%s", textstr); + * r.info->text_len = strlen(textstr); * * // dictionary allocation may have failed - * if (r.dict_buf) + * if 
(r.dict_buf) { * snprintf(r.dict_buf, r.dict_buf_size, "%s", dictstr); + * r.info->dict_len = strlen(dictstr); + * } * * r.info->ts_nsec = local_clock(); * @@ -1125,9 +1128,9 @@ static const char *get_data(struct prb_data_ring *data_ring, * @dict_buf_size is set to 0. Writers must check this before writing to * dictionary space. * - * @info->text_len and @info->dict_len will already be set to @text_buf_size - * and @dict_buf_size, respectively. If dictionary space reservation fails, - * @info->dict_len is set to 0. + * Important: @info->text_len and @info->dict_len need to be set correctly by + *the writer in order for data to be readable and/or extended. + *Their values are initialized to 0. */ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, struct printk_record *r) @@ -1159,6 +1162,18 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, d = to_desc(desc_ring, id); + /* +* Clear all @info fields except for @seq, which is used to determine +* the new sequence number. The writer must fill in new values. +*/ + d->info.ts_nsec = 0; + d->info.text_len = 0; + d->info.dict_len = 0; + d->info.facility = 0; + d->info.flags = 0; + d->info.level = 0; + d->info.caller_id = 0; + /* * Set the @e fields here so that prb_commit() can be used if * text data allocation fails. @@ -1186,8 +1201,6 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, &d->text_blk_lpos, id); /* If text data allocation fails, a data-less record is committed. */ if (r->text_buf_size && !r->text_buf) { - d->info.text_len = 0; - d->info.dict_len = 0; prb_commit(e); /* prb_commit() re-enabled interrupts. */ goto fail; @@ -1204,10 +1217,6 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, r->info = &d->info; - /* Set default values for the sizes. */ - d->info.text_len = r->text_buf_size; - d->info.dict_len = r->dict_buf_size; - /* Record full text space used by record. 
*/ e->text_space = space_used(&rb->text_data_ring, &d->text_blk_lpos); -- 2.20.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH next v3 4/8] printk: ringbuffer: add BLK_DATALESS() macro
Rather than continually needing to explicitly check @begin and @next to
identify a dataless block, introduce and use a BLK_DATALESS() macro.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk_ringbuffer.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 86af38c2cf77..d66718e74aae 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -266,6 +266,8 @@
 
 /* Determine if a logical position refers to a data-less block. */
 #define LPOS_DATALESS(lpos)		((lpos) & 1UL)
+#define BLK_DATALESS(blk)		(LPOS_DATALESS((blk)->begin) && \
+					 LPOS_DATALESS((blk)->next))
 
 /* Get the logical position at index 0 of the current wrap. */
 #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
@@ -1021,7 +1023,7 @@ static unsigned int space_used(struct prb_data_ring *data_ring,
 				struct prb_data_blk_lpos *blk_lpos)
 {
 	/* Data-less blocks take no space. */
-	if (LPOS_DATALESS(blk_lpos->begin))
+	if (BLK_DATALESS(blk_lpos))
 		return 0;
 
 	if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
@@ -1054,7 +1056,7 @@ static const char *get_data(struct prb_data_ring *data_ring,
 	struct prb_data_block *db;
 
 	/* Data-less data block description. */
-	if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) {
+	if (BLK_DATALESS(blk_lpos)) {
 		if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
 			*data_size = 0;
 			return "";
--
2.20.1
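As a quick illustration of the macro, the following userspace sketch copies the two macros from the patch and applies them to a reduced struct that carries only @begin and @next. The struct is trimmed for the example, the sample lpos values are arbitrary apart from 0x1 (FAILED_LPOS in the header), and the toy_blk_dataless() and blk_demo() helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Macros as introduced by the patch above. */
#define LPOS_DATALESS(lpos)	((lpos) & 1UL)
#define BLK_DATALESS(blk)	(LPOS_DATALESS((blk)->begin) && \
				 LPOS_DATALESS((blk)->next))

/* Reduced copy holding only the fields the macro touches. */
struct prb_data_blk_lpos {
	unsigned long begin;
	unsigned long next;
};

/* Hypothetical wrapper so the macro result can be used as a bool. */
bool toy_blk_dataless(const struct prb_data_blk_lpos *blk)
{
	return BLK_DATALESS(blk);
}

void blk_demo(void)
{
	struct prb_data_blk_lpos dataless = { .begin = 0x1, .next = 0x1 };
	struct prb_data_blk_lpos regular = { .begin = 64, .next = 96 };
	/* Not expected in practice; shows that both fields are checked. */
	struct prb_data_blk_lpos mixed = { .begin = 0x1, .next = 96 };

	assert(toy_blk_dataless(&dataless));
	assert(!toy_blk_dataless(&regular));
	assert(!toy_blk_dataless(&mixed));
}
```

Note that space_used() previously tested only @begin, so with the macro it now applies the same two-field test as get_data().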
Re: [PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
On 2020-08-28, Petr Mladek wrote: >> Below is a patch against this series that adds support for finalizing >> all 4 queried states. It passes all my tests. Note that the code handles >> 2 corner cases: >> >> 1. When seq is 0, there is no previous descriptor to finalize. This >>exception is important because we don't want to finalize the -1 >>placeholder. Otherwise, upon the first wrap, a descriptor will be >>prematurely finalized. >> >> 2. When a previous descriptor is being reserved for the first time, it >>might have a state_var value of 0 because the writer is still in >>prb_reserve() and has not set the initial value yet. I added >>considerable comments on this special case. >> >> I am comfortable with adding this new code, although it clearly adds >> complexity. >> >> John Ogness >> >> diff --git a/kernel/printk/printk_ringbuffer.c >> b/kernel/printk/printk_ringbuffer.c >> index 90d48973ac9e..1ed1e9eb930f 100644 >> --- a/kernel/printk/printk_ringbuffer.c >> +++ b/kernel/printk/printk_ringbuffer.c >> @@ -860,9 +860,11 @@ static bool desc_reserve(struct printk_ringbuffer *rb, >> unsigned long *id_out) >> struct prb_desc_ring *desc_ring = &rb->desc_ring; >> unsigned long prev_state_val; >> unsigned long id_prev_wrap; >> +unsigned long state_val; >> struct prb_desc *desc; >> unsigned long head_id; >> unsigned long id; >> +bool is_final; >> >> head_id = atomic_long_read(&desc_ring->head_id); /* LMM(desc_reserve:A) >> */ >> >> @@ -953,10 +955,17 @@ static bool desc_reserve(struct printk_ringbuffer *rb, >> unsigned long *id_out) >> * See "ABA Issues" about why this verification is performed. 
>> */ >> prev_state_val = atomic_long_read(&desc->state_var); /* >> LMM(desc_reserve:E) */ >> -if (prev_state_val && >> -get_desc_state(id_prev_wrap, prev_state_val, NULL) != >> desc_reusable) { >> -WARN_ON_ONCE(1); >> -return false; >> +if (get_desc_state(id_prev_wrap, prev_state_val, &is_final) != >> desc_reusable) { >> +/* >> + * If this descriptor has never been used, @prev_state_val >> + * will be 0. However, even though it may have never been >> + * used, it may have been finalized. So that flag must be >> + * ignored. >> + */ >> +if ((prev_state_val & ~DESC_FINAL_MASK)) { >> +WARN_ON_ONCE(1); >> +return false; >> +} >> } >> >> /* >> @@ -967,10 +976,25 @@ static bool desc_reserve(struct printk_ringbuffer *rb, >> unsigned long *id_out) >> * any other changes. A write memory barrier is sufficient for this. >> * This pairs with desc_read:D. >> */ >> -if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val, >> - id | 0)) { /* LMM(desc_reserve:F) */ >> -WARN_ON_ONCE(1); >> -return false; >> +if (is_final) >> +state_val = id | 0 | DESC_FINAL_MASK; > > The state from the previous wrap always have to have DESC_FINAL_MASK set. > Do I miss something, please? Important: FINAL is not a _state_. It is a _flag_ that marks a descriptor as non-reopenable. This was a simple change because it does not affect any state logic. The number of states and possible transitions have not changed. When a descriptor transitions to reusable, the FINAL flag is cleared. It has reached the end of its lifecycle. See desc_make_reusable(). (In order to have transitioned to reusable, the FINAL and COMMIT flags must have been set.) In the case of desc_reserve(), a reusable descriptor is transitioning to reserved. When this transition happens, there may already be a later descriptor that has been reserved and finalized this descriptor. If the FINAL flag is set here, it means that the FINAL flag is set for the _new_ descriptor being reserved. 
In summary, the FINAL flag can be set in _any_ state. Once set, it is preserved for all further state transitions. And it is cleared when that descriptor becomes reusable. >> +else >> +state_val = id | 0; >> +if (atomic_long_cmpxchg(&desc->state_var, prev_state_val, >> +state_val) != prev_state_val) { /* >> LMM(desc_reserve:F) */ >> +/* >> + * This reusable descriptor must have been final
Re: [PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
4 +1518,66 @@ void prb_commit(struct prb_reserved_entry *e) >> * this. This pairs with desc_read:B. >> */ >> if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val, >> - e->id | DESC_COMMIT_MASK)) { /* >> LMM(prb_commit:B) */ >> -WARN_ON_ONCE(1); >> + e->id | DESC_COMMIT_MASK | >> + final_mask)) { /* >> LMM(_prb_commit:B) */ >> +/* >> + * This reserved descriptor must have been finalized already. >> + * Retry with a reserved+final expected value. >> + */ >> +prev_state_val = e->id | 0 | DESC_FINAL_MASK; > > This does not make sense to me. The state "e->id | 0 | DESC_FINAL_MASK" > must never happen. It would mean that someone finalized > record that is still being modified. Correct. Setting the FINAL flag means the descriptor cannot be _reopened_. It has nothing to do with the current state of the descriptor. Once the FINAL flag is set, it remains set for the remaining lifetime of that record. > Or we both have different understanding of the logic. Yes. > Well, there are actually two approaches: > >+ I originally expected that FINAL bit could be set only when > COMMIT bit is set. But this brings the problems that prb_commit() > would need to set FINAL when it is not longer the last descriptor. My first attempt was to implement this. It turned out complex because it involves descriptors finalizing themselves _and_ descriptors finalizing their predecessor. This required two new memory barrier pairs: - between a writer committing and re-checking the head_id that another writer may have modified - between a writer setting the state and another writer checking that state After re-evaluating the purpose of the FINAL flag, I decided that it would be simpler to implement the 2nd approach (below) and would not require any new memory barrier pairs. >+ Another approach is that FINAL bit could be set even when the > COMMIT is not set. It would always be set by the next > prb_reserve(). But it causes that there are more possible > combinations of COMMIT and FINAL bits. 
As a result, the caller > would need try more variants of the cmpxchg() calls. And > it creates another races/cycles, ... It does not cause more races. And I don't see where it will cause more cmpxchg() calls. It probably _does_ lead to more cmpxchg() _code_. But those are fallbacks for when the common case fails. > I guess that you wanted to implement the 2nd approach and ended in > many troubles. I wonder if the 1st approach might be easier. Well, the "many troubles" were due to my naive assumption about the previous descriptor state. Once I realized that, the missing piece was obvious. I will reconsider the first approach. Perhaps adding memory barriers is preferable if it reduces lines of code. And we will need to clarify partial continuous line reading because right now that will not work. John Ogness [0] https://lkml.kernel.org/r/875z9nvvl2@jogness.linutronix.de ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
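For anyone following the exchange above, the fallback being debated can be modelled outside the kernel. This is only a toy, single-threaded Python sketch, not the kernel code: `Desc`, `cmpxchg()` and `commit()` are hypothetical names, the flag positions assume the 64-bit layout from patch 7/7, and the value written by the retry (`id | COMMIT | FINAL`) is my assumption, since the quoted hunk only shows the expected value being changed.

```python
SV_BITS = 64  # assumed width of the state variable
DESC_COMMIT_MASK = 1 << (SV_BITS - 1)
DESC_FINAL_MASK = 1 << (SV_BITS - 2)


class Desc:
    """Toy descriptor holding only a state variable."""

    def __init__(self, state_var):
        self.state_var = state_var


def cmpxchg(desc, expected, new):
    """Single-threaded stand-in for atomic_long_try_cmpxchg()."""
    if desc.state_var != expected:
        return False
    desc.state_var = new
    return True


def commit(desc, desc_id, finalize=False):
    """Commit a reserved descriptor, tolerating an already-set FINAL flag."""
    final_mask = DESC_FINAL_MASK if finalize else 0

    # Common case: the descriptor is reserved and not yet finalized.
    if cmpxchg(desc, desc_id | 0, desc_id | DESC_COMMIT_MASK | final_mask):
        return True

    # Fallback: another writer finalized the descriptor in the meantime,
    # so retry with a reserved+final expected value (assumed new value).
    return cmpxchg(desc, desc_id | 0 | DESC_FINAL_MASK,
                   desc_id | DESC_COMMIT_MASK | DESC_FINAL_MASK)
```

The second cmpxchg() is only reached when the first one fails, which matches the point made above that the extra code is a fallback rather than extra work in the common case.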
Re: [PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
On 2020-08-26, Petr Mladek wrote: >> This series makes a very naive assumption that the previous >> descriptor is either in the reserved or committed queried states. The >> fact is, it can be in any of the 4 queried states. Adding support for >> finalization of all the states then gets quite complex, since any >> state transition (cmpxchg) may have to deal with an unexpected FINAL >> flag. > > It has to be done in two steps to avoid race: > > prb_commit() > >+ set PRB_COMMIT_MASK >+ check if it is still the last descriptor in the array >+ set PRB_FINAL_MASK when it is not the last descriptor > > It should work because prb_reserve() finalizes the previous > descriptor after the new one is reserved. As a result: > >+ prb_reserve() should either see PRB_COMMIT_MASK in the previous > descriptor and be able to finalize it. > >+ or prb_commit() will see that the head moved and it is not > longer the last reserved one. I do not like the idea of relying on descriptors to finalize themselves. I worry that there might be some hole there. Failing to finalize basically disables printk, so that is pretty serious. Below is a patch against this series that adds support for finalizing all 4 queried states. It passes all my tests. Note that the code handles 2 corner cases: 1. When seq is 0, there is no previous descriptor to finalize. This exception is important because we don't want to finalize the -1 placeholder. Otherwise, upon the first wrap, a descriptor will be prematurely finalized. 2. When a previous descriptor is being reserved for the first time, it might have a state_var value of 0 because the writer is still in prb_reserve() and has not set the initial value yet. I added considerable comments on this special case. I am comfortable with adding this new code, although it clearly adds complexity. 
John Ogness diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 90d48973ac9e..1ed1e9eb930f 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -860,9 +860,11 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) struct prb_desc_ring *desc_ring = &rb->desc_ring; unsigned long prev_state_val; unsigned long id_prev_wrap; + unsigned long state_val; struct prb_desc *desc; unsigned long head_id; unsigned long id; + bool is_final; head_id = atomic_long_read(&desc_ring->head_id); /* LMM(desc_reserve:A) */ @@ -953,10 +955,17 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) * See "ABA Issues" about why this verification is performed. */ prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */ - if (prev_state_val && - get_desc_state(id_prev_wrap, prev_state_val, NULL) != desc_reusable) { - WARN_ON_ONCE(1); - return false; + if (get_desc_state(id_prev_wrap, prev_state_val, &is_final) != desc_reusable) { + /* +* If this descriptor has never been used, @prev_state_val +* will be 0. However, even though it may have never been +* used, it may have been finalized. So that flag must be +* ignored. +*/ + if ((prev_state_val & ~DESC_FINAL_MASK)) { + WARN_ON_ONCE(1); + return false; + } } /* @@ -967,10 +976,25 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) * any other changes. A write memory barrier is sufficient for this. * This pairs with desc_read:D. */ - if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val, -id | 0)) { /* LMM(desc_reserve:F) */ - WARN_ON_ONCE(1); - return false; + if (is_final) + state_val = id | 0 | DESC_FINAL_MASK; + else + state_val = id | 0; + if (atomic_long_cmpxchg(&desc->state_var, prev_state_val, + state_val) != prev_state_val) { /* LMM(desc_reserve:F) */ + /* +* This reusable descriptor must have been finalized already. 
+* Retry with a reusable+final expected value. +*/ + prev_state_val |= DESC_FINAL_MASK; + state_val |= DESC_FINAL_MASK; + + if (!atomic_long_try_cmpxchg(&desc->state_var, &prev_state_val, +state_val)) { /* LMM(desc_reserve:FIXME) */ + + WARN_ON_ONCE(1); + return false; +
Re: [PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
On 2020-08-26, Sergey Senozhatsky wrote: >>> @@ -1157,6 +1431,14 @@ bool prb_reserve(struct prb_reserved_entry *e, >>> struct printk_ringbuffer *rb, >>> goto fail; >>> } >>> >>> + /* >>> +* New data is about to be reserved. Once that happens, previous >>> +* descriptors are no longer able to be extended. Finalize the >>> +* previous descriptor now so that it can be made available to >>> +* readers (when committed). >>> +*/ >>> + desc_finalize(desc_ring, DESC_ID(id - 1)); >>> + >>> d = to_desc(desc_ring, id); >>> >>> /* >> >> Apparently this is not enough to guarantee that past descriptors are >> finalized. I am able to reproduce a scenario where the finalization >> of a certain descriptor never happens. That leaves the descriptor >> permanently in the reserved queried state, which prevents any new >> records from being created. I am investigating. > > Good to know. I also run into problems: > - broken dmesg (and broken journalctl -f /dev/kmsg poll) and broken > syslog read > > $ strace dmesg > > ... > openat(AT_FDCWD, "/dev/kmsg", O_RDONLY|O_NONBLOCK) = 3 > lseek(3, 0, SEEK_DATA) = 0 > read(3, 0x55dda8c240a8, 8191) = -1 EAGAIN (Resource temporarily > unavailable) > close(3)= 0 > syslog(10 /* SYSLOG_ACTION_SIZE_BUFFER */) = 524288 > mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7f43ea847000 > syslog(3 /* SYSLOG_ACTION_READ_ALL */, "", 524296) = 0 Yes, this a consequence of the problem. The tail is in the reserved queried state, so readers will not advance beyond it. This series makes a very naive assumption that the previous descriptor is either in the reserved or committed queried states. The fact is, it can be in any of the 4 queried states. Adding support for finalization of all the states then gets quite complex, since any state transition (cmpxchg) may have to deal with an unexpected FINAL flag. The ringbuffer was designed so that descriptors are completely self-contained. 
So adding logic where an action on one descriptor should affect another descriptor is far more complex than I initially expected.

Keep in mind the finalization concept satisfies 3 things:

- denote if a record can be extended (i.e. transition back to reserved)
- denote if a reader may read the record
- denote if a writer may recycle a record

I have not yet given up on the idea of finalization (particularly because it allows mainline LOG_CONT behavior to be preserved locklessly), but I am no longer sure if this is the direction we want to take.

John Ogness
Re: [PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
On 2020-08-24, John Ogness wrote:
> @@ -1157,6 +1431,14 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
>  		goto fail;
>  	}
>
> +	/*
> +	 * New data is about to be reserved. Once that happens, previous
> +	 * descriptors are no longer able to be extended. Finalize the
> +	 * previous descriptor now so that it can be made available to
> +	 * readers (when committed).
> +	 */
> +	desc_finalize(desc_ring, DESC_ID(id - 1));
> +
>  	d = to_desc(desc_ring, id);
>
>  	/*

Apparently this is not enough to guarantee that past descriptors are finalized. I am able to reproduce a scenario where the finalization of a certain descriptor never happens. That leaves the descriptor permanently in the reserved queried state, which prevents any new records from being created. I am investigating.

John Ogness
[PATCH v2 5/7][next] printk: ringbuffer: add finalization/extension support
Add support for extending the last data block. For this, introduce a new finalization state flag that identifies if a descriptor may be extended. When a writer calls the commit function prb_commit(), the record may still continue to be in the reserved queried state. In order for that record to enter into the committed queried state, that record also must be finalized. Finalization can occur anytime while the record is in the reserved queried state, even before the writer has called prb_commit(). Until a record is finalized (represented by "DESC_FINAL_MASK"), a writer may "reopen" that record and extend it with more data. Note that existing descriptors are automatically finalized whenever new descriptors are created. A record can never be "unfinalized". Two new memory barrier pairs are introduced, but these are really just alternate path barriers that exactly correspond to existing memory barriers. Signed-off-by: John Ogness --- kernel/printk/printk.c| 4 +- kernel/printk/printk_ringbuffer.c | 386 +++--- kernel/printk/printk_ringbuffer.h | 8 +- 3 files changed, 364 insertions(+), 34 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index ad8d1dfe5fbe..e063edd8adc2 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -532,7 +532,7 @@ static int log_store(u32 caller_id, int facility, int level, r.info->caller_id = caller_id; /* insert message */ - prb_commit(&e); + prb_commit_finalize(&e); return (text_len + trunc_msg_len); } @@ -1088,7 +1088,7 @@ static unsigned int __init add_to_rb(struct printk_ringbuffer *rb, dest_r.info->ts_nsec = r->info->ts_nsec; dest_r.info->caller_id = r->info->caller_id; - prb_commit(&e); + prb_commit_finalize(&e); return prb_record_text_space(&e); } diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index d66718e74aae..90d48973ac9e 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -49,14 +49,16 @@ * Descriptors have three states: * 
* reserved - * A writer is modifying the record. + * A writer is modifying the record. Internally represented as either "0" + * or "DESC_FINAL_MASK" or "DESC_COMMIT_MASK". * * committed * The record and all its data are complete and available for reading. + * Internally represented as "DESC_COMMIT_MASK | DESC_FINAL_MASK". * * reusable * The record exists, but its text and/or dictionary data may no longer - * be available. + * be available. Internally represented as "DESC_REUSE_MASK". * * Querying the @state_var of a record requires providing the ID of the * descriptor to query. This can yield a possible fourth (pseudo) state: @@ -79,6 +81,20 @@ * committed or reusable queried state. This makes it possible that a valid * sequence number of the tail is always available. * + * Descriptor Finalization + * ~~~ + * When a writer calls the commit function prb_commit(), the record may still + * continue to be in the reserved queried state. In order for that record to + * enter into the committed queried state, that record also must be finalized. + * Finalization can occur anytime while the record is in the reserved queried + * state, even before the writer has called prb_commit(). + * + * Until a record is finalized (represented by "DESC_FINAL_MASK"), a writer + * may "reopen" that record and extend it with more data. + * + * Note that existing descriptors are automatically finalized whenever new + * descriptors are created. A record can never be "unfinalized". + * * Data Rings * ~~ * The two data rings (text and dictionary) function identically. They exist @@ -153,9 +169,38 @@ * * r.info->ts_nsec = local_clock(); * + * prb_commit_finalize(&e); + * } + * + * Note that additional writer functions are available to extend a record + * after it has been committed but not yet finalized. This can be done as + * long as no new records have been reserved and the caller is the same. 
+ * + * Sample writer code (record extending):: + * + * // alternate rest of previous example + * r.info->ts_nsec = local_clock(); + * r.info->text_len = strlen(textstr); + * r.info->caller_id = printk_caller_id(); + * + * // commit the record (but do not finalize yet) * prb_commit(&e); * } * + * ... + * + * // specify additional 5 bytes text space to extend + * prb_rec_init_wr(&r, 5, 0); + * + * if (prb_reserve_last(&e, &test_rb, &r, printk_caller_id())) { + * snprintf(&r.text_buf[r.info->text_len], +
[PATCH v2 2/7][next] printk: ringbuffer: change representation of reusable
The reusable queried state is represented by the combined flags:

    DESC_COMMIT_MASK | DESC_REUSE_MASK

There is no reason for the DESC_COMMIT_MASK to be part of that representation. In particular, this will add confusion when more state flags are available.

Change the representation of the reusable queried state to just the DESC_REUSE_MASK flag.

Signed-off-by: John Ogness
---
 kernel/printk/printk_ringbuffer.c | 4 ++--
 kernel/printk/printk_ringbuffer.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 76248c82d557..d339ff7647da 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -463,7 +463,7 @@ static void desc_make_reusable(struct prb_desc_ring *desc_ring,
 			       unsigned long id)
 {
 	unsigned long val_committed = id | DESC_COMMIT_MASK;
-	unsigned long val_reusable = val_committed | DESC_REUSE_MASK;
+	unsigned long val_reusable = id | DESC_REUSE_MASK;
 	struct prb_desc *desc = to_desc(desc_ring, id);
 	atomic_long_t *state_var = &desc->state_var;
 
@@ -899,7 +899,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out)
 	 */
 	prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */
 	if (prev_state_val &&
-	    prev_state_val != (id_prev_wrap | DESC_COMMIT_MASK | DESC_REUSE_MASK)) {
+	    get_desc_state(id_prev_wrap, prev_state_val) != desc_reusable) {
 		WARN_ON_ONCE(1);
 		return false;
 	}
diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h
index dcda5e9b4676..96ef997d7bd6 100644
--- a/kernel/printk/printk_ringbuffer.h
+++ b/kernel/printk/printk_ringbuffer.h
@@ -213,7 +213,7 @@ struct prb_reserved_entry {
  */
 #define BLK0_LPOS(sz_bits)	(-(_DATA_SIZE(sz_bits)))
 #define DESC0_ID(ct_bits)	DESC_ID(-(_DESCS_COUNT(ct_bits) + 1))
-#define DESC0_SV(ct_bits)	(DESC_COMMIT_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits))
+#define DESC0_SV(ct_bits)	(DESC_REUSE_MASK | DESC0_ID(ct_bits))
 
 /*
  * Define a ringbuffer with an external text data buffer. The same as
-- 
2.20.1
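As a side note on why dropping DESC_COMMIT_MASK from the reusable representation is harmless for state queries: the get_desc_state() hunk in patch 1/7 tests the REUSE flag before the COMMIT flag, so both the old and the new encoding classify as reusable. The following is a rough Python model of that ordering, assuming a 64-bit state variable and the two-flag layout from before patch 5/7. The `"miss"` result for an ID mismatch is my own label for the pseudo state, which is not named in the quoted hunks.

```python
SV_BITS = 64  # assumed width of unsigned long
DESC_COMMIT_MASK = 1 << (SV_BITS - 1)
DESC_REUSE_MASK = 1 << (SV_BITS - 2)
DESC_FLAGS_MASK = DESC_COMMIT_MASK | DESC_REUSE_MASK
DESC_ID_MASK = ~DESC_FLAGS_MASK & ((1 << SV_BITS) - 1)


def desc_id(state_val):
    """Strip the flag bits, leaving only the descriptor ID."""
    return state_val & DESC_ID_MASK


def get_desc_state(id_, state_val):
    """Classify a state value; REUSE is checked before COMMIT."""
    if id_ != desc_id(state_val):
        return "miss"  # hypothetical name for the pseudo state
    if state_val & DESC_REUSE_MASK:
        return "reusable"
    if state_val & DESC_COMMIT_MASK:
        return "committed"
    return "reserved"
```

With this ordering, `id | DESC_COMMIT_MASK | DESC_REUSE_MASK` (old) and `id | DESC_REUSE_MASK` (new) both come back as reusable, which is why desc_reserve() can switch from a raw comparison to a get_desc_state() call.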
[PATCH v2 3/7][next] printk: ringbuffer: relocate get_data()
Move the internal get_data() function as-is above prb_reserve() so that a later change can make use of the static function. Signed-off-by: John Ogness --- kernel/printk/printk_ringbuffer.c | 116 +++--- 1 file changed, 58 insertions(+), 58 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index d339ff7647da..86af38c2cf77 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -1038,6 +1038,64 @@ static unsigned int space_used(struct prb_data_ring *data_ring, DATA_SIZE(data_ring) - DATA_INDEX(data_ring, blk_lpos->begin)); } +/* + * Given @blk_lpos, return a pointer to the writer data from the data block + * and calculate the size of the data part. A NULL pointer is returned if + * @blk_lpos specifies values that could never be legal. + * + * This function (used by readers) performs strict validation on the lpos + * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is + * triggered if an internal error is detected. + */ +static const char *get_data(struct prb_data_ring *data_ring, + struct prb_data_blk_lpos *blk_lpos, + unsigned int *data_size) +{ + struct prb_data_block *db; + + /* Data-less data block description. */ + if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { + if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { + *data_size = 0; + return ""; + } + return NULL; + } + + /* Regular data block: @begin less than @next and in same wrap. */ + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && + blk_lpos->begin < blk_lpos->next) { + db = to_block(data_ring, blk_lpos->begin); + *data_size = blk_lpos->next - blk_lpos->begin; + + /* Wrapping data block: @begin is one wrap behind @next. 
*/ + } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == + DATA_WRAPS(data_ring, blk_lpos->next)) { + db = to_block(data_ring, 0); + *data_size = DATA_INDEX(data_ring, blk_lpos->next); + + /* Illegal block description. */ + } else { + WARN_ON_ONCE(1); + return NULL; + } + + /* A valid data block will always be aligned to the ID size. */ + if (WARN_ON_ONCE(blk_lpos->begin != ALIGN(blk_lpos->begin, sizeof(db->id))) || + WARN_ON_ONCE(blk_lpos->next != ALIGN(blk_lpos->next, sizeof(db->id { + return NULL; + } + + /* A valid data block will always have at least an ID. */ + if (WARN_ON_ONCE(*data_size < sizeof(db->id))) + return NULL; + + /* Subtract block ID space from size to reflect data size. */ + *data_size -= sizeof(db->id); + + return &db->data[0]; +} + /** * prb_reserve() - Reserve space in the ringbuffer. * @@ -1192,64 +1250,6 @@ void prb_commit(struct prb_reserved_entry *e) local_irq_restore(e->irqflags); } -/* - * Given @blk_lpos, return a pointer to the writer data from the data block - * and calculate the size of the data part. A NULL pointer is returned if - * @blk_lpos specifies values that could never be legal. - * - * This function (used by readers) performs strict validation on the lpos - * values to possibly detect bugs in the writer code. A WARN_ON_ONCE() is - * triggered if an internal error is detected. - */ -static const char *get_data(struct prb_data_ring *data_ring, - struct prb_data_blk_lpos *blk_lpos, - unsigned int *data_size) -{ - struct prb_data_block *db; - - /* Data-less data block description. */ - if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) { - if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) { - *data_size = 0; - return ""; - } - return NULL; - } - - /* Regular data block: @begin less than @next and in same wrap. 
*/ - if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next) && - blk_lpos->begin < blk_lpos->next) { - db = to_block(data_ring, blk_lpos->begin); - *data_size = blk_lpos->next - blk_lpos->begin; - - /* Wrapping data block: @begin is one wrap behind @next. */ - } else if (DATA_WRAPS(data_ring, blk_lpos->begin + DATA_SIZE(data_ring)) == - DATA_WRAPS(data_ring, blk_lpos->next)) { - db = to_block(data_ring, 0); - *da
[PATCH v2 7/7][next] scripts/gdb: support printk finalized records
With commit ("printk: ringbuffer: add finalization/extension support") a new state bit for finalized records was added. This not only changed the bit representation of committed records, but also reduced the size for record IDs. Update the gdb scripts to correctly interpret the state variable. Signed-off-by: John Ogness --- Documentation/admin-guide/kdump/gdbmacros.txt | 10 +++--- scripts/gdb/linux/dmesg.py| 10 ++ 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt index 6025534c6c14..1ccc811c82ad 100644 --- a/Documentation/admin-guide/kdump/gdbmacros.txt +++ b/Documentation/admin-guide/kdump/gdbmacros.txt @@ -295,8 +295,11 @@ document dump_record end define dmesg - set var $desc_committed = 1UL << ((sizeof(long) * 8) - 1) - set var $flags_mask = 3UL << ((sizeof(long) * 8) - 2) + # definitions from kernel/printk/printk_ringbuffer.h + set var $desc_commit = 1UL << ((sizeof(long) * 8) - 1) + set var $desc_final = 1UL << ((sizeof(long) * 8) - 2) + set var $desc_reuse = 1UL << ((sizeof(long) * 8) - 3) + set var $flags_mask = $desc_commit | $desc_final | $desc_reuse set var $id_mask = ~$flags_mask set var $desc_count = 1U << prb->desc_ring.count_bits @@ -309,7 +312,8 @@ define dmesg set var $desc = &prb->desc_ring.descs[$id % $desc_count] # skip non-committed record - if (($desc->state_var.counter & $flags_mask) == $desc_committed) + # (note that commit+!final records will be displayed) + if (($desc->state_var.counter & $desc_commit) == $desc_commit) dump_record $desc $prev_flags set var $prev_flags = $desc->info.flags end diff --git a/scripts/gdb/linux/dmesg.py b/scripts/gdb/linux/dmesg.py index 6c6022012ea8..367523c5c270 100644 --- a/scripts/gdb/linux/dmesg.py +++ b/scripts/gdb/linux/dmesg.py @@ -79,9 +79,10 @@ class LxDmesg(gdb.Command): # definitions from kernel/printk/printk_ringbuffer.h desc_sv_bits = utils.get_long_type().sizeof * 8 -desc_committed_mask = 1 << 
(desc_sv_bits - 1) -desc_reuse_mask = 1 << (desc_sv_bits - 2) -desc_flags_mask = desc_committed_mask | desc_reuse_mask +desc_commit_mask = 1 << (desc_sv_bits - 1) +desc_final_mask = 1 << (desc_sv_bits - 2) +desc_reuse_mask = 1 << (desc_sv_bits - 3) +desc_flags_mask = desc_commit_mask | desc_final_mask | desc_reuse_mask desc_id_mask = ~desc_flags_mask # read in tail and head descriptor ids @@ -96,8 +97,9 @@ class LxDmesg(gdb.Command): desc_off = desc_sz * ind # skip non-committed record +# (note that commit+!final records will be displayed) state = utils.read_u64(descs, desc_off + sv_off + counter_off) & desc_flags_mask -if state != desc_committed_mask: +if state & desc_commit_mask != desc_commit_mask: if did == head_id: break did = (did + 1) & desc_id_mask -- 2.20.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
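To make the new bit layout easier to check by hand, here is a small standalone Python sketch of the decoding that the patched scripts perform. It is not part of scripts/gdb: `split_state_var()` and `is_displayed()` are hypothetical helper names, and the word size is passed in explicitly instead of being read from the debugged kernel.

```python
def split_state_var(sv, sv_bits=64):
    """Split a descriptor state value into its ID and the three flags."""
    commit = 1 << (sv_bits - 1)
    final = 1 << (sv_bits - 2)
    reuse = 1 << (sv_bits - 3)
    flags_mask = commit | final | reuse
    # Everything below the three flag bits is the descriptor ID.
    id_mask = ~flags_mask & ((1 << sv_bits) - 1)
    return {
        "id": sv & id_mask,
        "commit": bool(sv & commit),
        "final": bool(sv & final),
        "reuse": bool(sv & reuse),
    }


def is_displayed(sv, sv_bits=64):
    """Mirror the patched filter: any record with COMMIT set is shown,
    including commit+!final records."""
    return split_state_var(sv, sv_bits)["commit"]
```

Note that a reusable record (REUSE set, COMMIT clear) is still skipped by this filter, the same as before the change in representation.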
[PATCH v2 6/7][next] printk: reimplement log_cont using record extension
Use the record extending feature of the ringbuffer to implement continuous messages. This preserves the existing continuous message behavior. Signed-off-by: John Ogness --- kernel/printk/printk.c | 96 +- 1 file changed, 19 insertions(+), 77 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index e063edd8adc2..80afee3cfec7 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -532,7 +532,10 @@ static int log_store(u32 caller_id, int facility, int level, r.info->caller_id = caller_id; /* insert message */ - prb_commit_finalize(&e); + if ((flags & LOG_CONT) || !(flags & LOG_NEWLINE)) + prb_commit(&e); + else + prb_commit_finalize(&e); return (text_len + trunc_msg_len); } @@ -1888,87 +1891,26 @@ static inline u32 printk_caller_id(void) 0x8000 + raw_smp_processor_id(); } -/* - * Continuation lines are buffered, and not committed to the record buffer - * until the line is complete, or a race forces it. The line fragments - * though, are printed immediately to the consoles to ensure everything has - * reached the console in case of a kernel crash. - */ -static struct cont { - char buf[LOG_LINE_MAX]; - size_t len; /* length == 0 means unused buffer */ - u32 caller_id; /* printk_caller_id() of first print */ - u64 ts_nsec;/* time of first print */ - u8 level; /* log level of first message */ - u8 facility;/* log facility of first message */ - enum log_flags flags; /* prefix, newline flags */ -} cont; - -static void cont_flush(void) -{ - if (cont.len == 0) - return; - - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; -} - -static bool cont_add(u32 caller_id, int facility, int level, -enum log_flags flags, const char *text, size_t len) -{ - /* If the line gets too long, split it up in separate records. 
*/ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; - } - - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; - - // The original flags come from the first line, - // but later continuations can add a newline. - if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); - } - - return true; -} - static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) { const u32 caller_id = printk_caller_id(); - /* -* If an earlier line was buffered, and we're a continuation -* write from the same context, try to add it to the buffer. -*/ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) + if (lflags & LOG_CONT) { + struct prb_reserved_entry e; + struct printk_record r; + + prb_rec_init_wr(&r, text_len, 0); + if (prb_reserve_last(&e, prb, &r, caller_id)) { + memcpy(&r.text_buf[r.info->text_len], text, text_len); + r.info->text_len += text_len; + if (lflags & LOG_NEWLINE) { + r.info->flags |= LOG_NEWLINE; + prb_commit_finalize(&e); + } else { + prb_commit(&e); + } return text_len; + } } /* Store it in the record log */ -- 2.20.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
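The resulting behavior can be illustrated with a toy Python model of the new log_output()/log_store() flow. This is only a sketch under simplifying assumptions: `ToyLog` is a hypothetical stand-in for the ringbuffer, there is no concurrency or size limit, the flag values are illustrative, and "reserving a new record finalizes the previous one" is taken from the description in patch 5/7.

```python
LOG_NEWLINE = 2  # illustrative flag values
LOG_CONT = 8


class ToyLog:
    """Single-threaded model of record extension."""

    def __init__(self):
        self.records = []  # dicts with "text", "caller_id", "final"

    def _reserve_last(self, caller_id):
        """Reopen the last record if it is not finalized and the caller matches."""
        if not self.records:
            return None
        last = self.records[-1]
        if last["final"] or last["caller_id"] != caller_id:
            return None
        return last

    def _store(self, caller_id, flags, text):
        # Reserving a new record finalizes the previous one.
        if self.records:
            self.records[-1]["final"] = True
        # Commit without finalizing if more text may follow.
        keep_open = bool(flags & LOG_CONT) or not (flags & LOG_NEWLINE)
        self.records.append(
            {"text": text, "caller_id": caller_id, "final": not keep_open})

    def output(self, caller_id, flags, text):
        if flags & LOG_CONT:
            last = self._reserve_last(caller_id)
            if last is not None:
                last["text"] += text
                if flags & LOG_NEWLINE:
                    last["final"] = True  # commit + finalize
                return len(text)
        # Not a continuation, or extending failed: store a new record.
        self._store(caller_id, flags, text)
        return len(text)
```

A continuation from a different caller cannot reopen the record, so it falls through to a new record, which in turn finalizes the one before it.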
[PATCH v2 0/7][next] printk: reimplement LOG_CONT handling
Hello,

Here is v2 for the second series to rework the printk subsystem. (The v1 is here [0].) This series implements a new ringbuffer feature that allows the last record to be extended. Petr Mladek provided the initial proof of concept [1] for this.

Using the record extension feature, LOG_CONT is re-implemented in a way that exactly preserves its behavior, but avoids the need for an extra buffer. In particular, it avoids the need for any synchronization that such a buffer requires.

This series deviates from the agreements [2] made at the meeting during LPC2019 in Lisbon. The test results of the v1 series showed that the effects on existing userspace tools using /dev/kmsg (journalctl, dmesg) were not acceptable [3]. That is why a new decision [4] was made to preserve the current LOG_CONT behavior.

Patch 5 introduces two new memory barriers. However, both are alternate path memory barriers. They exactly match the purpose and context of the two existing memory barriers that they provide an alternate path for. For this reason, I do not believe that a new memory barrier review is necessary. Nevertheless, I have included the memory barrier experts in CC.

Patch 6 assumes that the gdb script series [5] for the new printk ringbuffer has been applied.

John Ogness

[0] https://lkml.kernel.org/r/20200717234818.8622-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20200812163908.GH12903@alley
[2] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
[3] https://lkml.kernel.org/r/20200811160551.GC12903@alley
[4] https://lkml.kernel.org/r/CAHk-=wj_b6Bh=d-wwh0xyqoqbhhkyeexhszkpxdra6gjtvk...@mail.gmail.com
[5] https://lkml.kernel.org/r/20200814212525.6118-1-john.ogn...@linutronix.de

John Ogness (7):
  printk: ringbuffer: rename DESC_COMMITTED_MASK flag
  printk: ringbuffer: change representation of reusable
  printk: ringbuffer: relocate get_data()
  printk: ringbuffer: add BLK_DATALESS() macro
  printk: ringbuffer: add finalization/extension support
  printk: reimplement log_cont using record extension
  scripts/gdb: support printk finalized records

 Documentation/admin-guide/kdump/gdbmacros.txt |  10 +-
 kernel/printk/printk.c                        |  98 +---
 kernel/printk/printk_ringbuffer.c             | 496 +++---
 kernel/printk/printk_ringbuffer.h             |  12 +-
 scripts/gdb/linux/dmesg.py                    |  10 +-
 5 files changed, 453 insertions(+), 173 deletions(-)

-- 
2.20.1
[PATCH v2 1/7][next] printk: ringbuffer: rename DESC_COMMITTED_MASK flag
The flag DESC_COMMITTED_MASK has a much longer name compared to the other state flags and also is in past tense form, rather than in command form. Rename the flag to DESC_COMMIT_MASK in order to match the other state flags. Signed-off-by: John Ogness --- kernel/printk/printk_ringbuffer.c | 8 kernel/printk/printk_ringbuffer.h | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 0659b50872b5..76248c82d557 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -361,7 +361,7 @@ static enum desc_state get_desc_state(unsigned long id, if (state_val & DESC_REUSE_MASK) return desc_reusable; - if (state_val & DESC_COMMITTED_MASK) + if (state_val & DESC_COMMIT_MASK) return desc_committed; return desc_reserved; @@ -462,7 +462,7 @@ static enum desc_state desc_read(struct prb_desc_ring *desc_ring, static void desc_make_reusable(struct prb_desc_ring *desc_ring, unsigned long id) { - unsigned long val_committed = id | DESC_COMMITTED_MASK; + unsigned long val_committed = id | DESC_COMMIT_MASK; unsigned long val_reusable = val_committed | DESC_REUSE_MASK; struct prb_desc *desc = to_desc(desc_ring, id); atomic_long_t *state_var = &desc->state_var; @@ -899,7 +899,7 @@ static bool desc_reserve(struct printk_ringbuffer *rb, unsigned long *id_out) */ prev_state_val = atomic_long_read(&desc->state_var); /* LMM(desc_reserve:E) */ if (prev_state_val && - prev_state_val != (id_prev_wrap | DESC_COMMITTED_MASK | DESC_REUSE_MASK)) { + prev_state_val != (id_prev_wrap | DESC_COMMIT_MASK | DESC_REUSE_MASK)) { WARN_ON_ONCE(1); return false; } @@ -1184,7 +1184,7 @@ void prb_commit(struct prb_reserved_entry *e) * this. This pairs with desc_read:B. 
*/ if (!atomic_long_try_cmpxchg(&d->state_var, &prev_state_val, -e->id | DESC_COMMITTED_MASK)) { /* LMM(prb_commit:B) */ +e->id | DESC_COMMIT_MASK)) { /* LMM(prb_commit:B) */ WARN_ON_ONCE(1); } diff --git a/kernel/printk/printk_ringbuffer.h b/kernel/printk/printk_ringbuffer.h index e6302da041f9..dcda5e9b4676 100644 --- a/kernel/printk/printk_ringbuffer.h +++ b/kernel/printk/printk_ringbuffer.h @@ -115,9 +115,9 @@ struct prb_reserved_entry { #define _DATA_SIZE(sz_bits)(1UL << (sz_bits)) #define _DESCS_COUNT(ct_bits) (1U << (ct_bits)) #define DESC_SV_BITS (sizeof(unsigned long) * 8) -#define DESC_COMMITTED_MASK(1UL << (DESC_SV_BITS - 1)) +#define DESC_COMMIT_MASK (1UL << (DESC_SV_BITS - 1)) #define DESC_REUSE_MASK(1UL << (DESC_SV_BITS - 2)) -#define DESC_FLAGS_MASK(DESC_COMMITTED_MASK | DESC_REUSE_MASK) +#define DESC_FLAGS_MASK(DESC_COMMIT_MASK | DESC_REUSE_MASK) #define DESC_ID_MASK (~DESC_FLAGS_MASK) #define DESC_ID(sv)((sv) & DESC_ID_MASK) #define FAILED_LPOS0x1 @@ -213,7 +213,7 @@ struct prb_reserved_entry { */ #define BLK0_LPOS(sz_bits) (-(_DATA_SIZE(sz_bits))) #define DESC0_ID(ct_bits) DESC_ID(-(_DESCS_COUNT(ct_bits) + 1)) -#define DESC0_SV(ct_bits) (DESC_COMMITTED_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits)) +#define DESC0_SV(ct_bits) (DESC_COMMIT_MASK | DESC_REUSE_MASK | DESC0_ID(ct_bits)) /* * Define a ringbuffer with an external text data buffer. The same as -- 2.20.1 ___ kexec mailing list kexec@lists.infradead.org http://lists.infradead.org/mailman/listinfo/kexec
[PATCH v2 4/7][next] printk: ringbuffer: add BLK_DATALESS() macro
Rather than continually needing to explicitly check @begin and @next to
identify a dataless block, introduce and use a BLK_DATALESS() macro.

Signed-off-by: John Ogness
---
 kernel/printk/printk_ringbuffer.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c
index 86af38c2cf77..d66718e74aae 100644
--- a/kernel/printk/printk_ringbuffer.c
+++ b/kernel/printk/printk_ringbuffer.c
@@ -266,6 +266,8 @@
 
 /* Determine if a logical position refers to a data-less block. */
 #define LPOS_DATALESS(lpos)		((lpos) & 1UL)
+#define BLK_DATALESS(blk)		(LPOS_DATALESS((blk)->begin) && \
+					 LPOS_DATALESS((blk)->next))
 
 /* Get the logical position at index 0 of the current wrap. */
 #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \
@@ -1021,7 +1023,7 @@ static unsigned int space_used(struct prb_data_ring *data_ring,
 			       struct prb_data_blk_lpos *blk_lpos)
 {
 	/* Data-less blocks take no space. */
-	if (LPOS_DATALESS(blk_lpos->begin))
+	if (BLK_DATALESS(blk_lpos))
 		return 0;
 
 	if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) {
@@ -1054,7 +1056,7 @@ static const char *get_data(struct prb_data_ring *data_ring,
 	struct prb_data_block *db;
 
 	/* Data-less data block description. */
-	if (LPOS_DATALESS(blk_lpos->begin) && LPOS_DATALESS(blk_lpos->next)) {
+	if (BLK_DATALESS(blk_lpos)) {
 		if (blk_lpos->begin == NO_LPOS && blk_lpos->next == NO_LPOS) {
 			*data_size = 0;
 			return "";
-- 
2.20.1
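
Editorial sketch: the macro relies on dataless blocks being marked with odd logical positions, so one low-bit test on @begin and @next separates them from real data blocks. The user-space C fragment below models that, including the reader-side distinction between an intentionally empty record (NO_LPOS) and a failed allocation (FAILED_LPOS). Several things here are assumptions: the value 0x3 for NO_LPOS (only FAILED_LPOS 0x1 appears in the quoted header), the premise that real positions are always even, the struct name blk_lpos, and the simplified sketch_get_data() helper, which ignores wrapping and block headers.

```c
#include <stddef.h>

#define FAILED_LPOS	0x1UL	/* value shown in printk_ringbuffer.h */
#define NO_LPOS		0x3UL	/* assumed value; any other odd constant would do */

/* Stand-in for struct prb_data_blk_lpos. */
struct blk_lpos {
	unsigned long begin;
	unsigned long next;
};

/* Macros as introduced by the patch. */
#define LPOS_DATALESS(lpos)	((lpos) & 1UL)
#define BLK_DATALESS(blk)	(LPOS_DATALESS((blk)->begin) && \
				 LPOS_DATALESS((blk)->next))

/*
 * Hypothetical, simplified reader: returns "" for a valid empty record,
 * NULL for a record whose data failed to allocate, and otherwise a
 * pointer into @data (no wrapping, no block header in this model).
 */
static const char *sketch_get_data(const char *data,
				   const struct blk_lpos *blk, size_t *size)
{
	if (BLK_DATALESS(blk)) {
		if (blk->begin == NO_LPOS && blk->next == NO_LPOS) {
			*size = 0;
			return "";
		}
		return NULL;
	}

	*size = blk->next - blk->begin;
	return data + blk->begin;
}
```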
[PATCH][next] docs: vmcoreinfo: add lockless printk ringbuffer vmcoreinfo
With the introduction of the lockless printk ringbuffer, the VMCOREINFO relating to the kernel log buffer was changed. Update the documentation to match those changes. Fixes: ("printk: use the lockless ringbuffer") Signed-off-by: John Ogness Reported-by: Nick Desaulniers --- based on next-20200814 .../admin-guide/kdump/vmcoreinfo.rst | 131 ++ 1 file changed, 102 insertions(+), 29 deletions(-) diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 2baad0bfb09d..eb116905c31c 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -189,50 +189,123 @@ from this. Free areas descriptor. User-space tools use this value to iterate the free_area ranges. MAX_ORDER is used by the zone buddy allocator. -log_first_idx +prb +--- + +A pointer to the printk ringbuffer (struct printk_ringbuffer). This +may be pointing to the static boot ringbuffer or the dynamically +allocated ringbuffer, depending on when the the core dump occurred. +Used by user-space tools to read the active kernel log buffer. + +printk_rb_static + + +A pointer to the static boot printk ringbuffer. If @prb has a +different value, this is useful for viewing the initial boot messages, +which may have been overwritten in the dynamically allocated +ringbuffer. + +clear_seq +- + +The sequence number of the printk() record after the last clear +command. It indicates the first record after the last +SYSLOG_ACTION_CLEAR, like issued by 'dmesg -c'. Used by user-space +tools to dump a subset of the dmesg log. + +printk_ringbuffer +- + +The size of a printk_ringbuffer structure. This structure contains all +information required for accessing the various components of the +kernel log buffer. + +(printk_ringbuffer, desc_ring|text_data_ring|dict_data_ring|fail) +- + +Offsets for the various components of the printk ringbuffer. 
Used by +user-space tools to view the kernel log buffer without requiring the +declaration of the structure. + +prb_desc_ring - -Index of the first record stored in the buffer log_buf. Used by -user-space tools to read the strings in the log_buf. +The size of the prb_desc_ring structure. This structure contains +information about the set of record descriptors. -log_buf +(prb_desc_ring, count_bits|descs|head_id|tail_id) +- + +Offsets for the fields describing the set of record descriptors. Used +by user-space tools to be able to traverse the descriptors without +requiring the declaration of the structure. + +prb_desc + + +The size of the prb_desc structure. This structure contains +information about a single record descriptor. + +(prb_desc, info|state_var|text_blk_lpos|dict_blk_lpos) +-- + +Offsets for the fields describing a record descriptors. Used by +user-space tools to be able to read descriptors without requiring +the declaration of the structure. + +prb_data_blk_lpos +- + +The size of the prb_data_blk_lpos structure. This structure contains +information about where the text or dictionary data (data block) is +located within the respective data ring. + +(prb_data_blk_lpos, begin|next) +--- -Console output is written to the ring buffer log_buf at index -log_first_idx. Used to get the kernel log. +Offsets for the fields describing the location of a data block. Used +by user-space tools to be able to locate data blocks without +requiring the declaration of the structure. -log_buf_len +printk_info --- -log_buf's length. +The size of the printk_info structure. This structure contains all +the meta-data for a record. -clear_idx -- +(printk_info, seq|ts_nsec|text_len|dict_len|caller_id) +-- -The index that the next printk() record to read after the last clear -command. It indicates the first record after the last SYSLOG_ACTION -_CLEAR, like issued by 'dmesg -c'. Used by user-space tools to dump -the dmesg log. 
+Offsets for the fields providing the meta-data for a record. Used by +user-space tools to be able to read the information without requiring +the declaration of the structure. -log_next_idx - +prb_data_ring +- -The index of the next record to store in the buffer log_buf. Used to -compute the index of the current buffer position. +The size of the prb_data_ring structure. This structure contains +information about a set of data blocks. -printk_log --- +(prb_data_ring, size_bits|data|head_lpos|tail_lpos) +--- -The size of a structure printk_log. Used to compute the size of -messages, and extract dmesg log. It
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On 2020-08-14, Sergey Senozhatsky wrote:
> One thing that we need to handle here, I believe, is that the context
> which crashes the kernel should flush its cont buffer, because the
> information there is relevant to the crash:
>
>	pr_cont_alloc_info(&c);
>	pr_cont(&c, "1");
>	pr_cont(&c, "2");
>	>>
>	oops
>	panic()
>	<<
>	pr_cont_flush(&c);
>
> We better flush that context's pr_cont buffer during panic().

I am not convinced of the general usefulness of partial messages, but as
long as we have an API that includes registration, usage, and
deregistration of some sort of handle, then we leave the window open for
such implementations.

John Ogness
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On 2020-08-13, Petr Mladek wrote:
> On Thu 2020-08-13 09:50:25, John Ogness wrote:
>> On 2020-08-13, Sergey Senozhatsky wrote:
>> > This is not an unseen pattern, I'm afraid. And the problem here can
>> > be more general:
>> >
>> >	pr_info("text");
>> >	pr_cont("1");
>> >	exception/IRQ/NMI
>> >	pr_alert("text\n");
>> >	pr_cont("2");
>> >	pr_cont("\n");
>> >
>> > I guess the solution would be to store "last log_level" in task_struct
>> > and get current (new) timestamp for broken cont line?
>>
>> (Warning: new ideas ahead)
>>
>> The fundamental problem is that there is no real association between
>> the cont parts. So any interruption results in a broken record. If we
>> really want to do this correctly, we need real association.

I believe I failed to recognize the fundamental problem. The fundamental
problem is that the pr_cont() semantics are very poor. I now strongly
believe that we need to fix those semantics by having the pr_cont() user
take responsibility for buffering the message. Patching the ~2000
pr_cont() users will be far easier than continuing to twist ourselves
around this madness.

Here is an example for a new pr_cont() API:

    struct pr_cont c;

    pr_cont_alloc_info(&c);
       (or alternatively)
    dev_cont_alloc_info(dev, &c);

    pr_cont(&c, "1");
    pr_cont(&c, "2");

    pr_cont_flush(&c);

Using macro magic, there can be the usual dbg, warn, err, etc. variants
of the alloc functions.

The alloc function would need to work for any context, but that would
not be an issue. If the cont message started to get too large, pr_cont()
could do its own flushing in between, while still holding on to the
context information. If for some reason the alloc function could not
allocate a buffer, all the pr_cont() calls could fallback to logging the
individual cont parts.

I believe this would solve all cont-related problems while also allowing
the new ringbuffer to remain as it already is in linux-next.

John Ogness
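
Editorial sketch: the proposal above moves buffering of continuation parts from printk into the caller. The user-space C fragment below models that idea under stated assumptions. All names (struct sk_cont, sk_cont_begin(), sk_cont_add(), sk_cont_flush(), sk_store()) are hypothetical stand-ins, not the proposed kernel API, and a fixed on-stack buffer replaces the alloc function. The point it illustrates is that a message stored by an interrupting context cannot split the cont line, because the parts only reach the log as one record at flush time.

```c
#include <stdio.h>
#include <string.h>

#define SK_CONT_MAX 64

/* Hypothetical caller-owned cont context. */
struct sk_cont {
	char buf[SK_CONT_MAX];
	size_t len;
	int level;	/* log level captured when the line is started */
};

/* Stand-in for the ringbuffer: records are appended as "<level>text\n". */
static char sk_log[256];

static void sk_store(int level, const char *text, size_t len)
{
	size_t used = strlen(sk_log);

	snprintf(sk_log + used, sizeof(sk_log) - used, "<%d>%.*s\n",
		 level, (int)len, text);
}

static void sk_cont_begin(struct sk_cont *c, int level)
{
	c->len = 0;
	c->level = level;
}

/* Store the buffered parts as a single record. */
static void sk_cont_flush(struct sk_cont *c)
{
	if (c->len) {
		sk_store(c->level, c->buf, c->len);
		c->len = 0;
	}
}

/* Append a part; flush in between if the line grows too large. */
static void sk_cont_add(struct sk_cont *c, const char *text)
{
	size_t len = strlen(text);

	if (c->len + len > sizeof(c->buf))
		sk_cont_flush(c);	/* context (level) is kept */
	if (len > sizeof(c->buf))
		len = sizeof(c->buf);	/* truncate an oversized part */

	memcpy(c->buf + c->len, text, len);
	c->len += len;
}
```

In this model a caller would run sk_cont_begin(), any number of sk_cont_add() calls, and finally sk_cont_flush(); an sk_store() issued in between by another context lands in the log as its own complete record.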
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On 2020-08-13, Sergey Senozhatsky wrote:
> This is not an unseen pattern, I'm afraid. And the problem here can
> be more general:
>
>	pr_info("text");
>	pr_cont("1");
>	exception/IRQ/NMI
>	pr_alert("text\n");
>	pr_cont("2");
>	pr_cont("\n");
>
> I guess the solution would be to store "last log_level" in task_struct
> and get current (new) timestamp for broken cont line?

(Warning: new ideas ahead)

The fundamental problem is that there is no real association between the
cont parts. So any interruption results in a broken record. If we really
want to do this correctly, we need real association.

With the new finalize flag for records, I thought about perhaps adding
support for chaining data blocks.

A data block currently stores an unsigned long for the ID of the
associated descriptor. But it could optionally include a second unsigned
long, which is the lpos of the next text part. All the data blocks of a
chain would point back to the same descriptor. The descriptor would only
point to the first data block of the chain and include a flag that it is
using chained data blocks.

Then we would only need to track the sequence number of the open record
and new data blocks could be added to the data block chain of the
correct record. Readers cannot see the record until it is finalized.

Also, since only finalized records can be invalidated, there are no
races of chains becoming invalidated while being appended.

My concerns about this idea:

- What if the printk user does not correctly terminate the cont message?
  There is no mechanism to allow that open record to be force-finalized
  so that readers can read newer records.

- For tasks, the sequence number of the open record could be stored on
  the task_struct. For non-tasks, we could use a global per-cpu variable
  where each CPU stores 2 sequence numbers: the sequence number of the
  open record for the non-task and the sequence number of the open
  record for an interrupting NMI. Is that sufficient?

John Ogness
Re: POC: Alternative solution: Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On 2020-08-12, Petr Mladek wrote:
> So, I have one crazy idea to add one more state bit so that we
> could have:
>
>   + committed: set when the data are written into the data ring.
>   + final: set when the data block could no longer get reopened
>   + reuse: set when the descriptor/data block could get reused
>
> "final" bit will define when the descriptor could no longer
> get reopened (cleared committed bit) and the data block could
> not get extended.

I had not thought of extending data blocks. That is clever!

I implemented this solution for myself and am currently running more
tests. Some things that I changed from your suggestion:

1. I created a separate prb_reserve_cont() function. The reason for this
is because the caller needs to understand what is happening. The caller
is getting an existing record with existing data and must append new
data. The @text_len field of the info reports how long the existing data
is. So the LOG_CONT handling code in printk.c looks something like this:

	if (lflags & LOG_CONT) {
		struct prb_reserved_entry e;
		struct printk_record r;

		prb_rec_init_wr(&r, text_len, 0);

		if (prb_reserve_cont(&e, prb, &r, caller_id)) {
			memcpy(&r.text_buf[r.info->text_len], text, text_len);
			r.info->text_len += text_len;

			if (lflags & LOG_NEWLINE)
				r.info->flags |= LOG_NEWLINE;

			if (r.info->flags & LOG_NEWLINE)
				prb_commit_finalize(&e);
			else
				prb_commit(&e);

			return text_len;
		}
	}

This seemed simpler than trying to extend prb_reserve() to secretly
support LOG_CONT records.

2. I haven't yet figured out how to preserve calling context when a
newline appears. For example:

	pr_info("text");
	pr_cont(" 1");
	pr_cont(" 2\n");
	pr_cont("3");
	pr_cont(" 4\n");

For "3" the calling context (info, timestamp) is lost because with "2"
the record is finalized. Perhaps the above is invalid usage of LOG_CONT?

3. There are some memory barriers introduced, but it looks like it
shouldn't add too much complexity.

I will continue to refine my working version and post a patch so that we
have something to work with. This looks to be the most promising way
forward. Thanks.

John Ogness
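
Editorial sketch: the reopen-and-extend scheme discussed above can be modelled in a few lines of user-space C. A committed but not yet finalized record may be reopened by the same caller and extended, and readers only see it once it is finalized. Everything here is hypothetical: struct sk_record and the sk_*() helpers are illustrative names, the two booleans stand in for the "committed" and "final" state bits, and no concurrency or memory barriers are modelled. In particular sk_reserve_cont() only mimics the described behaviour of prb_reserve_cont(), whose real implementation is not shown in this thread.

```c
#include <stdbool.h>
#include <string.h>

#define SK_TEXT_MAX 64

/* Hypothetical single-record model. */
struct sk_record {
	char text[SK_TEXT_MAX];
	size_t text_len;
	unsigned int caller_id;
	bool committed;	/* data is consistent */
	bool finalized;	/* record can no longer be reopened */
};

/* Try to reopen the record so that @extra bytes can be appended. */
static bool sk_reserve_cont(struct sk_record *r, unsigned int caller_id,
			    size_t extra)
{
	if (!r->committed || r->finalized)
		return false;
	if (r->caller_id != caller_id)
		return false;
	if (r->text_len + extra > sizeof(r->text))
		return false;

	r->committed = false;	/* reopened: clear the committed state */
	return true;
}

/* Append one cont part; a trailing newline finalizes the record. */
static bool sk_cont(struct sk_record *r, unsigned int caller_id,
		    const char *text)
{
	size_t len = strlen(text);
	bool newline = len && text[len - 1] == '\n';

	if (newline)
		len--;
	if (!sk_reserve_cont(r, caller_id, len))
		return false;

	memcpy(&r->text[r->text_len], text, len);
	r->text_len += len;
	r->committed = true;
	if (newline)
		r->finalized = true;
	return true;
}

/* Readers are only given finalized records. */
static bool sk_reader_visible(const struct sk_record *r)
{
	return r->committed && r->finalized;
}
```

The model also reproduces concern 2 from the message: once " 2\n" has finalized the record, a following sk_cont() with "3" fails and would have to start a new record with fresh context.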
Re: [PATCH v5 4/4] printk: use the lockless ringbuffer
On 2020-08-11, Nick Desaulniers wrote:
> From what I can tell, I think this patch ("printk: use the lockless
> ringbuffer") breaks lx-dmesg in CONFIG_GDB_SCRIPTS.
>
> (gdb) lx-dmesg
> Python Exception No symbol "log_first_idx" in specified
> context.:
> Error occurred in Python: No symbol "log_first_idx" in specified context.
>
> This command is used to dump the printk log buffer.
>
> It looks like the only places left in the kernel that reference are:
>
> - Documentation/admin-guide/kdump/gdbmacros.txt
> - Documentation/admin-guide/kdump/vmcoreinfo.rst
> - scripts/gdb/linux/dmesg.py
>
> I believe this commit removed log_first_idx, so all of the above
> probably need to be fixed up, too.

Thanks for pointing this out! I will get to work on a patch for this.

John Ogness
Re: [PATCH 2/4] printk: store instead of processing cont parts
On 2020-07-21, Sergey Senozhatsky wrote:
>> That said, we have traditionally used not just "current process", but
>> also "last irq-level" as the context information, so I do think it
>> would be good to continue to do that.
>
> OK, so basically, extending printk_caller_id() so that for IRQ/NMI
> we will have more info than just "0x80000000 + raw_smp_processor_id()".

If bit31 is set, the upper 8 bits could specify what the lower 24 bits
represent. That would give some freedom for the future. For example:

0x80 = cpu id (generic context)
0x81 = interrupt number
0x82 = cpu id (nmi context)

Or maybe ascii should be used instead?

0x80 | '\0' = cpu id (generic context)
0x80 | 'i' = interrupt number
0x80 | 'n' = cpu id (nmi context)

Just an idea.

John Ogness
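
Editorial sketch: the encoding floated above (bit 31 set, upper 8 bits naming the context type, lower 24 bits carrying the value) is easy to express with shifts and masks. The C fragment below uses the numeric variant (0x80, 0x81, 0x82) from the message. The enum and function names are hypothetical and this is only an illustration of the idea, not code from any posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Context types taken from the numeric example in the message. */
enum sk_ctx {
	SK_CTX_CPU = 0x80,	/* cpu id (generic context) */
	SK_CTX_IRQ = 0x81,	/* interrupt number */
	SK_CTX_NMI = 0x82,	/* cpu id (nmi context) */
};

/* Pack the type into the upper 8 bits and the value into the lower 24. */
static uint32_t sk_caller_id_encode(enum sk_ctx type, uint32_t value)
{
	return ((uint32_t)type << 24) | (value & 0x00ffffffu);
}

/* With bit 31 clear, the ID is a plain task PID as before. */
static bool sk_caller_id_is_task(uint32_t id)
{
	return !(id & 0x80000000u);
}

static unsigned int sk_caller_id_type(uint32_t id)
{
	return id >> 24;
}

static uint32_t sk_caller_id_value(uint32_t id)
{
	return id & 0x00ffffffu;
}
```

Every type value in this scheme has its top bit set, so existing consumers that only test bit 31 to tell tasks from non-task contexts would keep working.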
[PATCH v2][next] printk: ringbuffer: support dataless records
With commit ("printk: use the lockless ringbuffer"), printk() started silently dropping messages without text because such records are not supported by the new printk ringbuffer. Add support for such records. Currently dataless records are denoted by INVALID_LPOS in order to recognize failed prb_reserve() calls. Change the ringbuffer to instead use two different identifiers (FAILED_LPOS and NO_LPOS) to distinguish between failed prb_reserve() records and successful dataless records, respectively. Fixes: ("printk: use the lockless ringbuffer") Fixes: https://lkml.kernel.org/r/20200718121053.ga691...@elver.google.com Reported-by: Marco Elver Signed-off-by: John Ogness --- based on next-20200721 chages since v1: - Instead of handling empty text messages as special case errors, allow such messages to be handled as any other valid messages. This also allows the empty text message to be counted as a line. kernel/printk/printk_ringbuffer.c | 72 +++ kernel/printk/printk_ringbuffer.h | 15 --- 2 files changed, 43 insertions(+), 44 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 7355ca99e852..0659b50872b5 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -264,6 +264,9 @@ /* Determine how many times the data array has wrapped. */ #define DATA_WRAPS(data_ring, lpos)((lpos) >> (data_ring)->size_bits) +/* Determine if a logical position refers to a data-less block. */ +#define LPOS_DATALESS(lpos)((lpos) & 1UL) + /* Get the logical position at index 0 of the current wrap. */ #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \ ((lpos) & ~DATA_SIZE_MASK(data_ring)) @@ -320,21 +323,13 @@ static unsigned int to_blk_size(unsigned int size) * block does not exceed the maximum possible size that could fit within the * ringbuffer. This function provides that basic size check so that the * assumption is safe. - * - * Writers are also not allowed to write 0-sized (data-less) records. 
Such - * records are used only internally by the ringbuffer. */ static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size) { struct prb_data_block *db = NULL; - /* -* Writers are not allowed to write data-less records. Such records -* are used only internally by the ringbuffer to denote records where -* their data failed to allocate or have been lost. -*/ if (size == 0) - return false; + return true; /* * Ensure the alignment padded size could possibly fit in the data @@ -568,8 +563,8 @@ static bool data_push_tail(struct printk_ringbuffer *rb, unsigned long tail_lpos; unsigned long next_lpos; - /* If @lpos is not valid, there is nothing to do. */ - if (lpos == INVALID_LPOS) + /* If @lpos is from a data-less block, there is nothing to do. */ + if (LPOS_DATALESS(lpos)) return true; /* @@ -962,8 +957,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (size == 0) { /* Specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = NO_LPOS; + blk_lpos->next = NO_LPOS; return NULL; } @@ -976,8 +971,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) { /* Failed to allocate, specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = FAILED_LPOS; + blk_lpos->next = FAILED_LPOS; return NULL; } @@ -1025,6 +1020,10 @@ static char *data_alloc(struct printk_ringbuffer *rb, static unsigned int space_used(struct prb_data_ring *data_ring, struct prb_data_blk_lpos *blk_lpos) { + /* Data-less blocks take no space. */ + if (LPOS_DATALESS(blk_lpos->begin)) + return 0; + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) { /* Data block does not wrap. 
*/ return (DATA_INDEX(data_ring, blk_lpos->next) - @@ -1080,11 +1079,8 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, if (!data_check_size(&rb->text_data_ring, r->text_buf_size)) goto fail; - /* Records are allowed to not have dictionaries. */ - if (r->dict_buf_size) { - if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size)) - goto fail; -
Re: [PATCH][next] printk: ringbuffer: support dataless records
On 2020-07-21, Sergey Senozhatsky wrote:
>> @@ -1402,7 +1396,9 @@ static int prb_read(struct printk_ringbuffer *rb, u64 seq,
>>  	/* Copy text data. If it fails, this is a data-less record. */
>>  	if (!copy_data(&rb->text_data_ring, &desc.text_blk_lpos, desc.info.text_len,
>>  		       r->text_buf, r->text_buf_size, line_count)) {
>> -		return -ENOENT;
>> +		/* Report an error if there should have been data. */
>> +		if (desc.info.text_len != 0)
>> +			return -ENOENT;
>>  	}
>
> If this is a dataless record then should copy_data() return error?

You are correct. That makes more sense. I will send a v2.

John Ogness
Re: [PATCH 1/4] printk: ringbuffer: support dataless records
On 2020-07-18, John Ogness wrote:
> In order to support storage of continuous lines, dataless records must
> be allowed. For example, these are generated with the legal calls:
>
>     pr_info("");
>     pr_cont("\n");
>
> Currently dataless records are denoted by INVALID_LPOS in order to
> recognize failed prb_reserve() calls. Change the code to use two
> different identifiers (FAILED_LPOS and NO_LPOS) to distinguish
> between failed prb_reserve() records and successful dataless records.

This patch has been re-posted [0] as a regression fix for the first
series that is already in linux-next. Only the commit message has been
changed to reflect the regression fix rather than preparing for
continuous line support.

Assuming that patch is accepted, this one should be dropped.

John Ogness

[0] https://lkml.kernel.org/r/20200720140111.19935-1-john.ogn...@linutronix.de
[PATCH][next] printk: ringbuffer: support dataless records
With commit ("printk: use the lockless ringbuffer"), printk() started silently dropping messages without text because such records are not supported by the new printk ringbuffer. Add support for such records. Currently dataless records are denoted by INVALID_LPOS in order to recognize failed prb_reserve() calls. Change the ringbuffer to instead use two different identifiers (FAILED_LPOS and NO_LPOS) to distinguish between failed prb_reserve() records and successful dataless records, respectively. Fixes: ("printk: use the lockless ringbuffer") Fixes: https://lkml.kernel.org/r/20200718121053.ga691...@elver.google.com Signed-off-by: John Ogness --- based on next-20200720 kernel/printk/printk_ringbuffer.c | 58 ++- kernel/printk/printk_ringbuffer.h | 15 2 files changed, 35 insertions(+), 38 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 7355ca99e852..54b0a6324dbf 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -264,6 +264,9 @@ /* Determine how many times the data array has wrapped. */ #define DATA_WRAPS(data_ring, lpos)((lpos) >> (data_ring)->size_bits) +/* Determine if a logical position refers to a data-less block. */ +#define LPOS_DATALESS(lpos)((lpos) & 1UL) + /* Get the logical position at index 0 of the current wrap. */ #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \ ((lpos) & ~DATA_SIZE_MASK(data_ring)) @@ -320,21 +323,13 @@ static unsigned int to_blk_size(unsigned int size) * block does not exceed the maximum possible size that could fit within the * ringbuffer. This function provides that basic size check so that the * assumption is safe. - * - * Writers are also not allowed to write 0-sized (data-less) records. Such - * records are used only internally by the ringbuffer. */ static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size) { struct prb_data_block *db = NULL; - /* -* Writers are not allowed to write data-less records. 
Such records -* are used only internally by the ringbuffer to denote records where -* their data failed to allocate or have been lost. -*/ if (size == 0) - return false; + return true; /* * Ensure the alignment padded size could possibly fit in the data @@ -568,8 +563,8 @@ static bool data_push_tail(struct printk_ringbuffer *rb, unsigned long tail_lpos; unsigned long next_lpos; - /* If @lpos is not valid, there is nothing to do. */ - if (lpos == INVALID_LPOS) + /* If @lpos is from a data-less block, there is nothing to do. */ + if (LPOS_DATALESS(lpos)) return true; /* @@ -962,8 +957,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (size == 0) { /* Specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = NO_LPOS; + blk_lpos->next = NO_LPOS; return NULL; } @@ -976,8 +971,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) { /* Failed to allocate, specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = FAILED_LPOS; + blk_lpos->next = FAILED_LPOS; return NULL; } @@ -1025,6 +1020,10 @@ static char *data_alloc(struct printk_ringbuffer *rb, static unsigned int space_used(struct prb_data_ring *data_ring, struct prb_data_blk_lpos *blk_lpos) { + /* Data-less blocks take no space. */ + if (LPOS_DATALESS(blk_lpos->begin)) + return 0; + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) { /* Data block does not wrap. */ return (DATA_INDEX(data_ring, blk_lpos->next) - @@ -1080,11 +1079,8 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, if (!data_check_size(&rb->text_data_ring, r->text_buf_size)) goto fail; - /* Records are allowed to not have dictionaries. 
*/ - if (r->dict_buf_size) { - if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size)) - goto fail; - } + if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size)) + goto fail; /* * Descriptors in the reserved state act as blockers to all further @@ -1212,10 +1208,8 @@ static char *get_data(struct
Re: [PATCH v5 4/4] printk: use the lockless ringbuffer
On 2020-07-18, Marco Elver wrote:
> It seems this causes a regression observed at least with newline-only
> printks.
> [...]
> -- >8 --
>
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1039,6 +1039,10 @@ asmlinkage __visible void __init start_kernel(void)
>  	sfi_init_late();
>  	kcsan_init();
>
> +	pr_info("EXPECT BLANK LINE --vv\n");
> +	pr_info("\n");
> +	pr_info("EXPECT BLANK LINE --^^\n");
> +
>  	/* Do the rest non-__init'ed, we're now alive */
>  	arch_call_rest_init();

Thanks for the example. This is an unintentional regression in the
series. I will submit a patch to fix this.

Note that this regression does not exist when the followup series [0]
(reimplementing LOG_CONT) is applied. All the more reason that the 1st
series should be fixed before pushing the 2nd series to linux-next.

John Ogness

[0] https://lkml.kernel.org/r/20200717234818.8622-1-john.ogn...@linutronix.de
Re: [PATCH 0/4] printk: reimplement LOG_CONT handling
On 2020-07-17, Linus Torvalds wrote:
> Make sure you test the case of "fast concurrent readers". The last
> time we did things like this, it was a disaster, because a concurrent
> reader would see and return the _incomplete_ line, and the next entry
> was still being generated on another CPU.
>
> The reader would then decide to return that incomplete line, because
> it had something.
>
> And while in theory this could then be handled properly in user space,
> in practice it wasn't. So you'd see a lot of logging tools that would
> then report all those continuations as separate log events.
>
> Which is the whole point of LOG_CONT - for that *not* to happen.

I expect this is handled correctly since the reader is not given any
parts until a full line is ready, but I will put more focus on testing
this to make sure. Thanks for the regression and testing tips.

> So this is just a heads-up that I will not pull something that breaks
> LOG_CONT because it thinks "user space can handle it". No. User space
> does not handle it, and we need to handle it for the user.

Understood. Petr and Sergey are also strict about this. We are making a
serious effort to avoid breaking things for userspace.

John Ogness
[PATCH 1/4] printk: ringbuffer: support dataless records
In order to support storage of continuous lines, dataless records must be allowed. For example, these are generated with the legal calls: pr_info(""); pr_cont("\n"); Currently dataless records are denoted by INVALID_LPOS in order to recognize failed prb_reserve() calls. Change the code to use two different identifiers (FAILED_LPOS and NO_LPOS) to distinguish between failed prb_reserve() records and successful dataless records. Signed-off-by: John Ogness --- kernel/printk/printk_ringbuffer.c | 58 ++- kernel/printk/printk_ringbuffer.h | 15 2 files changed, 35 insertions(+), 38 deletions(-) diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c index 7355ca99e852..54b0a6324dbf 100644 --- a/kernel/printk/printk_ringbuffer.c +++ b/kernel/printk/printk_ringbuffer.c @@ -264,6 +264,9 @@ /* Determine how many times the data array has wrapped. */ #define DATA_WRAPS(data_ring, lpos)((lpos) >> (data_ring)->size_bits) +/* Determine if a logical position refers to a data-less block. */ +#define LPOS_DATALESS(lpos)((lpos) & 1UL) + /* Get the logical position at index 0 of the current wrap. */ #define DATA_THIS_WRAP_START_LPOS(data_ring, lpos) \ ((lpos) & ~DATA_SIZE_MASK(data_ring)) @@ -320,21 +323,13 @@ static unsigned int to_blk_size(unsigned int size) * block does not exceed the maximum possible size that could fit within the * ringbuffer. This function provides that basic size check so that the * assumption is safe. - * - * Writers are also not allowed to write 0-sized (data-less) records. Such - * records are used only internally by the ringbuffer. */ static bool data_check_size(struct prb_data_ring *data_ring, unsigned int size) { struct prb_data_block *db = NULL; - /* -* Writers are not allowed to write data-less records. Such records -* are used only internally by the ringbuffer to denote records where -* their data failed to allocate or have been lost. 
-*/ if (size == 0) - return false; + return true; /* * Ensure the alignment padded size could possibly fit in the data @@ -568,8 +563,8 @@ static bool data_push_tail(struct printk_ringbuffer *rb, unsigned long tail_lpos; unsigned long next_lpos; - /* If @lpos is not valid, there is nothing to do. */ - if (lpos == INVALID_LPOS) + /* If @lpos is from a data-less block, there is nothing to do. */ + if (LPOS_DATALESS(lpos)) return true; /* @@ -962,8 +957,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (size == 0) { /* Specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = NO_LPOS; + blk_lpos->next = NO_LPOS; return NULL; } @@ -976,8 +971,8 @@ static char *data_alloc(struct printk_ringbuffer *rb, if (!data_push_tail(rb, data_ring, next_lpos - DATA_SIZE(data_ring))) { /* Failed to allocate, specify a data-less block. */ - blk_lpos->begin = INVALID_LPOS; - blk_lpos->next = INVALID_LPOS; + blk_lpos->begin = FAILED_LPOS; + blk_lpos->next = FAILED_LPOS; return NULL; } @@ -1025,6 +1020,10 @@ static char *data_alloc(struct printk_ringbuffer *rb, static unsigned int space_used(struct prb_data_ring *data_ring, struct prb_data_blk_lpos *blk_lpos) { + /* Data-less blocks take no space. */ + if (LPOS_DATALESS(blk_lpos->begin)) + return 0; + if (DATA_WRAPS(data_ring, blk_lpos->begin) == DATA_WRAPS(data_ring, blk_lpos->next)) { /* Data block does not wrap. */ return (DATA_INDEX(data_ring, blk_lpos->next) - @@ -1080,11 +1079,8 @@ bool prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb, if (!data_check_size(&rb->text_data_ring, r->text_buf_size)) goto fail; - /* Records are allowed to not have dictionaries. 
*/ - if (r->dict_buf_size) { - if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size)) - goto fail; - } + if (!data_check_size(&rb->dict_data_ring, r->dict_buf_size)) + goto fail; /* * Descriptors in the reserved state act as blockers to all further @@ -1212,10 +1208,8 @@ static char *get_data(struct prb_data_ring *data_ring, struct prb_data_block *db; /* Data-less data block description. */ - if (blk_lpos->begin == INVALID_LPOS && - blk_lpos->next ==
[PATCH 2/4] printk: store instead of processing cont parts
Instead of buffering continuous line parts before storing the full line into the ringbuffer, store each part as its own record. Signed-off-by: John Ogness --- kernel/printk/printk.c | 114 - 1 file changed, 11 insertions(+), 103 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index fec71229169e..c4274c867771 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -495,9 +495,14 @@ static void truncate_msg(u16 *text_len, u16 *trunc_msg_len) *trunc_msg_len = 0; } +static inline u32 printk_caller_id(void) +{ + return in_task() ? task_pid_nr(current) : + 0x8000 + raw_smp_processor_id(); +} + /* insert record into the buffer, discard old ones, update heads */ -static int log_store(u32 caller_id, int facility, int level, -enum log_flags flags, u64 ts_nsec, +static int log_store(int facility, int level, enum log_flags flags, const char *dict, u16 dict_len, const char *text, u16 text_len) { @@ -525,11 +530,8 @@ static int log_store(u32 caller_id, int facility, int level, r.info->facility = facility; r.info->level = level & 7; r.info->flags = flags & 0x1f; - if (ts_nsec > 0) - r.info->ts_nsec = ts_nsec; - else - r.info->ts_nsec = local_clock(); - r.info->caller_id = caller_id; + r.info->ts_nsec = local_clock(); + r.info->caller_id = printk_caller_id(); /* insert message */ prb_commit(&e); @@ -1874,100 +1876,6 @@ static inline void printk_delay(void) } } -static inline u32 printk_caller_id(void) -{ - return in_task() ? task_pid_nr(current) : - 0x8000 + raw_smp_processor_id(); -} - -/* - * Continuation lines are buffered, and not committed to the record buffer - * until the line is complete, or a race forces it. The line fragments - * though, are printed immediately to the consoles to ensure everything has - * reached the console in case of a kernel crash. 
- */ -static struct cont { - char buf[LOG_LINE_MAX]; - size_t len; /* length == 0 means unused buffer */ - u32 caller_id; /* printk_caller_id() of first print */ - u64 ts_nsec;/* time of first print */ - u8 level; /* log level of first message */ - u8 facility;/* log facility of first message */ - enum log_flags flags; /* prefix, newline flags */ -} cont; - -static void cont_flush(void) -{ - if (cont.len == 0) - return; - - log_store(cont.caller_id, cont.facility, cont.level, cont.flags, - cont.ts_nsec, NULL, 0, cont.buf, cont.len); - cont.len = 0; -} - -static bool cont_add(u32 caller_id, int facility, int level, -enum log_flags flags, const char *text, size_t len) -{ - /* If the line gets too long, split it up in separate records. */ - if (cont.len + len > sizeof(cont.buf)) { - cont_flush(); - return false; - } - - if (!cont.len) { - cont.facility = facility; - cont.level = level; - cont.caller_id = caller_id; - cont.ts_nsec = local_clock(); - cont.flags = flags; - } - - memcpy(cont.buf + cont.len, text, len); - cont.len += len; - - // The original flags come from the first line, - // but later continuations can add a newline. - if (flags & LOG_NEWLINE) { - cont.flags |= LOG_NEWLINE; - cont_flush(); - } - - return true; -} - -static size_t log_output(int facility, int level, enum log_flags lflags, const char *dict, size_t dictlen, char *text, size_t text_len) -{ - const u32 caller_id = printk_caller_id(); - - /* -* If an earlier line was buffered, and we're a continuation -* write from the same context, try to add it to the buffer. 
-*/ - if (cont.len) { - if (cont.caller_id == caller_id && (lflags & LOG_CONT)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - /* Otherwise, make sure it's flushed */ - cont_flush(); - } - - /* Skip empty continuation lines that couldn't be added - they just flush */ - if (!text_len && (lflags & LOG_CONT)) - return 0; - - /* If it doesn't end in a newline, try to buffer the current line */ - if (!(lflags & LOG_NEWLINE)) { - if (cont_add(caller_id, facility, level, lflags, text, text_len)) - return text_len; - } - - /
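For readers of this archive: the printk_caller_id() helper that this patch moves packs either a task PID or a CPU number into a single u32. Below is a minimal userspace sketch of that encoding and its decoding. It assumes the flag bit is 0x80000000 (the constant appears shortened in the flattened diff above), and the names encode_caller_id(), caller_id_is_cpu() and caller_id_value() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bit: set when the caller was not in task context. */
#define CALLER_ID_CPU_FLAG 0x80000000u

/* Hypothetical userspace stand-in for printk_caller_id(). */
static uint32_t encode_caller_id(bool in_task, uint32_t pid, uint32_t cpu)
{
	/* Both values must leave the flag bit free. */
	assert(pid < CALLER_ID_CPU_FLAG && cpu < CALLER_ID_CPU_FLAG);
	return in_task ? pid : CALLER_ID_CPU_FLAG + cpu;
}

/* True if the ID names a CPU (non-task context) rather than a task. */
static bool caller_id_is_cpu(uint32_t id)
{
	return (id & CALLER_ID_CPU_FLAG) != 0;
}

/* The PID or CPU number without the flag bit. */
static uint32_t caller_id_value(uint32_t id)
{
	return id & ~CALLER_ID_CPU_FLAG;
}
```

With every part stored as its own record, this ID is what a reader would compare to decide whether two consecutive parts belong to the same line.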
[PATCH 3/4] printk: process cont records during reading
Readers of the printk ringbuffer can use the continuous line interface to read full lines. The interface buffers continuous line parts until the full line is available or that line was interrupted by a writer from another context. The continuous line interface automatically throws out partial lines if a reader jumps to older sequence numbers. If a reader jumps to higher sequence numbers, any cached partial lines are flushed. The continuous line interface is used by: - console printing - syslog - devkmsg devkmsg has the additional requirement that it must show a line for every sequence number if the corresponding continuous line record was not dropped. The continuous line interface supports this by allowing the reader to provide a printk_record struct that will be filled in with placeholder information (but no text) in case a full line is not yet available. Note that kmsg_dump does not use the continuous line interface. The continuous line interface discards dictionaries of continuous lines. Signed-off-by: John Ogness --- kernel/printk/printk.c | 455 + 1 file changed, 371 insertions(+), 84 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index c4274c867771..363ef290f313 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -657,6 +657,287 @@ static ssize_t msg_print_ext_body(char *buf, size_t size, return p - buf; } +/* + * Readers of the printk ringbuffer can use the continuous line interface + * to read full lines. The interface buffers continuous line parts until + * the full line is available or that line was interrupted by a writer + * from another context. + * + * The continuous line interface automatically throws out partial lines if a + * reader jumps to older sequence numbers. If a reader jumps to higher + * sequence numbers, any cached partial lines are flushed. 
+ * + * The continuous line interface is used by: + * + * - console printing + * - syslog + * - devkmsg + * + * devkmsg has the additional requirement that it must show a line for every + * sequence number if the corresponding continuous line record was not dropped. + * The continuous line interface supports this by allowing the reader to + * provide a printk_record struct that will be filled in with placeholder + * information (but no text) in case a full line is not yet available. + * + * Note that kmsg_dump does not use the continuous line interface. + * + * The continuous line interface discards dictionaries of continuous lines. + */ + +struct cont_record { + struct printk_recordr; + struct printk_info info; + chartext[LOG_LINE_MAX + PREFIX_MAX]; + boolset; +}; + +/* + * The continuous line buffer manager. + * + * @cr:record buffers for reading and caching continuous lines + * @dict: the dictionary used when reading a record + * @cache_ind: index of the cache record in @cr + * @begin_seq: the minimal sequence number of the current continuous line + * @end_seq: the maximal sequence number of the current continuous line + * @dropped: count of dropped records during the current continuous line + */ +struct cont { + struct cont_record cr[2]; + chardict[LOG_LINE_MAX]; + int cache_ind; + u64 begin_seq; + u64 end_seq; + unsigned long dropped; +}; + +/* + * Initialize the continuous line manager. As an alternative, it is also + * acceptable if the structure is set to all zeros. + */ +static void cont_init(struct cont *c, u64 seq) +{ + c->cr[0].set = false; + c->cr[1].set = false; + c->cache_ind = 0; + c->begin_seq = seq; + c->end_seq = seq; + c->dropped = 0; +} + +/* Get the continuous line cache, if one exists. 
*/ +static struct printk_record *cont_cache(struct cont *c) +{ + struct cont_record *cr = &c->cr[c->cache_ind]; + + if (!cr->set) + return NULL; + return &cr->r; +} + +/* + * Like cont_cache(), but also flushes the dropped count, clears the + * dictionary, and switches to the other record buffer for future caching. + */ +static struct printk_record *cont_flush(struct cont *c, unsigned long *dropped) +{ + struct cont_record *cr = &c->cr[c->cache_ind]; + + c->cache_ind ^= 1; + + if (!cr->set) + return NULL; + + if (dropped) + *dropped = c->dropped; + c->dropped = 0; + + c->begin_seq = cr->info.seq; + cr->info.dict_len = 0; + cr->set = false; + + return &cr->r; +} + +/* + * Wrapper for prb_read_valid() that reads a new record into the + * non-caching record buffer. + */ +static struct printk_record *cont_read(struct cont *c, u64 seq) +{ + struct cont_record *cr = &c->cr[c-
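The two-buffer scheme used by cont_flush() above (hand out the cached record, then flip cache_ind so new parts go to the other buffer) can be illustrated outside the kernel. This is only a sketch under simplifying assumptions: plain strings instead of printk records, and the names two_slot_cache, cache_add() and cache_flush() are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_TEXT_MAX 64

struct slot {
	char text[SLOT_TEXT_MAX];
	bool set;
};

struct two_slot_cache {
	struct slot s[2];
	int cache_ind;	/* index of the slot currently used as cache */
};

/* Append a fragment to the cache slot; returns false if it does not fit. */
static bool cache_add(struct two_slot_cache *c, const char *frag)
{
	struct slot *s = &c->s[c->cache_ind];
	size_t used = s->set ? strlen(s->text) : 0;
	size_t len;

	assert(frag != NULL);
	len = strlen(frag);
	if (used + len >= SLOT_TEXT_MAX)
		return false;
	memcpy(s->text + used, frag, len + 1);
	s->set = true;
	return true;
}

/* Hand out the cached line (or NULL) and switch to the other slot. */
static const char *cache_flush(struct two_slot_cache *c)
{
	struct slot *s = &c->s[c->cache_ind];

	c->cache_ind ^= 1;
	if (!s->set)
		return NULL;
	s->set = false;
	return s->text;
}
```

The point of flipping the index is that the pointer returned by the flush stays valid while the next line is already being gathered in the other slot.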
[PATCH 0/4] printk: reimplement LOG_CONT handling
Hello,

Here is the second series to rework the printk subsystem. This series
removes LOG_CONT handling from printk() callers, storing all LOG_CONT
parts individually in the ringbuffer. With this series, LOG_CONT
handling is moved to the ringbuffer readers that provide the record
contents to users (console printing, syslog, /dev/kmsg).

This change is necessary in order to support the upcoming move to a
fully lockless printk() implementation.

This series is in line with the agreements [0] made at the meeting
during LPC2019 in Lisbon, with 1 exception: For the /dev/kmsg
interface, empty line placeholder records are reported for the
LOG_CONT parts.

Using placeholders prevents tools such as systemd-journald from
erroneously reporting missed messages. However, it also means that
empty placeholder records are visible in systemd-journald logs and
displayed in tools such as dmesg.

The effect can be easily observed with the sysrq help:

    $ echo h | sudo tee /proc/sysrq-trigger
    $ sudo dmesg | tail -n 30
    $ sudo journalctl -k -n 30

Providing the placeholder entries allows a userspace tool to identify
if records were actually lost. IMHO this is an important feature. Its
side effect can be addressed by userspace tools if they change to
silently consume empty records.

Dump tools that process the ringbuffer directly (such as crash,
makedumpfile, kexec-tools) will need to implement LOG_CONT handling if
they want to present clean continuous line messages.

Finally, by moving LOG_CONT handling from writers to readers, some
incorrect pr_cont() usage is revealed. Patch 4 of this series
addresses one such example.

This series is based on the printk git tree [1] printk-rework branch.

[0] https://lkml.kernel.org/r/87k1acz5rx@linutronix.de
[1] https://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git (printk-rework branch)

John Ogness (4):
  printk: ringbuffer: support dataless records
  printk: store instead of processing cont parts
  printk: process cont records during reading
  ipconfig: cleanup printk usage

 kernel/printk/printk.c            | 569 --
 kernel/printk/printk_ringbuffer.c |  58 ++-
 kernel/printk/printk_ringbuffer.h |  15 +-
 net/ipv4/ipconfig.c               |  25 +-
 4 files changed, 434 insertions(+), 233 deletions(-)

--
2.20.1
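
The cover letter argues that placeholder records let a userspace tool tell real loss apart from LOG_CONT parts. A rough sketch of such a check follows, assuming a reader that has already parsed sequence numbers and text lengths out of /dev/kmsg. The struct layout and the function scan_records() are hypothetical and only illustrate the idea: gaps in the sequence numbers count as lost records, while empty records are consumed silently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one record as seen by a reader. */
struct seen_record {
	uint64_t seq;
	size_t text_len;	/* 0 for an empty placeholder record */
};

struct scan_result {
	uint64_t lost;		/* sequence numbers never seen */
	size_t shown;		/* records with text to display */
	size_t placeholders;	/* empty records consumed silently */
};

static struct scan_result scan_records(const struct seen_record *recs, size_t n)
{
	struct scan_result res = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (i > 0) {
			/* Records must arrive in increasing sequence order. */
			assert(recs[i].seq > recs[i - 1].seq);
			/* Any gap between neighbours means records were lost. */
			res.lost += recs[i].seq - recs[i - 1].seq - 1;
		}
		if (recs[i].text_len == 0)
			res.placeholders++;
		else
			res.shown++;
	}
	return res;
}
```

Without placeholders, the records for sequence numbers 11 and 12 in a run like 10, 11, 12, 13 would simply be absent and would be indistinguishable from loss.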
[PATCH 4/4] ipconfig: cleanup printk usage
The use of pr_info() and pr_cont() was not ordered correctly for
all cases. Order it so that all cases provide the expected output.

Signed-off-by: John Ogness
---
 net/ipv4/ipconfig.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 561f15b5a944..0f4bd7a59310 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1442,6 +1442,9 @@ static int __init ip_auto_config(void)
 #endif
 	int err;
 	unsigned int i;
+#ifndef IPCONFIG_SILENT
+	bool pr0;
+#endif
 
 	/* Initialise all name servers and NTP servers to NONE (but only if the
 	 * "ip=" or "nfsaddrs=" kernel command line parameters weren't decoded,
@@ -1575,31 +1578,37 @@ static int __init ip_auto_config(void)
 	if (ic_dev_mtu)
 		pr_cont(", mtu=%d", ic_dev_mtu);
 	/* Name servers (if any): */
+	pr0 = false;
 	for (i = 0; i < CONF_NAMESERVERS_MAX; i++) {
 		if (ic_nameservers[i] != NONE) {
-			if (i == 0)
+			if (!pr0) {
 				pr_info(" nameserver%u=%pI4",
 					i, &ic_nameservers[i]);
-			else
+				pr0 = true;
+			} else {
 				pr_cont(", nameserver%u=%pI4",
 					i, &ic_nameservers[i]);
+			}
 		}
-		if (i + 1 == CONF_NAMESERVERS_MAX)
-			pr_cont("\n");
 	}
+	if (pr0)
+		pr_cont("\n");
 	/* NTP servers (if any): */
+	pr0 = false;
 	for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) {
 		if (ic_ntp_servers[i] != NONE) {
-			if (i == 0)
+			if (!pr0) {
 				pr_info(" ntpserver%u=%pI4",
 					i, &ic_ntp_servers[i]);
-			else
+				pr0 = true;
+			} else {
 				pr_cont(", ntpserver%u=%pI4",
 					i, &ic_ntp_servers[i]);
+			}
 		}
-		if (i + 1 == CONF_NTP_SERVERS_MAX)
-			pr_cont("\n");
 	}
+	if (pr0)
+		pr_cont("\n");
 #endif /* !SILENT */
 
 	/*
--
2.20.1
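
The ordering problem fixed here is easy to reproduce in userspace. With the old "i == 0" test, a list whose first slot is unset would start with a continuation and never print a line start, and the trailing newline was emitted even when nothing had been printed. The sketch below shows the corrected pattern with a "printed" flag that plays the role of pr0. It formats into a buffer instead of calling pr_info()/pr_cont(), and NONE as well as format_servers() are stand-ins invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NONE 0u	/* hypothetical sentinel for an unset entry */

/*
 * Format the set entries of @vals as "name1=7, name2=9\n" into @buf.
 * Returns the number of characters stored (0 if no entry is set).
 */
static size_t format_servers(char *buf, size_t size, const char *name,
			     const unsigned int *vals, size_t n)
{
	bool printed = false;	/* plays the role of pr0 in the patch */
	size_t len = 0;
	size_t i;

	assert(size > 0);
	buf[0] = '\0';

	for (i = 0; i < n && len < size; i++) {
		if (vals[i] == NONE)
			continue;
		/* The first set entry starts the line, later ones continue it. */
		len += snprintf(buf + len, size - len, "%s%s%zu=%u",
				printed ? ", " : "", name, i, vals[i]);
		printed = true;
	}
	/* Terminate the line only if something was printed at all. */
	if (printed && len < size)
		len += snprintf(buf + len, size - len, "\n");

	/* snprintf() reports the untruncated length; clamp to what fits. */
	return len < size ? len : size - 1;
}
```

Keying the decision on whether anything has been printed yet, rather than on the loop index, is what makes the output correct for sparse lists.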
[PATCH v4 4/4] printk: use the lockless ringbuffer
Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change is purely replacing the underlying ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
  human-readable messages. The dictionary data buffer is set to the
  same size as the message buffer. Therefore, the total required
  memory for both dictionary and message data is
  2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
  2 * log_buf_len (the kernel parameter) for the dynamic buffers.

- Record meta-data is now stored in a separate array of descriptors.
  This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
  for the static array and 72 * (log_buf_len >> 5) bytes for the
  dynamic array.

Signed-off-by: John Ogness
---
 kernel/printk/printk.c | 940 +
 1 file changed, 493 insertions(+), 447 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 1b41e1b98221..4c6b4e68ad07 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -55,6 +55,7 @@ #define CREATE_TRACE_POINTS #include +#include "printk_ringbuffer.h" #include "console_cmdline.h" #include "braille.h" #include "internal.h" @@ -294,30 +295,24 @@ enum con_msg_format_flags { static int console_msg_format = MSG_FORMAT_DEFAULT; /* - * The printk log buffer consists of a chain of concatenated variable - * length records. Every record starts with a record header, containing - * the overall length of the record. + * The printk log buffer consists of a sequenced collection of records, each + * containing variable length message and dictionary text. Every record + * also contains its own meta-data (@info). 
* - * The heads to the first and last entry in the buffer, as well as the - * sequence numbers of these entries are maintained when messages are - * stored. + * Every record meta-data carries the timestamp in microseconds, as well as + * the standard userspace syslog level and syslog facility. The usual kernel + * messages use LOG_KERN; userspace-injected messages always carry a matching + * syslog facility, by default LOG_USER. The origin of every message can be + * reliably determined that way. * - * If the heads indicate available messages, the length in the header - * tells the start next message. A length == 0 for the next message - * indicates a wrap-around to the beginning of the buffer. + * The human readable log message of a record is available in @text, the + * length of the message text in @text_len. The stored message is not + * terminated. * - * Every record carries the monotonic timestamp in microseconds, as well as - * the standard userspace syslog level and syslog facility. The usual - * kernel messages use LOG_KERN; userspace-injected messages always carry - * a matching syslog facility, by default LOG_USER. The origin of every - * message can be reliably determined that way. - * - * The human readable log message directly follows the message header. The - * length of the message text is stored in the header, the stored message - * is not terminated. - * - * Optionally, a message can carry a dictionary of properties (key/value pairs), - * to provide userspace with a machine-readable message context. + * Optionally, a record can carry a dictionary of properties (key/value + * pairs), to provide userspace with a machine-readable message context. The + * length of the dictionary is available in @dict_len. The dictionary is not + * terminated. 
* * Examples for well-defined, commonly used property names are: * DEVICE=b12:8 device identifier @@ -331,21 +326,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT; * follows directly after a '=' character. Every property is terminated by * a '\0' character. The last property is not terminated. * - * Example of a message structure: - * ff 8f 00 00 00 00 00 00 monotonic time in nsec - * 0008 34 00record is 52 bytes long - * 000a0b 00 text is 11 bytes long - * 000c 1f 00dictionary is 23 bytes long - * 000e03 00 LOG_KERN (facility) LOG_ERR (level) - * 0010 69 74 27 73 20 61 20 6c "it's a l" - * 69 6e 65 "ine" - * 001b 44 45 56 49 43 "DEVIC" - * 45 3d 62 38 3a 32 00 44 "E=b8:2\0D" - * 52 49 56 45 52 3d 62 75 "RIVER=bu" - * 67 "g" - * 0032 00 00 00 padding to next message header - * - * The 'struct printk_log
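
The memory figures given in the commit message of this patch can be checked with a few lines of arithmetic. The sketch below computes them for a given CONFIG_LOG_BUF_SHIFT. The 72-byte descriptor size is taken from the commit message, while the function names data_bytes() and desc_bytes() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DESC_SIZE 72	/* bytes per descriptor, as stated in the commit message */

/* Text plus dictionary data buffers: 2 * 2^shift bytes. */
static size_t data_bytes(unsigned int shift)
{
	return 2 * ((size_t)1 << shift);
}

/* Descriptor array: 72 * 2^(shift - 5) bytes. */
static size_t desc_bytes(unsigned int shift)
{
	assert(shift >= 5);
	return DESC_SIZE * ((size_t)1 << (shift - 5));
}
```

For example, a shift of 17 gives 262144 bytes for the two data buffers and 294912 bytes for the descriptor array, so in that configuration the descriptors take slightly more space than the data they describe.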
[PATCH v4 2/4] printk: add lockless ringbuffer
Introduce a multi-reader multi-writer lockless ringbuffer for storing the kernel log messages. Readers and writers may use their API from any context (including scheduler and NMI). This ringbuffer will make it possible to decouple printk() callers from any context, locking, or console constraints. It also makes it possible for readers to have full access to the ringbuffer contents at any time and context (for example from any panic situation). The printk_ringbuffer is made up of 3 internal ringbuffers: desc_ring: A ring of descriptors. A descriptor contains all record meta data (sequence number, timestamp, loglevel, etc.) as well as internal state information about the record and logical positions specifying where in the other ringbuffers the text and dictionary strings are located. text_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the text string of the record. dict_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the dictionary string of the record. The internal state information of a descriptor is the key element to allow readers and writers to locklessly synchronize access to the data. Signed-off-by: John Ogness Co-developed-by: Petr Mladek Reviewed-by: Petr Mladek Reviewed-by: Paul E. 
McKenney --- kernel/printk/Makefile|1 + kernel/printk/printk_ringbuffer.c | 1676 + kernel/printk/printk_ringbuffer.h | 399 +++ 3 files changed, 2076 insertions(+) create mode 100644 kernel/printk/printk_ringbuffer.c create mode 100644 kernel/printk/printk_ringbuffer.h diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile index 4d052fc6bcde..eee3dc9b60a9 100644 --- a/kernel/printk/Makefile +++ b/kernel/printk/Makefile @@ -2,3 +2,4 @@ obj-y = printk.o obj-$(CONFIG_PRINTK) += printk_safe.o obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o +obj-$(CONFIG_PRINTK) += printk_ringbuffer.o diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c new file mode 100644 index ..f4a670f7289d --- /dev/null +++ b/kernel/printk/printk_ringbuffer.c @@ -0,0 +1,1676 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include "printk_ringbuffer.h" + +/** + * DOC: printk_ringbuffer overview + * + * Data Structure + * -- + * The printk_ringbuffer is made up of 3 internal ringbuffers: + * + * desc_ring + * A ring of descriptors. A descriptor contains all record meta data + * (sequence number, timestamp, loglevel, etc.) as well as internal state + * information about the record and logical positions specifying where in + * the other ringbuffers the text and dictionary strings are located. + * + * text_data_ring + * A ring of data blocks. A data block consists of an unsigned long + * integer (ID) that maps to a desc_ring index followed by the text + * string of the record. + * + * dict_data_ring + * A ring of data blocks. A data block consists of an unsigned long + * integer (ID) that maps to a desc_ring index followed by the dictionary + * string of the record. + * + * The internal state information of a descriptor is the key element to allow + * readers and writers to locklessly synchronize access to the data. 
+ * + * Implementation + * -- + * + * Descriptor Ring + * ~~~ + * The descriptor ring is an array of descriptors. A descriptor contains all + * the meta data of a printk record as well as blk_lpos structs pointing to + * associated text and dictionary data blocks (see "Data Rings" below). Each + * descriptor is assigned an ID that maps directly to index values of the + * descriptor array and has a state. The ID and the state are bitwise combined + * into a single descriptor field named @state_var, allowing ID and state to + * be synchronously and atomically updated. + * + * Descriptors have three states: + * + * reserved + * A writer is modifying the record. + * + * committed + * The record and all its data are complete and available for reading. + * + * reusable + * The record exists, but its text and/or dictionary data may no longer + * be available. + * + * Querying the @state_var of a record requires providing the ID of the + * descriptor to query. This can yield a possible fourth (pseudo) state: + * + * miss + * The descriptor being queried has an unexpected ID. + * + * The descriptor ring has a @tail_id that contains the ID of the oldest + * descriptor and @head_id that contains the ID of the newest descriptor. + * + * When a new descriptor should be created (and the ring is full), the tail + * descriptor is invalidated by first transitioning to the reusable state and + * then invalidating all tail data blocks up to and including the data blocks + * associated with the
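
The idea of combining ID and state into one @state_var, so that both change in a single atomic step, can be sketched with C11 atomics. The bit layout below (top two bits for state flags, remaining bits for the ID) is an assumption made for illustration, and query_state() and commit() are hypothetical helpers rather than the functions of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: top two bits hold state flags, the rest holds the ID. */
#define SV_COMMITTED	(UINT64_C(1) << 63)
#define SV_REUSE	(UINT64_C(1) << 62)
#define SV_FLAGS_MASK	(SV_COMMITTED | SV_REUSE)
#define SV_ID_MASK	(~SV_FLAGS_MASK)

enum sketch_state { ST_RESERVED, ST_COMMITTED, ST_REUSABLE, ST_MISS };

/* Query the state for @id; an unexpected ID yields the pseudo state. */
static enum sketch_state query_state(_Atomic uint64_t *state_var, uint64_t id)
{
	uint64_t sv = atomic_load(state_var);

	if ((sv & SV_ID_MASK) != id)
		return ST_MISS;
	if (sv & SV_REUSE)
		return ST_REUSABLE;
	if (sv & SV_COMMITTED)
		return ST_COMMITTED;
	return ST_RESERVED;
}

/* Move @id from reserved to committed in one atomic step. */
static bool commit(_Atomic uint64_t *state_var, uint64_t id)
{
	uint64_t expected = id;	/* reserved: ID with no flags set */

	assert((id & SV_FLAGS_MASK) == 0);
	return atomic_compare_exchange_strong(state_var, &expected,
					      id | SV_COMMITTED);
}
```

Because the compare-and-exchange checks the ID and the flags together, a writer holding a stale ID cannot commit a descriptor that has since been recycled for a newer record.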
[PATCH v5 2/4] printk: add lockless ringbuffer
Introduce a multi-reader multi-writer lockless ringbuffer for storing the kernel log messages. Readers and writers may use their API from any context (including scheduler and NMI). This ringbuffer will make it possible to decouple printk() callers from any context, locking, or console constraints. It also makes it possible for readers to have full access to the ringbuffer contents at any time and context (for example from any panic situation). The printk_ringbuffer is made up of 3 internal ringbuffers: desc_ring: A ring of descriptors. A descriptor contains all record meta data (sequence number, timestamp, loglevel, etc.) as well as internal state information about the record and logical positions specifying where in the other ringbuffers the text and dictionary strings are located. text_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the text string of the record. dict_data_ring: A ring of data blocks. A data block consists of an unsigned long integer (ID) that maps to a desc_ring index followed by the dictionary string of the record. The internal state information of a descriptor is the key element to allow readers and writers to locklessly synchronize access to the data. Signed-off-by: John Ogness Co-developed-by: Petr Mladek Reviewed-by: Petr Mladek Reviewed-by: Paul E. 
McKenney --- kernel/printk/Makefile|1 + kernel/printk/printk_ringbuffer.c | 1687 + kernel/printk/printk_ringbuffer.h | 399 +++ 3 files changed, 2087 insertions(+) create mode 100644 kernel/printk/printk_ringbuffer.c create mode 100644 kernel/printk/printk_ringbuffer.h diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile index 4d052fc6bcde..eee3dc9b60a9 100644 --- a/kernel/printk/Makefile +++ b/kernel/printk/Makefile @@ -2,3 +2,4 @@ obj-y = printk.o obj-$(CONFIG_PRINTK) += printk_safe.o obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o +obj-$(CONFIG_PRINTK) += printk_ringbuffer.o diff --git a/kernel/printk/printk_ringbuffer.c b/kernel/printk/printk_ringbuffer.c new file mode 100644 index ..7355ca99e852 --- /dev/null +++ b/kernel/printk/printk_ringbuffer.c @@ -0,0 +1,1687 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include "printk_ringbuffer.h" + +/** + * DOC: printk_ringbuffer overview + * + * Data Structure + * -- + * The printk_ringbuffer is made up of 3 internal ringbuffers: + * + * desc_ring + * A ring of descriptors. A descriptor contains all record meta data + * (sequence number, timestamp, loglevel, etc.) as well as internal state + * information about the record and logical positions specifying where in + * the other ringbuffers the text and dictionary strings are located. + * + * text_data_ring + * A ring of data blocks. A data block consists of an unsigned long + * integer (ID) that maps to a desc_ring index followed by the text + * string of the record. + * + * dict_data_ring + * A ring of data blocks. A data block consists of an unsigned long + * integer (ID) that maps to a desc_ring index followed by the dictionary + * string of the record. + * + * The internal state information of a descriptor is the key element to allow + * readers and writers to locklessly synchronize access to the data. 
+ * + * Implementation + * -- + * + * Descriptor Ring + * ~~~ + * The descriptor ring is an array of descriptors. A descriptor contains all + * the meta data of a printk record as well as blk_lpos structs pointing to + * associated text and dictionary data blocks (see "Data Rings" below). Each + * descriptor is assigned an ID that maps directly to index values of the + * descriptor array and has a state. The ID and the state are bitwise combined + * into a single descriptor field named @state_var, allowing ID and state to + * be synchronously and atomically updated. + * + * Descriptors have three states: + * + * reserved + * A writer is modifying the record. + * + * committed + * The record and all its data are complete and available for reading. + * + * reusable + * The record exists, but its text and/or dictionary data may no longer + * be available. + * + * Querying the @state_var of a record requires providing the ID of the + * descriptor to query. This can yield a possible fourth (pseudo) state: + * + * miss + * The descriptor being queried has an unexpected ID. + * + * The descriptor ring has a @tail_id that contains the ID of the oldest + * descriptor and @head_id that contains the ID of the newest descriptor. + * + * When a new descriptor should be created (and the ring is full), the tail + * descriptor is invalidated by first transitioning to the reusable state and + * then invalidating all tail data blocks up to and including the data blocks + * associated with the
[PATCH v5 4/4] printk: use the lockless ringbuffer
Replace the existing ringbuffer usage and implementation with
lockless ringbuffer usage. Even though the new ringbuffer does not
require locking, all existing locking is left in place. Therefore,
this change is purely replacing the underlying ringbuffer.

Changes that exist due to the ringbuffer replacement:

- The VMCOREINFO has been updated for the new structures.

- Dictionary data is now stored in a separate data buffer from the
  human-readable messages. The dictionary data buffer is set to the
  same size as the message buffer. Therefore, the total required
  memory for both dictionary and message data is
  2 * (2 ^ CONFIG_LOG_BUF_SHIFT) for the initial static buffers and
  2 * log_buf_len (the kernel parameter) for the dynamic buffers.

- Record meta-data is now stored in a separate array of descriptors.
  This is an additional 72 * (2 ^ (CONFIG_LOG_BUF_SHIFT - 5)) bytes
  for the static array and 72 * (log_buf_len >> 5) bytes for the
  dynamic array.

Signed-off-by: John Ogness
Reviewed-by: Petr Mladek
---
 kernel/printk/printk.c | 940 +
 1 file changed, 493 insertions(+), 447 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 1b41e1b98221..fec71229169e 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -55,6 +55,7 @@ #define CREATE_TRACE_POINTS #include +#include "printk_ringbuffer.h" #include "console_cmdline.h" #include "braille.h" #include "internal.h" @@ -294,30 +295,24 @@ enum con_msg_format_flags { static int console_msg_format = MSG_FORMAT_DEFAULT; /* - * The printk log buffer consists of a chain of concatenated variable - * length records. Every record starts with a record header, containing - * the overall length of the record. + * The printk log buffer consists of a sequenced collection of records, each + * containing variable length message and dictionary text. Every record + * also contains its own meta-data (@info). 
* - * The heads to the first and last entry in the buffer, as well as the - * sequence numbers of these entries are maintained when messages are - * stored. + * Every record meta-data carries the timestamp in microseconds, as well as + * the standard userspace syslog level and syslog facility. The usual kernel + * messages use LOG_KERN; userspace-injected messages always carry a matching + * syslog facility, by default LOG_USER. The origin of every message can be + * reliably determined that way. * - * If the heads indicate available messages, the length in the header - * tells the start next message. A length == 0 for the next message - * indicates a wrap-around to the beginning of the buffer. + * The human readable log message of a record is available in @text, the + * length of the message text in @text_len. The stored message is not + * terminated. * - * Every record carries the monotonic timestamp in microseconds, as well as - * the standard userspace syslog level and syslog facility. The usual - * kernel messages use LOG_KERN; userspace-injected messages always carry - * a matching syslog facility, by default LOG_USER. The origin of every - * message can be reliably determined that way. - * - * The human readable log message directly follows the message header. The - * length of the message text is stored in the header, the stored message - * is not terminated. - * - * Optionally, a message can carry a dictionary of properties (key/value pairs), - * to provide userspace with a machine-readable message context. + * Optionally, a record can carry a dictionary of properties (key/value + * pairs), to provide userspace with a machine-readable message context. The + * length of the dictionary is available in @dict_len. The dictionary is not + * terminated. 
* * Examples for well-defined, commonly used property names are: * DEVICE=b12:8 device identifier @@ -331,21 +326,19 @@ static int console_msg_format = MSG_FORMAT_DEFAULT; * follows directly after a '=' character. Every property is terminated by * a '\0' character. The last property is not terminated. * - * Example of a message structure: - * ff 8f 00 00 00 00 00 00 monotonic time in nsec - * 0008 34 00record is 52 bytes long - * 000a0b 00 text is 11 bytes long - * 000c 1f 00dictionary is 23 bytes long - * 000e03 00 LOG_KERN (facility) LOG_ERR (level) - * 0010 69 74 27 73 20 61 20 6c "it's a l" - * 69 6e 65 "ine" - * 001b 44 45 56 49 43 "DEVIC" - * 45 3d 62 38 3a 32 00 44 "E=b8:2\0D" - * 52 49 56 45 52 3d 62 75 "RIVER=bu" - * 67 "g" - * 0032 00 00 00 padding to next message header - * - * The 'stru
Re: [PATCH v4 0/4] printk: replace ringbuffer
On 2020-07-10, Petr Mladek wrote:
>> The next series in the printk-rework (move LOG_CONT handling from
>> writers to readers) makes some further changes that, while not
>> incompatible, could affect the output of existing tools. It may be a
>> good idea to let the new ringbuffer sit in linux-next until the next
>> series has been discussed/reviewed/merged. After the next series,
>> everything will be in place (with regard to userspace tools) to
>> finish the rework.
>
> I know that it might be premature question. But I wonder what kind
> of changes are expected because of the continuous lines.

I will be posting the next series quite soon, so I think it will be
better to discuss it when we have a working example in front of us.

> Do you expect some changes in the ring buffer structures so that
> the debugging tools would need yet another update to actually
> access the data?

The next series will be modifying the ringbuffer to allow data-less
records. This is necessary to support the thousands of

    pr_cont("\n");

calls in the kernel code. Failed data ring allocations will still be
detected because the message flags for those records will be 0. For
the above pr_cont() line, they will be LOG_NEWLINE|LOG_CONT.

Since the dump tools need to make changes for the new ringbuffer
anyway, I think it would be good to hammer out the accepted LOG_CONT
implementation first, just in case we do need to make any subtle
internal changes.

> Or do you expect backward compatible changes that would allow
> to pass related parts of the continuous lines via syslog/dev_kmsg
> interface and join them later in userspace?

For users of console, non-extended netconsole, syslog, and kmsg_dump,
there will be no external changes whatsoever. These interfaces have no
awareness of sequence numbers, which will allow the kernel to
re-assemble the LOG_CONT messages for them.

Users of /dev/kmsg and extended netconsole see sequence numbers.
Off-list we discussed various hacks to get around this without causing
errors for existing software, but it was all ugly.

IMHO users of these sequence number interfaces need to see all the
records individually and reassemble the LOG_CONT messages themselves
if they want to. I believe that is the only sane path forward. To do
this, the caller id will no longer be optional in the sequence number
output since that is vital information to re-assemble the LOG_CONT
messages.

Keep in mind that current software already needs to be able to handle
the caller id being shown. Also, currently in mainline there is no
guarantee that LOG_CONT messages are contiguous. So current software
must also be ready to accept broken up LOG_CONT messages.

This is why I think it would be acceptable to make this change for
/dev/kmsg and extended netconsole. But I understand it is
controversial since tools like systemd and dmesg use /dev/kmsg. Until
they are modified to re-assemble LOG_CONT messages, they will present
the user with the ugliness of LOG_CONT pieces (always, rather than
only rarely, as is the case now).

John Ogness
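
For illustration, here is a rough sketch of what "reassemble the LOG_CONT messages themselves" could look like in a userspace reader, using the caller id to decide whether a piece continues the pending line. Everything in it is an assumption made for the example: the flag values, the fixed-size buffers, and the names struct piece, struct assembler, emit_pending() and feed(). It is not taken from any existing tool.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values chosen for this sketch. */
#define F_NEWLINE 0x2
#define F_CONT    0x8

#define LINE_MAX_LEN 128
#define MAX_LINES 8

struct piece {
	uint32_t caller_id;
	unsigned int flags;
	const char *text;
};

struct assembler {
	char pending[LINE_MAX_LEN];	/* line being gathered */
	uint32_t caller_id;		/* caller of the pending line */
	int have_pending;
	char lines[MAX_LINES][LINE_MAX_LEN];	/* finished lines */
	int nlines;
};

/* Move the pending line, if any, to the list of finished lines. */
static void emit_pending(struct assembler *a)
{
	if (!a->have_pending)
		return;
	assert(a->nlines < MAX_LINES);
	strcpy(a->lines[a->nlines++], a->pending);
	a->have_pending = 0;
}

/* Process one record in sequence order. */
static void feed(struct assembler *a, const struct piece *p)
{
	int joins = a->have_pending && (p->flags & F_CONT) &&
		    p->caller_id == a->caller_id;

	if (!joins) {
		/* Interrupted or new line: emit what was gathered so far. */
		emit_pending(a);
		a->pending[0] = '\0';
		a->caller_id = p->caller_id;
		a->have_pending = 1;
	}
	/* Append, silently truncating over-long lines. */
	strncat(a->pending, p->text, LINE_MAX_LEN - strlen(a->pending) - 1);

	if (p->flags & F_NEWLINE)
		emit_pending(a);
}
```

A piece from another caller flushes the partial line, so an interrupted continuation shows up as separate lines, which is the broken-up case that current software already has to accept.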