Hi YAMAZAKI,

On Sat, Jul 12, 2025 at 12:08 AM YAMAZAKI MASAMITSU(山崎 真光)
<yamazaki-m...@nec.com> wrote:
>
> Sorry, I'm so late.
No worries :)

>
> I looked into the fix and I think it will work safely on other
> architectures as well. I think it will also solve the problem
> with ppc64. I accept and merge this patch.
>
> Thank you for reporting this problem and providing the fix for
> this very difficult problem.

Thanks for your response and for merging!

Thanks,
Tao Liu

>
> Thanks,
>
> Masa
>
> On 2025/07/10 14:34, Tao Liu wrote:
> > Kindly ping...
> >
> > Sorry to interrupt, could you please merge the patch, since there are
> > a few bugs which depend on the backporting of this patch?
> >
> > Thanks,
> > Tao Liu
> >
> >
> > On Fri, Jul 4, 2025 at 7:51 PM Tao Liu <l...@redhat.com> wrote:
> >> On Fri, Jul 4, 2025 at 6:49 PM HAGIO KAZUHITO(萩尾 一仁)
> >> <k-hagio...@nec.com> wrote:
> >>> On 2025/07/04 7:35, Tao Liu wrote:
> >>>> Hi Petr,
> >>>>
> >>>> On Fri, Jul 4, 2025 at 2:31 AM Petr Tesarik <ptesa...@suse.com> wrote:
> >>>>> On Tue, 1 Jul 2025 19:59:53 +1200
> >>>>> Tao Liu <l...@redhat.com> wrote:
> >>>>>
> >>>>>> Hi Kazu,
> >>>>>>
> >>>>>> Thanks for your comments!
> >>>>>>
> >>>>>> On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(萩尾 一仁)
> >>>>>> <k-hagio...@nec.com> wrote:
> >>>>>>> Hi Tao,
> >>>>>>>
> >>>>>>> thank you for the patch.
> >>>>>>>
> >>>>>>> On 2025/06/25 11:23, Tao Liu wrote:
> >>>>>>>> A vmcore corruption issue has been noticed on the powerpc arch [1].
> >>>>>>>> It can be reproduced with upstream makedumpfile.
> >>>>>>>>
> >>>>>>>> When analyzing the corrupted vmcore using crash, the following
> >>>>>>>> error messages are output:
> >>>>>>>>
> >>>>>>>>     crash: compressed kdump: uncompress failed: 0
> >>>>>>>>     crash: read error: kernel virtual address: c0001e2d2fe48000  type:
> >>>>>>>>            "hardirq thread_union"
> >>>>>>>>     crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000
> >>>>>>>>     crash: compressed kdump: uncompress failed: 0
> >>>>>>>>
> >>>>>>>> If the vmcore is generated without the num-threads option, no such
> >>>>>>>> errors are observed.
> >>>>>>>>
> >>>>>>>> With --num-threads=N enabled, N sub-threads are created. All
> >>>>>>>> sub-threads are producers, responsible for mm page processing, e.g.
> >>>>>>>> compression. The main thread is the consumer, responsible for
> >>>>>>>> writing the compressed data to the file. page_flag_buf->ready is
> >>>>>>>> used to synchronize the main thread and the sub-threads. When a
> >>>>>>>> sub-thread finishes page processing, it sets the ready flag to
> >>>>>>>> FLAG_READY. Meanwhile, the main thread loops over all threads'
> >>>>>>>> ready flags and breaks out of the loop when it finds FLAG_READY.
> >>>>>>>
> >>>>>>> I've tried to reproduce the issue, but I couldn't on x86_64.
> >>>>>>
> >>>>>> Yes, I cannot reproduce it on x86_64 either, but the issue is very
> >>>>>> easily reproduced on the ppc64 arch, which is where our QE reported
> >>>>>> it.
> >>>>>
> >>>>> Yes, this is expected. x86 implements a strongly ordered memory model,
> >>>>> so a "store-to-memory" instruction ensures that the new value is
> >>>>> immediately observed by other CPUs.
> >>>>>
> >>>>> FWIW the current code is wrong even on x86, because it does nothing to
> >>>>> prevent compiler optimizations. The compiler is then allowed to reorder
> >>>>> instructions so that the write to page_flag_buf->ready happens before
> >>>>> other writes; with a bit of bad scheduling luck, the consumer thread
> >>>>> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn).
> >>>>> Note that thanks to how compilers are designed (today), this issue is
> >>>>> more or less hypothetical.
> >>>>> Nevertheless, the use of atomics fixes it,
> >>>>> because they also serve as memory barriers.
> >>>
> >>> Thank you Petr, for the information. I was wondering whether atomic
> >>> operations might be necessary for the other members of page_flag_buf,
> >>> but it looks like they won't be necessary in this case.
> >>>
> >>> Then I was convinced that the issue would be fixed by removing the
> >>> inconsistency of page_flag_buf->ready. And the patch tested OK, so ack.
> >>>
> >> Thank you all for the patch review, patch testing and comments; these
> >> have been so helpful!
> >>
> >> Thanks,
> >> Tao Liu
> >>
> >>> Thanks,
> >>> Kazu
> >>>
> >>>> Thanks a lot for your detailed explanation, it's very helpful! I
> >>>> hadn't thought of the possibility of instruction reordering, nor that
> >>>> atomic_rw prevents the reordering.
> >>>>
> >>>> Thanks,
> >>>> Tao Liu
> >>>>
> >>>>> Petr T
> >>>>>
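To make Petr's point concrete, the handshake he describes can be sketched
in C11: the producer must publish its data with a release store to the
ready flag, and the consumer must observe the flag with an acquire load.
This is only a minimal sketch built from the thread above; the struct
layout, the FLAG_* values, and the function names are illustrative
assumptions, not makedumpfile's actual code.

    #include <stdatomic.h>
    #include <stdint.h>

    #define FLAG_UNUSED 0       /* assumed values, for illustration only */
    #define FLAG_READY  1

    struct page_flag_buf {
        uint64_t    pfn;        /* page frame number of the processed page */
        _Atomic int ready;      /* handshake flag between threads */
    };

    /* Producer (sub-thread): publish a processed page. */
    static void publish_page(struct page_flag_buf *buf, uint64_t pfn)
    {
        buf->pfn = pfn;
        /*
         * The release store guarantees that the write to buf->pfn above
         * becomes visible before the flag does.  A plain (non-atomic)
         * store gives the compiler, and a weakly ordered CPU such as
         * ppc64, license to reorder the two writes, which is exactly
         * the race discussed in the thread.
         */
        atomic_store_explicit(&buf->ready, FLAG_READY, memory_order_release);
    }

    /* Consumer (main thread): wait for a page and hand the buffer back. */
    static uint64_t consume_page(struct page_flag_buf *buf)
    {
        while (atomic_load_explicit(&buf->ready, memory_order_acquire)
               != FLAG_READY)
            ;   /* makedumpfile polls all threads rather than spinning
                   on one buffer; a single spin loop keeps this short */

        uint64_t pfn = buf->pfn;    /* guaranteed up to date here */
        atomic_store_explicit(&buf->ready, FLAG_UNUSED, memory_order_release);
        return pfn;
    }

On x86 the hardware already keeps stores in order, so only the compiler
barrier implied by the atomics matters there; on ppc64 the release/acquire
pair additionally enforces hardware ordering, which matches the observation
in the thread that the corruption reproduces easily on ppc64 but not on
x86_64.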