Bug#1006157: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko: [sparc64+ext4] reads see zeros w/ simultaneous write

2022-02-23 Thread Salvatore Bonaccorso
Control: tags -1 + upstream
Control: forwarded -1 https://marc.info/?l=linux-sparc&m=164539269632667&w=2

Hi Noah,

On Tue, Feb 22, 2022 at 07:12:14PM -0800, Noah Misch wrote:
> On Sun, Feb 20, 2022 at 03:31:27PM +0100, Salvatore Bonaccorso wrote:
> > Unless I am mistaken, this looks like an upstream issue, so I think
> > it would be better to report it directly upstream. Can you do so and
> > keep us in the loop?
> 
> https://marc.info/?t=16453926991 has my upstream report.  Anatoly Pugachev
> confirmed the behavior on sparc64 5.17.0-rc5, so I'm assuming this is not
> Debian-specific.  I will update this bug with any major news.

Many thanks.

Regards,
Salvatore



Bug#1006157: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko: [sparc64+ext4] reads see zeros w/ simultaneous write

2022-02-22 Thread Noah Misch
On Sun, Feb 20, 2022 at 03:31:27PM +0100, Salvatore Bonaccorso wrote:
> Unless I am mistaken, this looks like an upstream issue, so I think it
> would be better to report it directly upstream. Can you do so and
> keep us in the loop?

https://marc.info/?t=16453926991 has my upstream report.  Anatoly Pugachev
confirmed the behavior on sparc64 5.17.0-rc5, so I'm assuming this is not
Debian-specific.  I will update this bug with any major news.



Bug#1006157: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko: [sparc64+ext4] reads see zeros w/ simultaneous write

2022-02-20 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

Hi Noah,

On Sat, Feb 19, 2022 at 05:53:52PM -0800, Noah Misch wrote:
> Package: src:linux
> Version: 5.16.7-2
> Severity: normal
> File: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko
> 
> Dear Maintainer,
> 
>    * What led up to the situation?
> 
> The context is an ext4 filesystem on a sparc64 host.  I've observed
> this with each of the three sparc64 kernels that I've tested.  Those
> kernels were 5.16.0-1-sparc64-smp (this report), 5.15.0-2-sparc64-smp,
> and 4.9.0-13-sparc64-smp.
> 
>    * What exactly did you do (or not do) that was effective (or
>      ineffective)?
> 
> See the included file for a minimal test program.  It creates two
> processes, each of which loops indefinitely.  One opens a file, writes
> 0x1 to a 256-byte region, and closes the file.  The other process
> opens the same file, reads the same region, and prints a message if
> any byte is not 0x1.
> 
> This thread has more discussion and a more-configurable test program:
> https://postgr.es/m/flat/20220116071210.ga735...@rfd.leadboat.com
> 
>    * What was the outcome of this action?
> 
> The program prints messages, at least ten per second.  The mismatch
> always appears at an offset divisible by eight.  Some offsets are more
> common than others.  Here's output from 300s of runtime, filtered
> through "sort -nk3 | uniq -c":
> 
>    1729 mismatch at 8: got 0, want 1
>    1878 mismatch at 16: got 0, want 1
>    1030 mismatch at 24: got 0, want 1
>      41 mismatch at 40: got 0, want 1
>     373 mismatch at 48: got 0, want 1
>      24 mismatch at 56: got 0, want 1
>     349 mismatch at 64: got 0, want 1
>   13525 mismatch at 72: got 0, want 1
>     401 mismatch at 80: got 0, want 1
>     365 mismatch at 88: got 0, want 1
>       1 mismatch at 96: got 0, want 1
>      32 mismatch at 104: got 0, want 1
>      34 mismatch at 112: got 0, want 1
>      19 mismatch at 120: got 0, want 1
>      34 mismatch at 128: got 0, want 1
>     253 mismatch at 136: got 0, want 1
>     149 mismatch at 144: got 0, want 1
>     138 mismatch at 152: got 0, want 1
>       1 mismatch at 160: got 0, want 1
>       4 mismatch at 168: got 0, want 1
>       7 mismatch at 176: got 0, want 1
>       4 mismatch at 184: got 0, want 1
>       1 mismatch at 192: got 0, want 1
>      83 mismatch at 200: got 0, want 1
>      58 mismatch at 208: got 0, want 1
>    3301 mismatch at 216: got 0, want 1
>       2 mismatch at 232: got 0, want 1
>       1 mismatch at 248: got 0, want 1
> 
> If I run the program atop an xfs filesystem (still with sparc64), it
> prints nothing.  If I run it with x86_64 or powerpc64 (atop ext4), it
> prints nothing.
> 
>    * What outcome did you expect instead?
> 
> I expected the program to print nothing, indicating that the reader
> process observes only 0x1 bytes.  That is how x86_64+ext4 behaves.
> 
> POSIX is stricter, requiring read() and write() implementations such
> that "each call shall either see all of the specified effects of the
> other call, or none of them"
> (https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07).
> ext4 does not conform, which may be pragmatic.  However, with x86_64
> and powerpc64, readers see each byte as either its before-write value
> or its after-write value.  They don't see a zero in an offset that
> will have been nonzero both before and after the ongoing write().

Unless I am mistaken, this looks like an upstream issue, so I think it
would be better to report it directly upstream. Can you do so and
keep us in the loop?

Regards,
Salvatore



Bug#1006157: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko: [sparc64+ext4] reads see zeros w/ simultaneous write

2022-02-19 Thread Noah Misch
Package: src:linux
Version: 5.16.7-2
Severity: normal
File: /lib/modules/5.16.0-1-sparc64-smp/kernel/fs/ext4/ext4.ko

Dear Maintainer,

   * What led up to the situation?

The context is an ext4 filesystem on a sparc64 host.  I've observed
this with each of the three sparc64 kernels that I've tested.  Those
kernels were 5.16.0-1-sparc64-smp (this report), 5.15.0-2-sparc64-smp,
and 4.9.0-13-sparc64-smp.

   * What exactly did you do (or not do) that was effective (or
     ineffective)?

See the included file for a minimal test program.  It creates two
processes, each of which loops indefinitely.  One opens a file, writes
0x1 to a 256-byte region, and closes the file.  The other process
opens the same file, reads the same region, and prints a message if
any byte is not 0x1.

This thread has more discussion and a more-configurable test program:
https://postgr.es/m/flat/20220116071210.ga735...@rfd.leadboat.com
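
For readers without the attachment, the test described above can be
sketched roughly as follows.  This is not the attached program: the
function names, the default file name "testfile", the bounded iteration
count (the original loops indefinitely), and reporting only the first
mismatching byte per read are all assumptions made for illustration.

```python
import os
import sys

REGION = 256                    # size of the region, per the report
PATTERN = b"\x01" * REGION      # the writer always stores 0x1 bytes


def first_mismatch(buf, want=1):
    """Return (offset, value) of the first byte != want, or None."""
    for offset, got in enumerate(buf):
        if got != want:
            return offset, got
    return None


def writer(path, iterations):
    """Repeatedly open the file, write 0x1 over the region, and close it."""
    for _ in range(iterations):
        fd = os.open(path, os.O_WRONLY)
        try:
            os.pwrite(fd, PATTERN, 0)
        finally:
            os.close(fd)


def reader(path, iterations):
    """Repeatedly open the file, read the region, and report bad bytes."""
    reports = 0
    for _ in range(iterations):
        fd = os.open(path, os.O_RDONLY)
        try:
            buf = os.pread(fd, REGION, 0)
        finally:
            os.close(fd)
        hit = first_mismatch(buf)
        if hit is not None:
            print(f"mismatch at {hit[0]}: got {hit[1]}, want 1")
            reports += 1
    return reports


def run(path, iterations=100_000):
    """Pre-fill the file with 0x1, then race one writer against one reader."""
    with open(path, "wb") as f:
        f.write(PATTERN)
    pid = os.fork()
    if pid == 0:                # child process: writer
        writer(path, iterations)
        os._exit(0)
    reports = reader(path, iterations)   # parent process: reader
    os.waitpid(pid, 0)
    return reports


if __name__ == "__main__":
    run(sys.argv[1] if len(sys.argv) > 1 else "testfile")
```

Because the file holds only 0x1 bytes before the race starts and the
writer stores only 0x1 bytes, any message from the reader means it saw a
value that was never written to that offset.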

   * What was the outcome of this action?

The program prints messages, at least ten per second.  The mismatch
always appears at an offset divisible by eight.  Some offsets are more
common than others.  Here's output from 300s of runtime, filtered
through "sort -nk3 | uniq -c":

   1729 mismatch at 8: got 0, want 1
   1878 mismatch at 16: got 0, want 1
   1030 mismatch at 24: got 0, want 1
     41 mismatch at 40: got 0, want 1
    373 mismatch at 48: got 0, want 1
     24 mismatch at 56: got 0, want 1
    349 mismatch at 64: got 0, want 1
  13525 mismatch at 72: got 0, want 1
    401 mismatch at 80: got 0, want 1
    365 mismatch at 88: got 0, want 1
      1 mismatch at 96: got 0, want 1
     32 mismatch at 104: got 0, want 1
     34 mismatch at 112: got 0, want 1
     19 mismatch at 120: got 0, want 1
     34 mismatch at 128: got 0, want 1
    253 mismatch at 136: got 0, want 1
    149 mismatch at 144: got 0, want 1
    138 mismatch at 152: got 0, want 1
      1 mismatch at 160: got 0, want 1
      4 mismatch at 168: got 0, want 1
      7 mismatch at 176: got 0, want 1
      4 mismatch at 184: got 0, want 1
      1 mismatch at 192: got 0, want 1
     83 mismatch at 200: got 0, want 1
     58 mismatch at 208: got 0, want 1
   3301 mismatch at 216: got 0, want 1
      2 mismatch at 232: got 0, want 1
      1 mismatch at 248: got 0, want 1
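
A tally like the one above, together with a check of the
divisible-by-eight observation, could be computed from the raw messages
with a sketch such as the following.  The function names are
hypothetical; only the message format is taken from the output shown.

```python
import re
from collections import Counter

# Matches the message format shown in the tally above.
LINE_RE = re.compile(r"mismatch at (\d+): got (\d+), want (\d+)")


def tally_offsets(lines):
    """Count messages per offset, ordered by offset.

    Roughly what `sort -nk3 | uniq -c` does to the raw output.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return dict(sorted(counts.items()))


def all_multiples_of(counts, n=8):
    """True if every reported offset is divisible by n."""
    return all(offset % n == 0 for offset in counts)


if __name__ == "__main__":
    import sys
    counts = tally_offsets(sys.stdin)
    for offset, count in counts.items():
        print(f"{count:7d} mismatch at {offset}")
    print("all offsets divisible by 8:", all_multiples_of(counts))
```

Piping the test program's output into this script prints one line per
offset and then whether every offset is a multiple of eight.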

If I run the program atop an xfs filesystem (still with sparc64), it
prints nothing.  If I run it with x86_64 or powerpc64 (atop ext4), it
prints nothing.

   * What outcome did you expect instead?

I expected the program to print nothing, indicating that the reader
process observes only 0x1 bytes.  That is how x86_64+ext4 behaves.

POSIX is stricter, requiring read() and write() implementations such
that "each call shall either see all of the specified effects of the
other call, or none of them"
(https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07).
ext4 does not conform, which may be pragmatic.  However, with x86_64
and powerpc64, readers see each byte as either its before-write value
or its after-write value.  They don't see a zero in an offset that
will have been nonzero both before and after the ongoing write().


-- Package-specific info:
** Version:
Linux version 5.16.0-1-sparc64-smp (debian-ker...@lists.debian.org) (gcc-11 (Debian 11.2.0-16) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220130) #1 SMP Debian 5.16.7-1 (2022-02-06)

** Command line:
BOOT_IMAGE=/vmlinux-5.16.0-1-sparc64-smp root=/dev/mapper/vg1-nroot ro

** Tainted: E (8192)
 * unsigned module was loaded

** Kernel log:
[344103.150402] null-4.exe[3045591]: segfault at 0 ip 01000990 (rpc 01000984) sp 07feff952831 error 1 in null-4.exe[100+2000]
[344103.533876] null-4.exe[3045722]: segfault at 8 ip 01000990 (rpc 01000984) sp 07feffa8c841 error 1 in null-4.exe[100+2000]
[344103.911758] null-4.exe[3045896]: segfault at 8 ip 010007e4 (rpc 010007dc) sp 07feffeec841 error 1 in null-4.exe[100+2000]
[344104.319288] null-4.exe[3046052]: segfault at 8 ip 010007e4 (rpc 010007dc) sp 07feffa2e841 error 1 in null-4.exe[100+2000]
[344104.703441] null-4.exe[3046206]: segfault at 8 ip 010007c8 (rpc 010007bc) sp 07feffeb8841 error 1 in null-4.exe[100+2000]
[344105.411714] null-4.exe[3046494]: segfault at 8 ip 010007e4 (rpc 010007dc) sp 07feff9ec841 error 1 in null-4.exe[100+2000]
[344105.921598] null-4.exe[3046699]: segfault at 8 ip 010007e4 (rpc 010007dc) sp 07feffd3a841 error 1 in null-4.exe[100+2000]
[344106.302875] null-5.exe[3046860]: segfault at 0 ip 010009b0 (rpc 010009a4) sp 07feffbc6831 error 1 in null-5.exe[100+2000]
[344107.467462] show_signal_msg: 2 callbacks suppressed
[344107.467472] null-5.exe[3047293]: segfault at 0 ip 010007f0 (rpc 010007dc) sp 07feff9a8841 error 1 in null-5.exe[100+2000]