On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote: > On 09/15/2017 12:18 AM, Hugo Mills wrote: > > As far as I know, both of these are basically known issues, with no > > good solution, other than not using O_DIRECT. Certainly the first > > issue is one I recognise. The second isn't one I recognise directly, > > but is unsurprising to me. > > > > There have been discussions -- including developers -- on this list > > as recent as a month or so ago. The general outcome seems to be that > > any problems with O_DIRECT are not going to be fixed. > > I missed this thread; could you point it to me ?
No, you didn't miss it -- you were part of it. :) http://www.spinics.net/lists/linux-btrfs/msg68244.html Hugo. > If csum and O_DIRECT are not reliable, why not disallow one of them: i.e > allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support > O_DIRECT at all... > > In fact most of the applications which benefit from O_DIRECT (it comes to me > VM e DB), are the ones which need also nodatasum to have good performance. > > One of the strongest point of BTRFS was the checksums; but these are not > effective when the file is opened with O_DIRECT; worse there are cases where > the file is corrupted and the application got -EIO; not mentioning that the > dmesg is filled by "csum failed ...." > > > > > > Hugo. > > > > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: > >> Hi all, > >> > >> I discovered two bugs when O_DIRECT is used... > >> > >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used > >> > >> Normally BTRFS prevents to access the contents of a corrupted file; > >> however I was able read the content of a corrupted file simply using > >> O_DIRECT > >> > >> # in a new btrfs filesystem, create a file > >> $ sudo mkfs.btrfs -f /dev/sdd5 > >> $ mount /dev/sdd5 t > >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd > >> bs=$((16*1024)) iflag=fullblock count=1024 > >> > >> # corrupt the file > >> $ sudo filefrag -v t/abcd > >> Filesystem type is: 9123683e > >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) > >> ext: logical_offset: physical_offset: length: expected: > >> flags: > >> 0: 0.. 3475: 70656.. 74131: 3476: > >> 1: 3476.. 4095: 74212.. 74831: 620: 74132: > >> last,eof > >> t/abcd: 2 extents found > >> $ sudo umount t > >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 > >> /dev/sdd5 > >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 > >> corrupting 289406976 copy 1 > >> > >> # try to access the file; expected result: -EIO > >> $ sudo mount /dev/sdd5 t > >> $ dd if=t/abcd | hexdump -c | head > >> dd: error reading 't/abcd': Input/output error > >> 0+0 records in > >> 0+0 records out > >> 0 bytes copied, 0.000477413 s, 0.0 kB/s > >> > >> > >> # try to access the file using O_DIRECT; expected result: -EIO, instead > >> the file is accessible > >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head > >> 0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 > >> * > >> 0001000 f g a b c e f g a b c e f g a b > >> 0001010 c e f g a b c e f g a b c e f g > >> 0001020 a b c e f g a b c e f g a b c e > >> 0001030 f g a b c e f g a b c e f g a b > >> 0001040 c e f g a b c e f g a b c e f g > >> 0001050 a b c e f g a b c e f g a b c e > >> 0001060 f g a b c e f g a b c e f g a b > >> 0001070 c e f g a b c e f g a b c e f g > >> > >> (dmesg report the checksum mismatch) > >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off > >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 > >> > >> Note the first 4k filled by 0x01 !!!!! > >> > >> Conclusion: even if the file is corrupted and normally BTRFS prevent to > >> access it, using O_DIRECT > >> a) no error is returned to the caller > >> b) instead of the page stored on the disk, it is returned a page filled > >> with 0x01 (according also with the function __readpage_endio_check()) > >> > >> > >> 2) The second bug, is a more severe bug. If during a writing of a buffer > >> with O_DIRECT, the buffer is updated at the same time by a second process, > >> the checksum may be incorrect. > >> > >> At the end of the email there is the code which shows the problem: two > >> process share the same memory: the first write it to the disk, the second > >> update the buffer continuously. A third process try to read the file, but > >> it got time to time -EIO > >> > >> If you ran my code in a btrfs filesystem you got a lot of > >> > >> ERROR: read thread; r = 8192, expected = 16384 > >> ERROR: read thread; r = 8192, expected = 16384 > >> ERROR: read thread; e = 5 - Input/output error > >> ERROR: read thread; e = 5 - Input/output error > >> > >> The firsts lines are related to a shorter read (which may happens). The > >> lasts lines are related to a checksum mismatch. The dmesg is filled by > >> lines like > >> [...] > >> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off > >> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1 > >> [...] > >> > >> This is definitely a bug. > >> > >> I think that using O_DIRECT and updating a page at the same time could > >> happen in a VM. In BTRFS this could lead to a wrong checksum. The problem > >> is that if BTRFS detects a checksum error during a reading: > >> a) if O_DIRECT is not used in the read > >> * -EIO is returned > >> Definitely BAD > >> > >> b) if O_DIRECT is used in the read > >> * it doesn't return the error to the caller > >> * it returns a page filled by 0x01 instead of the data from the disk > >> Even worse than a) > >> > >> Note1: even using O_DIRECT with O_SYNC, the problem still persist. > >> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, > >> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is > >> legally. > >> Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS > >> doesn't support it; see https://github.com/zfsonlinux/zfs/issues/224 > >> > >> BR > >> G.Baroncelli > >> > >> ----- cut --- cut --- cut ---- > >> > >> #define _GNU_SOURCE > >> #include <stdio.h> > >> #include <stdlib.h> > >> #include <sys/mman.h> > >> #include <assert.h> > >> #include <errno.h> > >> #include <sys/types.h> > >> #include <sys/stat.h> > >> #include <fcntl.h> > >> #include <string.h> > >> #include <unistd.h> > >> > >> #define FILESIZE (4096*4) > >> > >> int fd; > >> char *buffer = NULL; > >> > >> void read_thread(const char *nf) { > >> > >> void *data = mmap(NULL, FILESIZE, > >> PROT_READ|PROT_WRITE, > >> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); > >> > >> assert(data); > >> fprintf(stderr, "read_thread: data = %p\n", data); > >> int rfd; > >> rfd = open(nf, O_RDONLY); > >> > >> for(;;) { > >> ssize_t r = pread(rfd, data, FILESIZE, 0); > >> if (r < 0) { > >> int e = errno; > >> fprintf(stderr, "ERROR: read thread; e = %d - %s\n", > >> e, strerror(e)); > >> > >> } else if (r != FILESIZE) { > >> fprintf(stderr, "ERROR: read thread; r = %ld, expected > >> = %d\n", > >> r, FILESIZE); > >> } > >> } > >> } > >> > >> void write_thread(void) { > >> > >> for(;;) { > >> ssize_t r = pwrite(fd, buffer, FILESIZE, 0); > >> assert(r == FILESIZE); > >> } > >> } > >> > >> void update_thread(void) { > >> > >> for(;;) { > >> int i; > >> for (i = 0 ; i < FILESIZE ; i++) > >> buffer[i] += i+10; > >> } > >> } > >> > >> > >> int main(int argc, char **argv) { > >> > >> if (argc < 2) { > >> fprintf(stderr, "usage: %s <fname>\n", argv[0]); > >> exit(100); > >> } > >> > >> > >> buffer = mmap(NULL, FILESIZE, > >> PROT_READ|PROT_WRITE, > >> MAP_SHARED|MAP_ANONYMOUS, -1, 0); > >> > >> assert(buffer); > >> fprintf(stderr, "main: data = %p\n", buffer); > >> > >> fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660); > >> assert(fd>=0); > >> > >> ssize_t r = pwrite(fd, buffer, FILESIZE, 0); > >> assert(r == FILESIZE); > >> > >> pid_t child; > >> > >> child = fork(); > >> assert(child >= 0); > >> if (child == 0) > >> write_thread(); > >> fprintf(stderr, "write_thread pid = %d\n", child); > >> > >> child = fork(); > >> assert(child >= 0); > >> if (child == 0) > >> read_thread(argv[1]); > >> fprintf(stderr, "read_thread pid = %d\n", child); > >> > >> child = fork(); > >> assert(child >= 0); > >> if (child == 0) > >> update_thread(); > >> fprintf(stderr, "update_thread pid = %d\n", child); > >> > >> for(;;) > >> sleep(100*100*100); > >> > >> > >> return 0; > >> } > >> > >> ----- cut --- cut --- cut -- > > > > -- Hugo Mills | Alert status mauve ocelot: Slight chance of hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea. http://carfax.org.uk/ | PGP: E2AB1DE4 |
signature.asc
Description: Digital signature