On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 12:18 AM, Hugo Mills wrote:
> >    As far as I know, both of these are basically known issues, with no
> > good solution, other than not using O_DIRECT. Certainly the first
> > issue is one I recognise. The second isn't one I recognise directly,
> > but is unsurprising to me.
> > 
> >    There have been discussions -- including developers -- on this list
> > as recent as a month or so ago. The general outcome seems to be that
> > any problems with O_DIRECT are not going to be fixed.
> 
> I missed this thread; could you point it to me ?

   No, you didn't miss it -- you were part of it. :)

   http://www.spinics.net/lists/linux-btrfs/msg68244.html

   Hugo.

> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e 
> allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support 
> O_DIRECT at all...
> 
> In fact most of the applications which benefit from O_DIRECT (it comes to me 
> VM e DB), are the ones which need also nodatasum to have good performance.
> 
> One of the strongest point of BTRFS was the checksums; but these are not 
> effective when the file is opened with O_DIRECT; worse there are cases where 
> the file is corrupted and the application got -EIO; not mentioning that the 
> dmesg is filled by "csum failed  ...."
> 
> 
> > 
> >    Hugo.
> > 
> > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> I discovered two bugs when O_DIRECT is used...
> >>
> >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> >>
> >> Normally BTRFS prevents to access the contents of a corrupted file; 
> >> however I was able read the content of a corrupted file simply using 
> >> O_DIRECT
> >>
> >> # in a new btrfs filesystem, create a file
> >> $ sudo mkfs.btrfs -f /dev/sdd5
> >> $ mount /dev/sdd5 t
> >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> >> bs=$((16*1024)) iflag=fullblock count=1024
> >>
> >> # corrupt the file
> >> $ sudo filefrag -v t/abcd 
> >> Filesystem type is: 9123683e
> >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
> >>  ext:     logical_offset:        physical_offset: length:   expected: 
> >> flags:
> >>    0:        0..    3475:      70656..     74131:   3476:            
> >>    1:     3476..    4095:      74212..     74831:    620:      74132: 
> >> last,eof
> >> t/abcd: 2 extents found
> >> $ sudo umount t
> >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> >> /dev/sdd5
> >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> >> corrupting 289406976 copy 1
> >>
> >> # try to access the file; expected result: -EIO
> >> $ sudo mount /dev/sdd5 t
> >> $ dd if=t/abcd | hexdump -c | head
> >> dd: error reading 't/abcd': Input/output error
> >> 0+0 records in
> >> 0+0 records out
> >> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> >>
> >>
> >> # try to access the file using O_DIRECT; expected result: -EIO, instead 
> >> the file is accessible
> >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> >> 0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> >> *
> >> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >>
> >> (dmesg report the checksum mismatch)
> >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
> >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> >>
> >> Note the first 4k filled by 0x01 !!!!!
> >>
> >> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> >> access it, using O_DIRECT
> >> a) no error is returned to the caller
> >> b) instead of the page stored on the disk, it is returned a page filled 
> >> with 0x01 (according also with the function __readpage_endio_check())
> >>
> >>
> >> 2) The second bug, is a more severe bug. If during a writing of a buffer 
> >> with O_DIRECT, the buffer is updated at the same time by a second process, 
> >> the checksum may be incorrect.
> >>
> >> At the end of the email there is the code which shows the problem: two 
> >> process share the same memory: the first write it to the disk, the second 
> >> update the buffer continuously. A third process try to read the file, but 
> >> it got time to time -EIO
> >>
> >> If you ran my code in a btrfs filesystem you got a lot of 
> >>
> >> ERROR: read thread; r = 8192, expected = 16384
> >> ERROR: read thread; r = 8192, expected = 16384
> >> ERROR: read thread; e = 5 - Input/output error
> >> ERROR: read thread; e = 5 - Input/output error
> >>
> >> The firsts lines are related to a shorter read (which may happens). The 
> >> lasts lines are related to a checksum mismatch. The dmesg is filled by 
> >> lines like
> >> [...]
> >> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
> >> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
> >> [...]
> >>
> >> This is definitely a bug. 
> >>
> >> I think that using O_DIRECT and updating a page at the same time could 
> >> happen in a VM. In BTRFS this  could lead to a wrong checksum. The problem 
> >> is that if BTRFS detects a checksum error during a reading:
> >> a) if O_DIRECT is not used in the read
> >>    * -EIO is returned
> >> Definitely BAD
> >>
> >> b) if O_DIRECT is used in the read
> >>    * it doesn't return the error to the caller
> >>    * it returns a page filled by 0x01 instead of the data from the disk
> >> Even worse than a) 
> >>
> >> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
> >> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, 
> >> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is 
> >> legally.
> >> Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS 
> >> doesn't support it; see https://github.com/zfsonlinux/zfs/issues/224
> >>
> >> BR
> >> G.Baroncelli
> >>
> >> ----- cut --- cut --- cut ----
> >>
> >> #define _GNU_SOURCE
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <sys/mman.h>
> >> #include <assert.h>
> >> #include <errno.h>
> >> #include <sys/types.h>
> >> #include <sys/stat.h>
> >> #include <fcntl.h>
> >> #include <string.h>
> >> #include <unistd.h>
> >>
> >> #define FILESIZE   (4096*4)
> >>
> >> int fd;
> >> char *buffer = NULL;
> >>
> >> void read_thread(const char *nf) {
> >>    
> >>    void *data = mmap(NULL,  FILESIZE,
> >>                            PROT_READ|PROT_WRITE, 
> >>                            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >>    
> >>    assert(data);
> >>    fprintf(stderr, "read_thread:  data = %p\n", data);
> >>    int rfd;
> >>    rfd = open(nf, O_RDONLY);
> >>    
> >>    for(;;) {
> >>            ssize_t r = pread(rfd, data, FILESIZE, 0);
> >>            if (r < 0) {
> >>                    int e = errno;
> >>                    fprintf(stderr, "ERROR: read thread; e = %d - %s\n", 
> >>                           e, strerror(e));
> >>
> >>            } else if (r != FILESIZE) {
> >>                    fprintf(stderr, "ERROR: read thread; r = %ld, expected 
> >> = %d\n", 
> >>                           r, FILESIZE);
> >>            }
> >>    }
> >> }
> >>
> >> void write_thread(void) {
> >>
> >>    for(;;) {
> >>            ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
> >>            assert(r == FILESIZE);
> >>    }
> >> }
> >>
> >> void update_thread(void) {
> >>
> >>    for(;;) {
> >>            int i;
> >>            for (i = 0 ; i < FILESIZE ; i++)
> >>                    buffer[i] += i+10;
> >>    }
> >> }
> >>
> >>
> >> int main(int argc, char **argv) {
> >>    
> >>    if (argc < 2) {
> >>            fprintf(stderr, "usage: %s <fname>\n", argv[0]);
> >>            exit(100);
> >>    }
> >>    
> >>    
> >>    buffer = mmap(NULL,  FILESIZE,
> >>                            PROT_READ|PROT_WRITE, 
> >>                            MAP_SHARED|MAP_ANONYMOUS, -1, 0);
> >>    
> >>    assert(buffer);
> >>    fprintf(stderr, "main:  data = %p\n", buffer);
> >>    
> >>    fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660);
> >>    assert(fd>=0);
> >>    
> >>    ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
> >>    assert(r == FILESIZE);
> >>    
> >>    pid_t child;
> >>    
> >>    child = fork();
> >>    assert(child >= 0);
> >>    if (child == 0)
> >>            write_thread();
> >>    fprintf(stderr, "write_thread pid = %d\n", child);
> >>    
> >>    child = fork();
> >>    assert(child >= 0);
> >>    if (child == 0)
> >>            read_thread(argv[1]);
> >>    fprintf(stderr, "read_thread pid = %d\n", child);
> >>    
> >>    child = fork();
> >>    assert(child >= 0);
> >>    if (child == 0)
> >>            update_thread();
> >>    fprintf(stderr, "update_thread pid = %d\n", child);
> >>    
> >>    for(;;)
> >>            sleep(100*100*100);
> >>
> >>    
> >>    return 0;
> >> }
> >>
> >> ----- cut --- cut --- cut -- 
> > 
> 
> 

-- 
Hugo Mills             | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

Attachment: signature.asc
Description: Digital signature

Reply via email to