* Daniel Henrique Barboza ([email protected]) wrote:
> Hi,
>
> I've been working for the last two months on a miscompare issue that happens
> when using a RAID device and a SATA drive as scsi-hd (emulated SCSI) with
> cache=none and io=threads during a hardware stress test. I'll summarize it
> here as best as I can without creating a great wall of text - Red Hat folks
> can check [1] for all the details.
>
> Using the following setup:
>
> - Host is a POWER9 RHEL 7.5-alt: kernel 4.14.0-49.1.1.el7a.ppc64le,
>   qemu-kvm-ma 2.10.0-20.el7 (also reproducible with upstream QEMU)
>
> - Guest is RHEL 7.5-alt using the same kernel as the host, with two storage
>   disks (a 1.8 TB RAID and a 446 GB SATA drive) configured as follows:
>
>   <disk type='block' device='disk'>
>     <driver name='qemu' type='raw' cache='none'/>
>     <source dev='/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b'/>
>     <target dev='sdc' bus='scsi'/>
>     <alias name='scsi0-0-0-2'/>
>     <address type='drive' controller='0' bus='0' target='0' unit='2'/>
>   </disk>
>
> Both block devices have WCE off in the host.
>
> With this environment, we found problems when running a stress test called
> HTX [2]. At a given time (usually after 24+ hours of test) HTX finds a data
> miscompare in one of the devices. This is an example:
>
> -------
>
> Device name: /dev/sdb
> Total blocks: 0x74706daf, Block size: 0x200
> Rule file name: /usr/lpp/htx/rules/reg/hxestorage/default.hdd
> Number of Rulefile passes (cycle) completed: 0
> Stanza running: rule_6, Thread no.: 8
> Oper performed: wrc, Current seek type: SEQ
> LBA no. where IO started: 0x94fa
> Transfer size: 0x8400
>
> Miscompare Summary:
> ===================
> LBA no. where miscomapre started: 0x94fa
> LBA no. where miscomapre ended: 0x94ff
> Miscompare start offset (in bytes): 0x8
> Miscomapre end offset (in bytes): 0xbff
> Miscompare size (in bytes): 0xbf8
>
> Expected data (at miscomapre offset): 8c9aea5a736462000000000000007275
> Actual data (at miscomapre offset):   889aea5a736462000000000000007275
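A quick, purely illustrative check of the sample quoted above: the expected
and actual data differ only in the first byte shown, 0x8c vs 0x88, and
XOR-ing the two values shows exactly one flipped bit:

    /* minimal sketch: count the bits that differ in the first byte of the
     * quoted sample (values taken from the HTX log above) */
    #include <stdio.h>

    int main(void)
    {
        unsigned char expected = 0x8c;
        unsigned char actual   = 0x88;
        unsigned char diff     = expected ^ actual;   /* == 0x04, bit 2 */

        printf("bits flipped: %d\n", __builtin_popcount(diff));
        return 0;
    }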
Are all the miscompares single-bit errors like that one?
Is the test doing single-bit manipulation or is that coming out of the blue?

Dave

> -----
>
> This means that the test executed a write at LBA 0x94fa and, after
> confirming that the write was completed, issued 2 reads of the same LBA to
> verify the written contents, and found a mismatch.
>
> I've tested all sorts of configurations of disk vs LUN, cache modes and
> AIO. My findings are:
>
> - using device='lun' instead of device='disk', I can't reproduce the issue
>   no matter what the other configurations are;
> - using device='disk' but with cache='writethrough', the issue doesn't
>   happen (haven't checked other cache modes);
> - using device='disk', cache='none' and io='native', the issue doesn't
>   happen.
>
> The issue seems to be tied to the combination device=disk + cache=none +
> io=threads. I've started digging into the SCSI layer all the way down to
> the block backend. With a shameful amount of logs I've discovered that, in
> the write where the test finds a miscompare, in block/file-posix.c:
>
> - when doing the write, handle_aiocb_rw_vector() returns success and
>   pwritev() reports that all bytes were written;
> - in both reads after the write, handle_aiocb_rw_vector() returns success
>   and all bytes are read by preadv(). In both reads, the data read is
>   different from the data written by the pwritev() that happened before.
>
> In the discussions at [1], Fam Zheng suggested a test in which we would
> reduce the number of threads created in the POSIX thread pool from 64 to 1.
> The idea is to ensure that we're using the same thread to write and read.
> There was a suspicion that the kernel can't guarantee data coherency
> between different threads, even if they use the same fd, when using
> pwritev() and preadv(). This would explain why the subsequent reads on the
> same fd fail to retrieve the data that was just written. After making this
> modification, the miscompare didn't reproduce.
>
> After reverting the thread pool change, I made a couple of attempts:
> flushing before the read and flushing after the write. Both attempts
> failed - the miscompare appears in both scenarios. This reinforces the
> suspicion above - if data coherency can't be guaranteed between different
> threads, flushing in different threads wouldn't make a difference either.
> I've also tested a suggestion from Fam where I started the disks with
> "cache.direct=on,cache.no-flush=off" - the bug still reproduces.
>
> This is the current status of this investigation. I decided to start a
> discussion here to see whether someone can point out something I overlooked
> or got wrong, before I start changing the POSIX thread pool behavior to see
> if I can force the read() to be done by the same POSIX thread that did the
> write() on that fd. Any suggestions?
>
> ps: it is worth mentioning that I was able to reproduce this same bug on a
> POWER8 system running Ubuntu 18.04. Given that the code we're dealing with
> doesn't have any arch-specific behavior, I wouldn't be surprised if this
> bug is also reproducible on other archs such as x86.
>
> Thanks,
>
> Daniel
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1561017
> [2] https://github.com/open-power/HTX
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
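For anyone who wants to poke at the coherency suspicion outside of QEMU,
below is a minimal standalone sketch (not QEMU code; the path, block size
and offset are made up for illustration) of the pattern under discussion:
pwritev() on an O_DIRECT fd from one thread, then preadv() of the same
range on the same fd from a different thread, followed by a compare:

    /* cross-thread pwritev()/preadv() sketch - illustrative only */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define BLK 4096              /* O_DIRECT wants aligned size/offset */
    #define OFF (BLK * 16)        /* arbitrary aligned offset */

    static int fd;
    static unsigned char *wbuf, *rbuf;

    static void *writer(void *arg)
    {
        struct iovec iov = { .iov_base = wbuf, .iov_len = BLK };
        if (pwritev(fd, &iov, 1, OFF) != BLK) {
            perror("pwritev");
            exit(1);
        }
        return NULL;
    }

    static void *reader(void *arg)
    {
        struct iovec iov = { .iov_base = rbuf, .iov_len = BLK };
        if (preadv(fd, &iov, 1, OFF) != BLK) {
            perror("preadv");
            exit(1);
        }
        return NULL;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "scratch.img";
        pthread_t t;

        /* O_DIRECT mirrors what cache=none does for the data path */
        fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (posix_memalign((void **)&wbuf, BLK, BLK) ||
            posix_memalign((void **)&rbuf, BLK, BLK)) {
            return 1;
        }
        memset(wbuf, 0x5a, BLK);

        /* write from one thread... */
        pthread_create(&t, NULL, writer, NULL);
        pthread_join(t, NULL);

        /* ...then read the same range back from a different thread */
        pthread_create(&t, NULL, reader, NULL);
        pthread_join(t, NULL);

        printf("%s\n", memcmp(wbuf, rbuf, BLK) ? "MISCOMPARE" : "match");
        close(fd);
        return 0;
    }

Compile with "gcc -pthread" and point it at a scratch file or device that
supports O_DIRECT. A single pass like this should of course always print
"match" - the report above only sees the mismatch after many hours of HTX
load, so this only illustrates the shape of the access pattern, it is not a
reproducer.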

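For reference, the libvirt settings being compared above map roughly onto
QEMU -drive/-device options like the following (a sketch, not the exact
command line libvirt generates). cache='none' corresponds to
cache.direct=on,cache.no-flush=off, and io='threads' vs io='native' select
the worker thread pool or Linux AIO respectively:

    # device='disk', cache='none', io='threads' (the failing combination)
    -drive file=/dev/disk/by-id/scsi-3600605b000a2c110ff0004053d84a61b,format=raw,if=none,id=drive-scsi0-0-0-2,cache=none,aio=threads
    -device scsi-hd,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2

    # device='disk', cache='none', io='native' (does not reproduce):
    # same -drive line but with aio=native

    # device='lun' (does not reproduce): scsi-block passthrough instead of
    # the emulated scsi-hd
    -device scsi-block,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2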