On 05/27/2018 11:41 AM, Nikolay Borisov wrote:
>
>
> On 27.05.2018 08:50, Andrei Borzenkov wrote:
>> 23.05.2018 09:32, Nikolay Borisov wrote:
>>>
>>>
>>> On 22.05.2018 23:05, ein wrote:
>>>> Hello devs,
>>>>
>>>> I tested BTRFS in production for about a month:
>>>>
>>>> 21:08:17 up 34 days, 2:21, 3 users, load average: 0.06, 0.02, 0.00
>>>>
>>>> No power blackout, no hardware failure, the SSD's SMART is flawless, etc.
>>>> The tests ended with:
>>>>
>>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489 mirror 1
>>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1 mirror 1
>>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8 mirror 1
>>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8 mirror 1
>>>>
>>>> The above showed up during the command below and quite high IO usage by a
>>>> few VMs (Linux on top of Ext4 with a Firebird database, lots of random
>>>> reads/writes, two others with Windows 2016 and Windows Update in the
>>>> background):
>>>
>>> I believe you are hitting the issue described here:
>>>
>>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
>>>
>>> Essentially the way qemu operates on vm images atop btrfs is prone to
>>> producing such errors. As a matter of fact, other filesystems also
>>> suffer from this (i.e. pages modified while being written), but due to
>>> the lack of CRCs on the data they don't detect it. Can you confirm that
>>> those inodes (312/314/319/219962/219915) belong to vm image files?
>>>
>>> IMHO the best course of action would be to disable checksumming for your
>>> vm files.
>>>
>>>
>>> For some background I suggest you read the following LWN articles:
>>>
>>> https://lwn.net/Articles/486311/
>>> https://lwn.net/Articles/442355/
>>>
>>
>> Hmm ... according to these articles, "pages under writeback are marked
>> as not being writable; any process attempting to write to such a page
>> will block until the writeback completes". And it says this feature has
>> been available since 3.0 and btrfs has it. So how come it still happens?
>> Were the stable pages patches removed since then?
>
> If you are using buffered writes, then yes, you won't have the problem.
> However qemu by default bypasses host's page cache and instead uses DIO:
>
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs
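Thanks for the links. If disabling checksumming for the image files is the way
to go, my understanding is that on btrfs this is done per file with the NOCOW
attribute (chattr +C), which also turns off data checksums, and that it only
takes effect for files created after the attribute is set. A rough, untested
sketch of what I would try ('db-vm' is a placeholder domain name, paths taken
from my config further below):

# Stop the guest first so the image is not in use.
virsh shutdown db-vm

cd /var/lib/libvirt/images

# Set NOCOW (and thereby no data checksums) on the directory so that
# newly created files inherit it; it cannot be applied retroactively
# to a file that already has data.
chattr +C .

# Recreate the image so the new copy is created with the attribute set,
# then swap it into place.
cp --reflink=never db.raw db.raw.nocow
mv db.raw db.raw.bak && mv db.raw.nocow db.raw

# Verify: the 'C' flag should show up for the new file.
lsattr db.raw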
I can confirm that writing data to the filesystem on the guest side is not
buffered at the host with this config:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source file='/var/lib/libvirt/images/db.raw'/>
  <target dev='vda' bus='virtio'/>
  [...]
</disk>

because buff/cache memory usage stays unchanged at the host during heavy
sequential writes and there is no kworker/flush process committing the data.

How can qemu avoid dirty page buffering? There is nothing else than ppoll,
read, io_submit and write in strace:

read(52, "\1\0\0\0\0\0\0\0", 512) = 8
io_submit(0x7f35367f7000, 2, [{pwritev, fildes=19, iovec=[{iov_base="\0\0\0\0\0\0\0\0"..., iov_len=368640}, ...
read(38, "\3\0\0\0\0\0\0\0", 512) = 8
ppoll([{fd=52, events=POLLIN|POLLERR|POLLHUP}, {fd=38, events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}], 3, NULL, NULL, 8) = 1 ([{fd=52, revents=POLLIN}])
read(52, "\1\0\0\0\0\0\0\0", 512) = 8
io_submit(0x7f35367f7000, 1, [{pwritev, fildes=19, iovec=[{iov_base="\0\0\0\0\0\0\0\0"..., iov_len=368640}, ...
ppoll([{fd=52, events=POLLIN|POLLERR|POLLHUP}, {fd=38, events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}], 3, {tv_sec=0, tv_nsec=0}, NULL, 8) = 2 ([{fd=52, re...
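As far as I understand it, cache='none' makes qemu open the backing file with
O_DIRECT, so writes bypass the host page cache entirely, and io='native' issues
them asynchronously through Linux AIO, which is exactly the io_submit() traffic
visible above. A quick way to check this on the host (a sketch; it assumes the
qemu process for this guest can be found by the image path, and the fd number
will differ) is to look at the open-file flags in /proc, where the O_DIRECT bit
is 040000 (octal) on x86:

pid=$(pgrep -f 'qemu.*db\.raw' | head -n1)   # assumes only one qemu serves db.raw

# Find the fd(s) qemu has open on the image and print their open flags.
for fd in /proc/$pid/fd/*; do
    if [ "$(readlink "$fd")" = "/var/lib/libvirt/images/db.raw" ]; then
        grep ^flags: "/proc/$pid/fdinfo/${fd##*/}"
    fi
done
# A flags value with the 040000 bit set means the file is open with
# O_DIRECT, i.e. the host page cache is bypassed.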