I am seeing a lot of spurious I/O errors that appear to come from the cache-facing side of btrfs. While running a heavy load with some extent sharing (e.g. building 20 Linux kernels at once from source trees copied with 'cp -a --reflink=always'), some files will return EIO on read. This happens often enough to prevent a kernel build from completing about 1/3 of the time.
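For reference, a rough sketch of the kind of workload that triggers it;
the path, tree name, and job counts below are illustrative rather than
my exact setup:

  # assumes /srv/btrfs is the affected filesystem and linux-src is a
  # configured kernel source tree living on it
  cd /srv/btrfs
  for i in $(seq 1 20); do
      cp -a --reflink=always linux-src linux-build-$i
  done
  for i in $(seq 1 20); do
      ( cd linux-build-$i && make -j4 >build.log 2>&1 ) &
  done
  wait
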
I believe the I/O errors to be spurious because:
- there is no kernel message of any kind during the event
- scrub detects 0 errors
- device stats report 0 errors
- the drive firmware reports nothing wrong through SMART
- there seems to be no attempt to read the disk when the error
is reported
- "sysctl vm.drop_caches={1,2}" makes the I/O error go away.
Files become unreadable at random, and stay unreadable indefinitely;
however, any time I discover a file that gives EIO on read, I can
poke vm.drop_caches and make the EIO go away. The file can then be
read normally and has correct contents. The disk does not seem to be
involved in the I/O error return.
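
A concrete example of the pattern, with a hypothetical file name:

  f=/srv/btrfs/linux-build-3/vmlinux   # hypothetical affected file
  cat "$f" >/dev/null || echo "read failed with status $?"
  sysctl vm.drop_caches=1
  md5sum "$f"   # now succeeds and the contents are correct
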
This seems to happen more often while snapshots are being deleted;
however, it also occurs on systems that currently have no snapshots
(though those systems did have snapshots in the past).
When a file returns EIO on read, other snapshots of the same file also
return EIO on read. I have not been able to test whether this affects
reflink copies (clones) as well.
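
If someone wants to check the reflink case, I would expect something
like this to do it (file names hypothetical, untested here):

  # clone an affected file on the same filesystem (no data is read,
  # only extent references are shared), then see whether the clone
  # also returns EIO before caches are dropped
  cp --reflink=always /srv/btrfs/affected-file /srv/btrfs/affected-clone
  cat /srv/btrfs/affected-clone >/dev/null; echo "clone read status: $?"
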
Observed on kernels 3.17 through 3.18.3. All affected filesystems use
skinny-metadata; no filesystem without skinny-metadata seems to have
this problem.
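
For anyone comparing filesystems: the flag can be checked in the
superblock with btrfs-show-super (or 'btrfs inspect-internal
dump-super' in newer btrfs-progs); SKINNY_METADATA is bit 0x100 in
incompat_flags:

  # /dev/sdX is a placeholder for a device of the filesystem
  btrfs-show-super /dev/sdX | grep incompat_flags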
