On Wed, Apr 24, 2019 at 12:49 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
>
>
> On 2019/4/23 10:50 PM, Filipe Manana wrote:
> > On Tue, Apr 23, 2019 at 1:14 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2019/4/23 7:33 PM, David Sterba wrote:
> >>> On Tue, Apr 23, 2019 at 10:16:32AM +0800, Qu Wenruo wrote:
> >>>> On 2019/4/23 5:09 AM, Jakob Unterwurzacher wrote:
> >>>>> I have a user who is reporting ENOSPC errors when running gocryptfs on
> >>>>> top of btrfs (ticket: https://github.com/rfjakob/gocryptfs/issues/395 ).
> >>>>>
> >>>>> What is interesting is that the error gets thrown at write time. This
> >>>>> is not supposed to happen, because gocryptfs does
> >>>>>
> >>>>> fallocate(..., FALLOC_FL_KEEP_SIZE, ...)
> >>>>>
> >>>>> before writing.
> >>>>>
> >>>>> I wrote a minimal reproducer in C:
> >>>>> https://github.com/rfjakob/fallocate_write
> >>>>> This is what it looks like on ext4:
> >>>>>
> >>>>> $ ../fallocate_write/fallocate_write
> >>>>> reading from /dev/urandom
> >>>>> writing to ./blob.379Q8P
> >>>>> writing blocks of 132096 bytes each
> >>>>> [...]
> >>>>> fallocate failed: No space left on device
> >>>>>
> >>>>> On btrfs, it will instead look like this:
> >>>>>
> >>>>> [...]
> >>>>> pwrite failed: No space left on device
> >>>>>
> >>>>> Is this a bug in btrfs' fallocate implementation or am I reading the
> >>>>> guarantees that fallocate gives me wrong?
> >>>>
> >>>> Since v4.7, this commit changed the how btrfs do NodataCOW check:
> >>>> c6887cd11149 ("Btrfs: don't do nocow check unless we have to").
> >>>>
> >>>> Before that commit, btrfs always check if they need to reserve space for
> >>>> COW, while after that patch, btrfs never checks unless we have no space.
> >>>>
> >>>> However this screws up other nodatacow space check.
> >>>> And due to its age and deep changeset, it's pretty hard to fix it.
> >>>> I have tried several times, but it will only cause more problems.
> >>>
> >>> What if the commit is reverted, if the problem is otherwise hard to fix?
> >>
> >> Tried reverted, but all other problems came up.
> >
> > I haven't seen an explanation on why that patch causes ENOSPC or what
> > nodatacow space check screw ups it causes.
> >
> > It seems fine to me, and what we currently do:
> >
> > 1) For any buffered write, check if there's enough free data space;
> > 2) If not try to allocate a new data chunk;
> > 3) If that fails check if the file has the "have prealloc extents"
> > flag or has the nodatacow flag set
> > 4) If any of those conditions is true, check if we can write to the
> > existing extent - if it's not shared or no checksums exist in its
> > range, meaning it's an unwritten (prealloc) extent, return success to
> > userspace
> >
> > So what's wrong with it? And how does it cause the ENOSPC?
>
> E.g.
>
> We have a 128Mb preallocated file extent.
> And assume the fs only have 128M free data space, meaning 0 remaining
> space at all.
That's a contradicting sentence...

>
> Then we try to buffer write, which means buffered will just fail as it
> will need data space.
>
> The idea is always here for fallocate/pwrite, just the timing where the
> ENOSPC happens.

Can't make sense of that sentence either.

So I suppose what you are trying to say is that a write into an
unwritten extent causes space allocation, and that can prevent some
other write (which is not into an unwritten extent) from being able to
allocate space, so that other write fails. That's a valid problem, but
it should be temporary.

However, when allocating space for a write into an unwritten extent (or
any nodatacow write) we increment the data space info's bytes_may_use
counter. Then, when writeback starts, if we don't need to fall back to
CoW, we never decrement the bytes_may_use counter (even after writeback
completes), leaking it.

Not sure if this is the problem you were mentioning, or if you meant
other writes temporarily failing.

thanks

>
>
> We have btrfs/153 for the same reason to fail for a long time, although
> it's from quota, but the reason the completely the same.
>
> Thanks,
> Qu
>
> >
> > Trying the reproducer, at least on a 5.0 kernel, does never fail on a
> > pwrite for me, but always on fallocate:
> >
> > $ mkfs.btrfs -f -b $((4 * 1024 * 1024 * 1024)) /dev/sdi
> > $ mount /dev/sdi /mnt/sdi
> > $ cd /mnt/sdi
> > $ /path/to/reproducer
> > reading from /dev/urandom
> > writing to ./blob.IIa6tH
> > writing blocks of 132096 bytes each
> > total 125 MiB, 65.52 MiB/s
> > total 251 MiB, 44.59 MiB/s
> > total 377 MiB, 55.23 MiB/s
> > total 503 MiB, 66.21 MiB/s
> > total 629 MiB, 59.97 MiB/s
> > total 755 MiB, 3.70 MiB/s
> > total 881 MiB, 50.24 MiB/s
> > total 1007 MiB, 64.51 MiB/s
> > total 1133 MiB, 50.70 MiB/s
> > total 1259 MiB, 49.29 MiB/s
> > total 1385 MiB, 47.93 MiB/s
> > total 1511 MiB, 4.00 MiB/s
> > total 1637 MiB, 49.85 MiB/s
> > total 1763 MiB, 48.11 MiB/s
> > total 1889 MiB, 66.62 MiB/s
> > total 2015 MiB, 5.60 MiB/s
> > total 2141 MiB, 19.58 MiB/s
> > total 2267 MiB, 64.80 MiB/s
> > total 2393 MiB, 13.23 MiB/s
> > total 2519 MiB, 14.95 MiB/s
> > fallocate failed: No space left on device
> >
> > So either that was tested on a rather old kernel or:
> >
> > 1) we had snapshotting happening between a fallocate and a pwrite (or
> > at the same time as the pwrite)
> > 2) before the pwrite (or during) the unwritten/prealloc extent was
> > reflinked (cp --reflink, clone or dedupe ioctls)
> >
> > What did I miss here?
> >
> > Thanks.
> >
> >>
> >> E.g. reserved space underflow.
> >>
> >> I'll find the old thread and retry again.
> >>
> >> Thanks,
> >> Qu
> >>
> >>> This seems to break the semantics of fallocate so the performance should
> >>> not the main concern here.
> >>>
> >>
> >
>
>

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
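
For reference, the four-step buffered write check quoted above can be sketched as a small userspace toy model. This is only an illustration, not btrfs code: the struct names, field names and the function toy_check_buffered_write() are all made up here, chunk allocation is reduced to moving all unallocated space into the free pool at once, and step 4 is read as "not shared and no checksums in the range", which is an assumption about the intended condition.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only. All names are hypothetical; this is not btrfs code. */
struct toy_space {
	uint64_t free_data_bytes;   /* free space in already allocated data chunks */
	uint64_t unallocated_bytes; /* raw space still usable for new data chunks */
};

struct toy_file {
	bool has_prealloc_flag; /* file has the "have prealloc extents" flag */
	bool nodatacow;         /* file has the nodatacow flag set */
	bool extent_shared;     /* target extent is shared (snapshot, reflink) */
	bool range_has_csums;   /* checksums exist in the target range */
};

/* Returns 0 if the write may proceed, -ENOSPC otherwise. */
int toy_check_buffered_write(struct toy_space *sp,
			     const struct toy_file *f, uint64_t len)
{
	assert(sp != NULL && f != NULL);

	/* 1) Enough free data space: reserve it and succeed. */
	if (sp->free_data_bytes >= len) {
		sp->free_data_bytes -= len;
		return 0;
	}

	/* 2) Otherwise try to allocate a new data chunk (simplified:
	 *    all unallocated space becomes free data space at once). */
	if (sp->unallocated_bytes > 0) {
		sp->free_data_bytes += sp->unallocated_bytes;
		sp->unallocated_bytes = 0;
		if (sp->free_data_bytes >= len) {
			sp->free_data_bytes -= len;
			return 0;
		}
	}

	/* 3) Only prealloc or nodatacow files get the in-place check. */
	if (!f->has_prealloc_flag && !f->nodatacow)
		return -ENOSPC;

	/* 4) Assumed reading: the write can go in place when the extent
	 *    is not shared and no checksums exist in the range. */
	if (!f->extent_shared && !f->range_has_csums)
		return 0;

	return -ENOSPC;
}
```

In this model, a write into an unshared prealloc extent succeeds even with zero free and zero unallocated space, while setting extent_shared (the snapshot or reflink cases listed above) makes the same call return -ENOSPC at write time.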