On Fri, Feb 12, 2010 at 8:32 AM, Josef Bacik <jo...@redhat.com> wrote:
> On Fri, Feb 12, 2010 at 08:27:00AM -0800, Mike Fedyk wrote:
>> On Fri, Feb 12, 2010 at 8:22 AM, Josef Bacik <jo...@redhat.com> wrote:
>> > On Fri, Feb 12, 2010 at 08:18:01AM -0800, Mike Fedyk wrote:
>> >> On Fri, Feb 12, 2010 at 7:19 AM, Josef Bacik <jo...@redhat.com> wrote:
>> >> > On Thu, Feb 11, 2010 at 08:50:48PM -0800, Mike Fedyk wrote:
>> >> >> On Thu, Feb 11, 2010 at 7:11 PM, Chris Ball <c...@laptop.org> wrote:
>> >> >> > > echo x1 > /mnt/x/d/foo.txt || exit 2
>> >> >> > > btrfsctl -s /mnt/x/snap /mnt/x/d
>> >> >> >
>> >> >> > You're just missing a sync/fsync() between these two lines.
>> >> >> >
>> >> >> > We argued on IRC a while ago about whether this is a sensible default;
>> >> >> > cmason wants the no-sync version of snapshot creation to be available,
>> >> >> > but was amenable to the idea of changing the default to be sync before
>> >> >> > snapshot, since it was pointed out that no-one other than him had
>> >> >> > understood we were supposed to be running sync first.
>> >> >> >
>> >> >> You're saying that it only snapshots the on-disk data structures and
>> >> >> not the in-memory versions? That can only lead to pain. What do you
>> >> >> do if something else during this race condition? What would a sync do
>> >> >> to solve this? Have the semantics of sync been changed in btrfs from
>> >> >> "sync everything that hasn't been written yet" to "sync this
>> >> >> subvolume"?
>> >> >>
>> >> >
>> >> > Welcome to delalloc. You either get fast writes or you get all of your
>> >> > data on the disk every 5 seconds. If you don't like delalloc, use ext3.
>> >> > The data you've written to memory doesn't go down to disk unless
>> >> > explicitly told to, such as
>> >> >
>> >> > 1) fsync - this is obvious
>> >> > 2) vm - the vm has decided that this dirty page has been sitting around
>> >> > long enough and should be written back to the disk, could happen now,
>> >> > could happen 10 years from now.
>> >> > 3) sync - this is not as obvious. sync doesn't mean anything than "start
>> >> > writing back dirty data to the fs", and returns before it's done. For
>> >> > btrfs what that means is we run through _every_ inode that has delalloc
>> >> > pages associated with them and start writeback on them. This will get
>> >> > most of your data into the current transaction, which is when the
>> >> > snapshot happens.
>> >> >
>> >> > If you don't want empty files, do something like this
>> >> >
>> >> > btrfsctl -c /dir/to/volume
>> >> > btrfsctl -s /dir/to/volume/snapshotname /dir/to/volume
>> >> >
>> >> > this is what we do with yum and its rollback plugin, and it works out
>> >> > quite well. Thanks,
>> >> >
>> >>
>> >> Then you broke your ordering guarantee. If the data isn't there, the
>> >> meta-data shouldn't be there either. So the snapshots made before the
>> >> data hits a transaction shouldn't have the file at all.
>> >
>> > Nope, what is happening is
>> >
>> > fd = creat("file") <- this is metadata that needs to be written
>> > write(fd, buf) <- because of delalloc there is no metadata that is created
>> >                   for this operation, therefore it doesn't need to be
>> >                   written out.
>> > close(fd)
>> >
>> > so the file has metadata created for it, which needs to be written out.
>> > Because of delalloc there are no extents created or anything for the data,
>> > therefore there is nothing to write. Thanks,
>> >
>>
>> So file creation is effectively synchronous? So I could create a
>> benchmark that creates millions of files and it would be limited to
>> the IO OP performance of the disks?
>>
>> Why does file creation need to hit the disk before the contents (with
>> limits to size of data that can fit in one transaction)?
>
> File creation isn't synchronous, it just modifies metadata, which needs to be
> committed when the transaction commits. So if you creat millions of files you
> are going to be held up every 30 seconds as the transaction commits and writes
> all the files you were able to create within that 30 seconds, same as _any_
> filesystem that does ordered mode.
>
> Creating a file is a metadata operation, and _any_ metadata operation has to
> be committed to disk when the transaction commits in order to maintain a
> coherent fs. Thanks,
>
Thanks, I understand better now.

What I still don't understand though is that the create could have taken up to 30 seconds to commit and the same for the few bytes of data, but a few ms later a snapshot was made and the metadata change was there and the data change was not.

Could it have happened that the snapshot would not have the newly created file and this was just a timing issue that should not be relied upon? I'm just wondering why that file was there at all.
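
For reference, the create/write/close sequence discussed in this thread, with the fsync that Chris Ball said was missing, might look roughly like the sketch below. It is only an illustration in Python: the helper name `write_durably` and the temporary path are made up for the example, and nothing here is taken from btrfs itself. The assumption, following the advice quoted in this thread, is that flushing the file's data before the snapshot is taken is what keeps the snapshot from containing an empty file.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Create `path`, write `data`, and fsync before closing.

    Hypothetical helper: the name is not from the thread. It mirrors
    creat()/write()/close() with an fsync() added before close().
    """
    # Creating the file is the metadata operation described in the thread.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # With delayed allocation the data may still be only in memory here;
        # fsync asks the kernel to push this file's data to the device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "foo.txt")
        write_durably(target, b"x1\n")
        # A snapshot command, such as the btrfsctl -s call quoted in this
        # thread, would be run at this point, after the data is flushed.
        print(os.path.getsize(target))
```

This flushes only the one file. The `btrfsctl -c` call quoted in the thread would be the alternative when many files have been written before the snapshot.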