Some clarifications: > Patchset based on 'tmp' branch e6bd18d8938986c997c45f0ea95b221d4edec095. All patches are against btrfs-progs.
==== The rest of rambling is about kernel code, which handles supers. I have read what I've wrote last night (braindump of insane!) and will try to elaborate a bit: > Poking at valgrind warnings I have noticed very worrying problem. > When we (over)write superblock we take 4096 continuous bytes in memory. The common pattern in disk-io.c is the following: struct btrfs_super_block *sb = root->fs_info->sb // or ->sb_for_commit memcpy(dest, sb, BTRFS_SUPER_INFO_SIZE /* 4096 */); With a memcpy we go out of sizeof(*sb) (~2.5K) and pick more fields from fs_info struct. Let's look at them: struct btrfs_fs_info { ... struct btrfs_super_block super_copy; struct btrfs_super_block super_for_commit; struct block_device *__bdev; struct super_block *sb; struct inode *btree_inode; struct backing_dev_info bdi; struct mutex trans_mutex; struct mutex tree_log_mutex; struct mutex transaction_kthread_mutex; struct mutex cleaner_mutex; struct mutex chunk_mutex; struct mutex volume_mutex; struct mutex ordered_operations_mutex; struct rw_semaphore extent_commit_sem; struct rw_semaphore cleanup_work_sem; struct rw_semaphore subvol_sem; struct srcu_struct subvol_srcu; struct list_head trans_list; struct list_head hashers; struct list_head dead_roots; struct list_head caching_block_groups; spinlock_t delayed_iput_lock; struct list_head delayed_iputs; atomic_t nr_async_submits; ... So we copyout (and even checksum!) atomic counters and other volatile stuff (I haven't looked at mutexes and semaphores, but i'm sure there is yet some volatile stuff) > In kernel the structures reside in btrfs_fs_info structure, so we compute > CRC for: > struct btrfs_super_block super_copy; > struct btrfs_super_block super_for_commit; > and then write it to disk. [H]ere we have 2 issues: > 1. kernel pointers and other random stuff leaks out to kernel. > It's nondeterministic and leaks out data (not too bad, > as it should be accessible only for root, but still) > 2. more serious: is there guarantee, that noone will kick-in > between CRC computation and superblock outwrite? > > What if some of mutexes, semaphores or lists will change > it's internal state? Some async thread will kick it > an we will end-up writing superblock with invalid CRC! > This might well be the cause of recend superblock > corruptions under heavy load + hangup retorted to the list. > > Consider the following call chain: > [somewhere in write_dev_supers ...] > > bh->b_end_io = btrfs_end_buffer_write_sync; > crc = ~(u32)0; > crc = btrfs_csum_data(NULL, (char *)sb + > BTRFS_CSUM_SIZE, crc, > BTRFS_SUPER_INFO_SIZE - > BTRFS_CSUM_SIZE); > btrfs_csum_final(crc, sb->csum); Now the problem should be a bit more clear: sb is a thing pointing to the middle of fs_info. We checksum it with data after it in fs_info. > bh = __getblk(device->bdev, bytenr / 4096, > BTRFS_SUPER_INFO_SIZE); > > memcpy(bh->b_data, sb, BTRFS_SUPER_INFO_SIZE); and here we write all the checksummed contents. Is there guard, which prevents things from updating fs_info? > > /* one reference for submit_bh */ > get_bh(bh); > > set_buffer_uptodate(bh); > lock_buffer(bh); > bh->b_end_io = btrfs_end_buffer_write_sync; Am I too paranoid about the issue? Thanks! -- Sergei
signature.asc
Description: PGP signature