I'll look into this after I finish the cgroups scheduler issue I'm currently working on. Thanks,
Josef

Sent from my iPhone

> On May 12, 2017, at 5:41 AM, Nikolay Borisov <[email protected]> wrote:
>
> Hello,
>
> I've been investigating a severe performance regression in generic/027 on
> latest master. Specifically, I traced it back to the rm portion of the test:
> rm now takes 4 times longer to delete the same number of files. After
> bisecting, I identified the culprit as 957780eb2788 ("Btrfs: introduce
> ticketed enospc infrastructure"). With this commit applied, rm takes 65
> seconds to delete around 10k files, whereas before it took 15 seconds. I
> guess this is only triggered when there isn't much available metadata
> space, since no one has complained so far.
>
> Further analysis showed that the extra time comes from the fact that there
> are a lot more transaction commits happening with the ticketed enospc.
> Here are some stats obtained via systemtap instrumentation of flush_space
> (a sketch of the probes follows the stack traces below):
>
> Good:
> btrfs_trans_commit: 1088
> FLUSH_DELAYED_ITEMS_NR => 12281
> FLUSH_DELAYED_ITEMS => 6151
> FLUSH_DELALLOC => 6124
> FLUSH_DELALLOC_WAIT => 6124
> ALLOC_CHUNK => 6124
> COMMIT_TRANS => 1087
>
> Bad:
> btrfs_trans_commit: 4711
> FLUSH_DELAYED_ITEMS_NR => 9606
> FLUSH_DELAYED_ITEMS => 4709
> FLUSH_DELALLOC => 4709
> FLUSH_DELALLOC_WAIT => 4709
> ALLOC_CHUNK => 4709
> COMMIT_TRANS => 4709
>
> So we do 4 times more transaction commits. I've also traced the two
> branches in reserve_metadata_bytes/__reserve_metadata_bytes (good/bad
> respectively), and here are the numbers:
>
> Good:
> Failures: 9383
>
> Bad:
> Failures: 1579
>
> Those numbers count how many times we took the
>
> if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
>
> branch in the good case, and how many times we executed the 'queue_work'
> function call in the bad case, i.e. the number of times we had to
> flush_space. Here are the stack traces of btrfs_commit_transaction:
>
> Good:
> 0xffffffff81387ca0 : btrfs_commit_transaction+0x0/0xb40 [kernel]
> 0xffffffff813726eb : flush_space+0x9b/0x4e0 [kernel]
> 0xffffffff81372fe4 : reserve_metadata_bytes+0x2f4/0x710 [kernel]
> 0xffffffff81373425 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81388bd3 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff813891fa : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81392785 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d0b40 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
>
> Naturally.
>
> Bad:
> 0xffffffff81388b30 : btrfs_commit_transaction+0x0/0xb40 [kernel]
> 0xffffffff81372d73 : flush_space+0x93/0x510 [kernel]
> 0xffffffff813732f3 : btrfs_async_reclaim_metadata_space+0x103/0x460 [kernel]
> 0xffffffff8106d86f : process_one_work+0x1cf/0x660 [kernel]
> 0xffffffff8106dd4e : worker_thread+0x4e/0x490 [kernel]
> 0xffffffff810745ae : kthread+0xfe/0x120 [kernel]
> 0xffffffff816d1c0f : ret_from_fork+0x1f/0x40 [kernel]
> 0xffffffff810744b0 : kthread+0x0/0x120 [kernel]
>
> Also no surprises here, since flushing is always async now.
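>
> For reference, the counters above were collected with SystemTap probes
> along these lines. This is a minimal sketch rather than the exact script:
> it assumes btrfs is built into the kernel (as the [kernel] markers in the
> traces suggest), that debuginfo is available so the static flush_space()
> is probeable, and that its flush-state argument is named 'state', as in
> kernels of this vintage:
>
>     global commits, flush_states
>
>     # Count every transaction commit.
>     probe kernel.function("btrfs_commit_transaction") {
>         commits++
>     }
>
>     # Histogram of flush_space() calls by flush state
>     # (FLUSH_DELAYED_ITEMS_NR ... COMMIT_TRANS).
>     probe kernel.function("flush_space") {
>         flush_states[$state]++
>     }
>
>     probe end {
>         printf("btrfs_trans_commit: %d\n", commits)
>         foreach (s+ in flush_states)
>             printf("state %d => %d\n", s, flush_states[s])
>     }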
> Further breaking down which site actually queued
> btrfs_async_reclaim_metadata_space reveals the following:
>
> traces[
> 0xffffffff81373774 : __reserve_metadata_bytes+0x124/0xa80 [kernel]
> 0xffffffff813740f4 : reserve_metadata_bytes+0x24/0x1c0 [kernel]
> 0xffffffff813742b5 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81389a63 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff8138a08a : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81393615 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d19c0 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
> ] = 3905
>
> __reserve_metadata_bytes+0x124/0xa80 is the queue_work from the
> if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) branch,
> i.e. when we managed to do our reservation but would still like to
> do preemptive space flushing.
>
> traces[
> 0xffffffff81373eed : __reserve_metadata_bytes+0x89d/0xa80 [kernel]
> 0xffffffff813740f4 : reserve_metadata_bytes+0x24/0x1c0 [kernel]
> 0xffffffff813742b5 : btrfs_block_rsv_add+0x25/0x60 [kernel]
> 0xffffffff81389a63 : start_transaction+0x3f3/0x570 [kernel]
> 0xffffffff8138a08a : btrfs_start_transaction_fallback_global_rsv+0x2a/0x200 [kernel]
> 0xffffffff81393615 : btrfs_unlink+0x35/0xb0 [kernel]
> 0xffffffff811c0e0a : vfs_unlink+0xba/0x1a0 [kernel]
> 0xffffffff811c5cdb : do_unlinkat+0x24b/0x2a0 [kernel]
> 0xffffffff811c646b : SyS_unlinkat+0x1b/0x30 [kernel]
> 0xffffffff816d19c0 : entry_SYSCALL_64_fastpath+0x23/0xc1 [kernel]
> ] = 1579
>
> __reserve_metadata_bytes+0x89d/0xa80 is the queue_work from the
> if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { branch, i.e. when we
> failed to do our reservation. (A sketch of the script that produced
> these per-site counts is at the very end of this mail.)
>
> So with the ticketed ENOSPC code we actually have fewer reservation
> failures but, ironically, a lot more transaction commits. The end result
> is slower operations.
>
> I'm also sharing two files with a lot more debugging output, printing the
> space_info state on reservation failure in both the good (pre ticketed
> enospc) and the bad (post ticketed enospc) case:
>
> good: http://sprunge.us/QBSc
>
> bad: http://sprunge.us/PaNI
>
> I'm running out of ideas where this might be happening or how to further
> debug it. Any pointers would be much appreciated. I've had an off-list
> discussion with Josef about this, and one thing he suggested is that we
> might be hitting ->failfast in use_block_rsv; however, in both cases that
> code was never triggered. All measurements were made only during the rm
> portion of the test, not the whole "Create files and delete them" phase.
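>
> For completeness, the per-site counts were gathered roughly as follows.
> Again a minimal sketch, not the exact script: queue_work() is an inline
> wrapper, so the probe sits on the underlying queue_work_on() and filters
> on the symbolic backtrace; the two call sites then show up as distinct
> __reserve_metadata_bytes offsets (+0x124 vs +0x89d) in the aggregated
> keys:
>
>     global traces
>
>     probe kernel.function("queue_work_on") {
>         # Keep only workqueue submissions issued from the
>         # metadata reservation path; ignore everything else.
>         bt = sprint_backtrace()
>         if (isinstr(bt, "__reserve_metadata_bytes"))
>             traces[bt]++
>     }
>
>     probe end {
>         foreach (bt in traces)
>             printf("traces[%s] = %d\n", bt, traces[bt])
>     }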
