Excerpts from cwillu's message of 2011-08-01 19:28:35 -0400:
> On Mon, Aug 1, 2011 at 12:21 PM, Chris Mason <[email protected]> wrote:
> > Excerpts from Josef Bacik's message of 2011-08-01 14:01:35 -0400:
> >> On 08/01/2011 01:54 PM, Chris Mason wrote:
> >> > Excerpts from Josef Bacik's message of 2011-08-01 12:03:34 -0400:
> >> >> On 08/01/2011 11:45 AM, Chris Mason wrote:
> >> >>> Excerpts from Josef Bacik's message of 2011-08-01 11:21:34 -0400:
> >> >>>> Hello,
> >> >>>>
> >> >>>> We've seen a lot of reports of people having these constant long pauses
> >> >>>> when doing things like sync or such. The stack traces usually all look
> >> >>>> the same, one is btrfs-transaction stuck in btrfs_wait_marked_extents
> >> >>>> and one is btrfs-submit-# stuck in get_request_wait. I had originally
> >> >>>> thought this was due to the new plugging stuff, but I think it just
> >> >>>> makes the problem happen more quickly as we've seen that 2.6.38 which we
> >> >>>> thought was ok will still have the problem happen if given enough time.
> >> >>>>
> >> >>>> I _think_ this is because of the way we write out metadata in the
> >> >>>> transaction commit phase. We're doing write_on_page for every dirty
> >> >>>> page in the btree during the commit. This sucks because basically we
> >> >>>> end up with one bio per page, which makes us blow out our nr_requests
> >> >>>> constantly, which is why btrfs-submit-# is always stuck in
> >> >>>> get_request_wait. What we need to do instead is use filemap_fdatawrite
> >> >>>> which will do a WB_SYNC_ALL but will do it via writepages, so hopefully
> >> >>>> we will get less bios and this problem will go away. Please try this
> >> >>>> very hastily put together patch if you are experiencing this problem and
> >> >>>> let me know if it fixes it for you. Thanks,
> >> >>>
> >> >>> I'm definitely curious to hear if this helps, but I think it might cause
> >> >>> a different set of problems. It writes everything that is dirty on the
> >> >>> btree, which includes a lot of things we've cow'd in the current
> >> >>> transaction and marked dirty. They will have to go through COW again
> >> >>> if someone wants to modify them again.
> >> >>>
> >> >>
> >> >> But this is happening in the commit after we've done all of our work, we
> >> >> shouldn't be dirtying anything else at this point right?
> >> >
> >> > The commit code is setup to unblock people before we start the IO:
> >> >
> >> > trans->transaction->blocked = 0;
> >> > spin_lock(&root->fs_info->trans_lock);
> >> > root->fs_info->running_transaction = NULL;
> >> > root->fs_info->trans_no_join = 0;
> >> > spin_unlock(&root->fs_info->trans_lock);
> >> > mutex_unlock(&root->fs_info->reloc_mutex);
> >> >
> >> > wake_up(&root->fs_info->transaction_wait);
> >> >
> >> > ret = btrfs_write_and_wait_transaction(trans, root);
> >> >
> >> > So, we should have concurrent FS mods for a new transaction while we are
> >> > writing out this old transaction.
> >> >
> >>
> >> Ah right, but then this brings up another question, we shouldn't cow
> >> them again since we would have set the new transid. And isn't this kind
> >> of bad, since somebody could come in and dirty a piece of metadata
> >> before we have a chance to write it out for this transaction, so we end
> >> up writing out the new data instead of what we are trying to commit?
> >
> > I think we're mixing together different ideas here. If we're doing a
> > commit on transaction N, we allow N+1 to start while we're doing the
> > btrfs_write_and_wait_transaction(). N+1 might allocate and dirty a new
> > block, which btrfs_write_and_wait_transaction might start IO on.
> >
> > Strictly speaking this isn't a problem. It doesn't break any rules of
> > COW because we're allowed to write metadata at any time. But, once we
> > do write it, we must COW it again if we want to change it. So, anything
> > that btrfs_write_and_wait_transaction() catches from transaction N+1 is
> > likely to make more work for us because future mods will have to
> > allocate a new block. Basically it's wasted IO.
> >
> > But, it's also free IO, assuming it was contiguous. The problem is that
> > write_cache_pages isn't actually making sure it was contiguous, so we
> > end up doing many more writes than we could have.
>
> First user ("youagree") reported back on irc:
>
> <youagree> guys, just came to report its much worse with josef's patch
> <youagree> now i can hardly start anything, it's slowed down most of the time

Josef's filemap_fdatawrite patch? He sent a second one to the list that gets
rid of the extra IO done by the current code. That's the one we hope will
fix things.

-chris
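
To illustrate the contiguity point from the quoted discussion, here is a toy
model in plain C. It is not kernel code and not either of Josef's patches;
the function names (requests_per_page, requests_coalesced) and the notion of
a "request" as a run of adjacent page indices are assumptions made only for
this sketch. It also ignores any upper limit on request size. It compares the
number of requests issued when every dirty page is submitted on its own with
the number issued when adjacent dirty pages are merged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one request per dirty page (per-page submission). */
size_t requests_per_page(size_t npages)
{
    return npages;
}

/*
 * Hypothetical model: merge each run of adjacent page indices into a
 * single request. `pages` must be sorted in strictly ascending order.
 */
size_t requests_coalesced(const unsigned long *pages, size_t npages)
{
    size_t requests = 0;

    for (size_t i = 0; i < npages; i++) {
        /* Input contract: strictly ascending page indices. */
        assert(i == 0 || pages[i] > pages[i - 1]);

        /* A new request starts at the first page or after any gap. */
        if (i == 0 || pages[i] != pages[i - 1] + 1)
            requests++;
    }
    return requests;
}
```

For dirty pages 10, 11, 12, 40, 41 and 97, per-page submission gives six
requests, while merging adjacent pages gives three. The more fragmented the
dirty set, the smaller that gap becomes.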
