On wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote:
> Hi Miao,
> I attempted to fix the issue by not joining a transaction that has
> trans->in_commit set. I did something similar to what
> wait_current_trans() does, but I did:
> 
> smp_rmb();
> if (cur_trans && cur_trans->in_commit) {
> ...
> wait_event(root->fs_info->transaction_wait,  !cur_trans->blocked);
> ...

But it will introduce deadlock if we need flush some dirty pages, for
example: run ordered operation.

> 
> I also had to change the order of setting in_commit and blocked in
> btrfs_commit_transaction:
>       trans->transaction->blocked = 1;
>       trans->transaction->in_commit = 1;
>       smp_wmb();
> to make sure that if in_commit is set, then blocked cannot be 0,
> because btrfs_commit_transaction haven't set it yet to 1.

we need smp_wmb() between 
        trans->transaction->blocked = 1;
and
        trans->transaction->in_commit = 1;

Or the cpu may set blocked after in_commmit.

> However, with this fix I observe two issues:
> # With large trees and heavy commits, join_transaction() is delayed
> sometimes by 1-3 seconds. This delays the host IO by too much.
> # With this fix, I think too many transactions happen. Basically with
> this fix, once transaction->in_commit is set, then I insist to open a
> new transaction and not to join the current one. It has some bad
> influence on host response times pattern, but I cannot exactly tell
> why is that.
> 
> Did you have other fix in mind?
> 
> Without the fix, I observe sometimes commits that take like 80
> seconds, out of which like 50 seconds are spent in the do-while loop
> of btrfs_commit_transaction.

I'm making the patch to fix this problem, my fix is:
- don't flush the dirty page during the commit if we create a snapshot
- introduce a new counter to count the external 
writers(TRANS_USERSPACE/TRANS_START)
  and if this counter is zero, we will break the while loop.
- if flushoncommit is set, we start delalloc flush before the while loop, not 
in the
  loop, so we don't flush the dirty pages again and again.
- introduce a new transaction handle type named TRANS_JOIN_ENDIO, which is used 
in the endio
  process.
- introduce a new state for transaction commit, at this state, we block 
TRANS_JOIN, but
  don't block TRANS_JOIN_ENDIO.

Thanks
Miao

> 
> Thanks,
> Alex.
> 
> 
> 
> On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
> <alex.bt...@zadarastorage.com> wrote:
>> Hi Miao,
>>
>> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>>> Hi Miao,
>>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>>> in the way of a committing transaction. But it does not prevent from
>>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>>> following loop:
>>>>
>>>> do {
>>>>     // attempt to do some useful stuff and/or sleep
>>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>>                (should_grow && cur_trans->num_joined != joined));
>>>>
>>>> What I see is basically that new writers join the transaction, while
>>>> btrfs_commit_transaction() does this loop. I see
>>>> cur_trans->num_writers decreasing, but then it increases, then
>>>> decreases etc. This can go for several seconds during heavy IO load.
>>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>>> a transaction over and over, thus delaying transaction commit. The IO
>>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>>
>>>> Do you observe such behavior? Do you believe it's problematic?
>>>
>>> I know this behavior, there is no problem with it, the latter code
>>> will prevent from TRANS_JOIN.
>>>
>>> 1672         spin_lock(&root->fs_info->trans_lock);
>>> 1673         root->fs_info->trans_no_join = 1;
>>> 1674         spin_unlock(&root->fs_info->trans_lock);
>>> 1675         wait_event(cur_trans->writer_wait,
>>> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>>>
>> Yes, this code prevents anybody from joining, but before
>> btrfs_commit_transaction() gets to this code, it may spend sometimes
>> 10 seconds (in my tests) in the do-while loop, while new writers come
>> and go. Basically, it is not deterministic when the do-while loop will
>> exit, it depends on the IO pattern.
>>
>>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>>> will happen because we need deal with the ordered operations which will
>>> use TRANS_JOIN here.
>>>
>>> (I am dealing with the problem you said above by adding a new type of
>>> TRANS_* now)
>>
>> Thanks.
>> Alex.
>>
>>
>>>
>>> Thanks
>>> Miao
>>>
>>>> Thanks,
>>>> Alex.
>>>>
>>>>
>>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>>> Hi Miao,
>>>>>> can you please explain your solution a bit more.
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>>>>>> Now btrfs_commit_transaction() does this
>>>>>>>
>>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>>
>>>>>>> which async flushes all inodes on the ordered operations list, it 
>>>>>>> introduced
>>>>>>> a deadlock that transaction-start task, transaction-commit task and the 
>>>>>>> flush
>>>>>>> workers waited for each other.
>>>>>>> (See the following URL to get the detail
>>>>>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>>>
>>>>>>> As we know, if ->in_commit is set, it means someone is committing the
>>>>>>> current transaction, we should not try to join it if we are not JOIN
>>>>>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can 
>>>>>>> avoid
>>>>>>> the above problem. In this way, there is another benefit: there is no 
>>>>>>> new
>>>>>>> transaction handle to block the transaction which is on the way of 
>>>>>>> commit,
>>>>>>> once we set ->in_commit.
>>>>>>>
>>>>>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
>>>>>>> ---
>>>>>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>>> index bc2f2d1..71b7e2e 100644
>>>>>>> --- a/fs/btrfs/transaction.c
>>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct 
>>>>>>> btrfs_root *root)
>>>>>>>         root->commit_root = btrfs_root_node(root);
>>>>>>>  }
>>>>>>>
>>>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>>>>>> +                                      int type)
>>>>>>> +{
>>>>>>> +       return !(trans->in_commit &&
>>>>>>> +                type != TRANS_JOIN &&
>>>>>>> +                type != TRANS_JOIN_NOLOCK);
>>>>>>> +}
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * either allocate a new transaction or hop into the existing one
>>>>>>>   */
>>>>>>> @@ -86,6 +94,10 @@ loop:
>>>>>>>                         spin_unlock(&fs_info->trans_lock);
>>>>>>>                         return cur_trans->aborted;
>>>>>>>                 }
>>>>>>> +               if (!can_join_transaction(cur_trans, type)) {
>>>>>>> +                       spin_unlock(&fs_info->trans_lock);
>>>>>>> +                       return -EBUSY;
>>>>>>> +               }
>>>>>>>                 atomic_inc(&cur_trans->use_count);
>>>>>>>                 atomic_inc(&cur_trans->num_writers);
>>>>>>>                 cur_trans->num_joined++;
>>>>>>> @@ -360,8 +372,11 @@ again:
>>>>>>>
>>>>>>>         do {
>>>>>>>                 ret = join_transaction(root, type);
>>>>>>> -               if (ret == -EBUSY)
>>>>>>> +               if (ret == -EBUSY) {
>>>>>>>                         wait_current_trans(root);
>>>>>>> +                       if (unlikely(type == TRANS_ATTACH))
>>>>>>> +                               ret = -ENOENT;
>>>>>>> +               }
>>>>>>
>>>>>> So I understand that instead of incrementing num_writes and joining
>>>>>> the current transaction, you do not join and wait for the current
>>>>>> transaction to unblock.
>>>>>
>>>>> More specifically,TRANS_START、TRANS_USERSPACE and TRANS_ATTACH can not
>>>>> join and just wait for the current transaction to unblock if ->in_commit
>>>>> is set.
>>>>>
>>>>>> Which task in Josef's example
>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>>>>>> task 1, task 2 or task 3 is the one that will not join the
>>>>>> transaction, but instead wait?
>>>>>
>>>>> Task1 will not join the transaction, in this way, async inode flush
>>>>> won't run, and then task3 won't do anything.
>>>>>
>>>>> Before applying the patch:
>>>>> Start/Attach_Trans_Task                 Commit_Task                     
>>>>> Flush_Worker
>>>>> (Task1)                                 (Task2)                         
>>>>> (Task3)         -- the name in Josef's example
>>>>> btrfs_start_transaction()
>>>>>  |->may_wait_transaction()
>>>>>  |  (return 0)
>>>>>  |                                      btrfs_commit_transaction()
>>>>>  |                                       |->set ->in_commit and
>>>>>  |                                       |  blocked to 1
>>>>>  |                                       |->wait writers to be 1
>>>>>  |                                       |  (writers is 1)
>>>>>  |->join_transaction()                   |
>>>>>  |  (writers is 2)                       |
>>>>>  |->btrfs_commit_transaction()           |
>>>>>      |                                   |->set trans_no_join to 1
>>>>>      |                                   |  (close join transaction)
>>>>>      |->btrfs_run_ordered_operations     |
>>>>>         (Those ordered operations        |
>>>>>          are added when releasing        |
>>>>>          file)                           |
>>>>>          |->async inode flush()          |
>>>>>          |->wait_flush_comlete()         |
>>>>>                                          |                              
>>>>> work_loop()
>>>>>                                          |                               
>>>>> |->run_work()
>>>>>                                          |                                
>>>>>    |->btrfs_join_transaction()
>>>>>                                          |                                
>>>>>        |->wait_current_trans()
>>>>>                                          |->wait writers to be 1
>>>>>
>>>>> This three tasks waited for each other.
>>>>>
>>>>> After applying this patch:
>>>>> Start/Attach_Trans_Task                 Commit_Task                     
>>>>> Flush_Worker
>>>>> (Task1)                                 (Task2)                         
>>>>> (Task3)
>>>>> btrfs_start_transaction()
>>>>>  |->may_wait_transaction()
>>>>>  |  (return 0)
>>>>>  |                                      btrfs_commit_transaction()
>>>>>  |                                       |->set ->in_commit and
>>>>>  |                                       |  blocked to 1
>>>>>  |                                       |->wait writers to be 1
>>>>>  |                                       |  (writers is 1)
>>>>>  |->join_transaction() fail              |
>>>>>  |  (return -EBUSY, writers is still 1)  |
>>>>>  |->wait_current_trans()                 |
>>>>>                                          |->set trans_no_join to 1
>>>>>                                          |  (close join transaction)
>>>>>                                          |->wait writers to be 1
>>>>>                                          |->continue committing
>>>>>                                                                         
>>>>> (Task3 does nothing)
>>>>>> Also, I think I don't fully understand Josef's example. What is
>>>>>> preventing from async flushing to complete?
>>>>>> Is task 3 waiting because trans_no_join is set?
>>>>>> Is task 3 the one that actually does the delalloc flush?
>>>>>
>>>>> See above.
>>>>>
>>>>> Thanks
>>>>> Miao
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Alex.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>         } while (ret == -EBUSY);
>>>>>>>
>>>>>>>         if (ret < 0) {
>>>>>>> --
>>>>>>> 1.6.5.2
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" 
>>>>>>> in
>>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>>>> the body of a message to majord...@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to