On 02/26/2013 01:00 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Yan, Zheng wrote:
>>> It looks to me like truncates can get queued for later, so that's not the 
>>> case?
>>> And how could the client receive a truncate while in the middle of
>>> writing? Either it's got the write caps (in which case nobody else can
>>> truncate), or it shouldn't be writing...I suppose if somebody enables
>>> lazyIO that would do it, but again in that case I imagine we'd rather
>>> do the truncate before writing:
>>
>> Commit 22cddde104 makes buffered writes get/put caps for each individual
>> page, so the MDS can revoke the write caps in the middle of a write.
>>
>>>
>>>> I think it's better to do vmtruncate after write finishes.
>>> What if you're writing past the truncate point? Then your written data
>>> would be truncated off even though it should have been a file
>>> extension.
>>>
>> I was wrong; I thought the order of write and truncate was not important.
>> Now I'm worried about the correctness of commit 22cddde104; we probably
>> should revert that commit.
> 
> I'm sorry I haven't had time to dig into this thread.  Do you think we 
> should revert that commit for this merge window?

No, that commit fixes an i_size update bug, which is more important than
the deadlock. But I think the eventual fix is to revert that commit and
also remove the optimization that drops Fw caps early (to avoid slow cap
revocation caused by balance_dirty_pages).
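
To make the intent concrete, here is a toy userspace model (not the
actual fs/ceph code; the names are made up and a pthread rwlock stands
in for cap references) of holding Fw for the entire buffered write, so
revocation can only run between writes rather than in the middle of one:

/* Toy model (not ceph code): the writer holds its Fw reference for the
 * whole buffered write, so an MDS-driven revoke can only proceed once
 * no write is in flight. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_rwlock_t fw_caps = PTHREAD_RWLOCK_INITIALIZER;
static char file_data[64];

/* Whole write under one Fw reference (shared lock). */
static void buffered_write(const char *buf, size_t len, size_t off)
{
	pthread_rwlock_rdlock(&fw_caps);        /* "get Fw" once */
	memcpy(file_data + off, buf, len);      /* copy every page */
	pthread_rwlock_unlock(&fw_caps);        /* "put Fw" at the end */
}

/* Revocation must wait for in-flight writes to finish. */
static void revoke_fw_caps(void)
{
	pthread_rwlock_wrlock(&fw_caps);
	/* flush dirty data / truncate here, with no writer in flight */
	pthread_rwlock_unlock(&fw_caps);
}

int main(void)
{
	buffered_write("abc", 3, 0);
	revoke_fw_caps();
	printf("%s\n", file_data);
	return 0;
}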

> 
> My gut feeling is that the whole locking scheme here needs to be reworked.  
> I suspect where we'll end up is something more like XFS, where there is an 
> explicit truncate (or other) mutex that is separate from i_mutex.  The 
> fact that we were relying on i_mutex serialization from the VFS was 
> fragile and (as we can see) ultimately a flawed approach.  And the locking 
> and races between writes, mmap, and truncation in the page cache code are 
> fragile and annoying.  The only good thing going for us is that fsx is 
> pretty good at stress testing the code.  :)

If we want to keep the write operation atomic, write and vmtruncate need to
be protected by a mutex. I don't think introducing a new mutex makes things
simpler.
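
As a rough illustration (again a userspace model with made-up names, not
a proposed patch), this is the kind of serialization I mean: the truncate
side only queues the new size, and the write path applies any pending
truncation under the same mutex before it writes, so a write past the
truncate point is never cut short:

/* Minimal model: writes and queued truncates serialized by one mutex. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct model_inode {
	pthread_mutex_t i_mutex;       /* stands in for inode->i_mutex */
	int64_t         i_size;
	int64_t         truncate_to;   /* -1: no truncate pending */
	char            data[4096];
};

/* Revocation-driven truncate: just queue it, never block on writers. */
static void queue_truncate(struct model_inode *mi, int64_t size)
{
	pthread_mutex_lock(&mi->i_mutex);
	mi->truncate_to = size;
	pthread_mutex_unlock(&mi->i_mutex);
}

/* Write path: apply any pending truncate first, then do the whole
 * write atomically under the same mutex. */
static void do_write(struct model_inode *mi, int64_t off,
		     const char *buf, size_t len)
{
	pthread_mutex_lock(&mi->i_mutex);
	if (mi->truncate_to >= 0) {
		if (mi->truncate_to < mi->i_size)
			memset(mi->data + mi->truncate_to, 0,
			       mi->i_size - mi->truncate_to);
		mi->i_size = mi->truncate_to;
		mi->truncate_to = -1;
	}
	memcpy(mi->data + off, buf, len);
	if (off + (int64_t)len > mi->i_size)
		mi->i_size = off + (int64_t)len;
	pthread_mutex_unlock(&mi->i_mutex);
}

int main(void)
{
	struct model_inode mi = {
		.i_mutex = PTHREAD_MUTEX_INITIALIZER,
		.truncate_to = -1,
	};

	do_write(&mi, 0, "hello", 5);
	queue_truncate(&mi, 0);        /* truncate queued, not applied yet */
	do_write(&mi, 0, "world", 5);  /* truncate applied before the write */
	printf("size=%lld\n", (long long)mi.i_size);
	return 0;
}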

> 
>>> Oh, you're right about the locking (I think, anyway?). However, i_size 
>>> doesn't protect us -- we might for instance have truncated to zero and 
>>> then back up to the original file size. :( But that doesn't look like 
>>> it's handled correctly anyway (racy) -- am I misreading that badly or 
>>> is the infrastructure in fact broken in that case?
>>
>> You are right. Probably we can fix the race by disallowing both read and 
>> write when there is a pending truncate.
> 
> FWIW, I think the high-level strategy before should still be basically 
> sound: we can queue a truncation without blocking, and any write path will 
> apply a pending truncation under the appropriate lock.  But I think it's 
> going to mean carefully documenting the locking requirements for the 
> various paths and working out a new locking structure.
> 
> Yan, is this something you are able to put some time into?  I would like 
> to work on it, but it's going to be hard for me to find the time in the 
> next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng
