On Wed, Feb 8, 2017 at 8:07 PM, Dan van der Ster <d...@vanderster.com> wrote:
> Hi,
>
> This is interesting. Do you have a bit more info about how to identify
> a server that is suffering from this problem? Is there some process
> (xfs* or kswapd?) that we'd see as busy in top or iotop?

That's my question as well. If you could reproduce the issue
intentionally, that would be very helpful.

It would also help if you could describe your cluster environment in a
bit more detail.

>
> Also, which kernel are you using?
>
> Cheers, Dan
>
>
> On Tue, Feb 7, 2017 at 6:59 PM, Thorvald Natvig <thorv...@medallia.com> wrote:
>> Hi,
>>
>> We've encountered a small "kernel feature" in XFS when using
>> FileStore. We have a workaround, and would like to share it in case
>> others hit the same problem.
>>
>> Under high load, on slow storage, with lots of dirty buffers and low
>> memory, there's a design choice with unfortunate side-effects when
>> multiple XFS filesystems are mounted, as is often the case with a
>> JBOD full of drives. The result is stalled network traffic, which
>> leads to OSDs failing heartbeats.
>>
>> In short, when the kernel needs to allocate memory for anything, it
>> first figures out how many pages it needs, then goes to each
>> filesystem and says "release N pages". In XFS, that's implemented as
>> follows (sketched in C below):
>>
>> - For each allocation group (AG; 8 in our case):
>>   - Try to lock the AG
>>   - Release unused buffers, up to N
>> - If this point is reached without having released at least N pages,
>> try again, but this time wait for each AG lock.
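>>
>> Roughly, in simplified C, the shape of that logic (all names here are
>> invented for illustration; this is a sketch, not the kernel source):
>>
>> /* Toy model of the two-pass reclaim. */
>> struct mount { int agcount; };
>> int trylock_ag(struct mount *m, int ag);   /* returns 0 if busy */
>> void lock_ag(struct mount *m, int ag);     /* blocks until held */
>> void unlock_ag(struct mount *m, int ag);
>> int release_unused(struct mount *m, int ag, int max);
>>
>> int reclaim_pages(struct mount *m, int want)
>> {
>>     int freed = 0;
>>
>>     /* Pass 1: trylock only -- skip any AG whose lock is busy. */
>>     for (int ag = 0; ag < m->agcount && freed < want; ag++) {
>>         if (!trylock_ag(m, ag))
>>             continue;
>>         freed += release_unused(m, ag, want - freed);
>>         unlock_ag(m, ag);
>>     }
>>
>>     /* Pass 2: still short of N -- retry, now blocking on each lock.
>>        This is the wait that stalls allocations behind a flusher. */
>>     for (int ag = 0; ag < m->agcount && freed < want; ag++) {
>>         lock_ag(m, ag);
>>         freed += release_unused(m, ag, want - freed);
>>         unlock_ag(m, ag);
>>     }
>>     return freed;
>> }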
>>
>> That second pass is the problem: if a lock is held by, say, another
>> kernel thread that is busy flushing dirty buffers, the memory
>> allocation stalls until the flush completes. Meanwhile, there are 30
>> other XFS filesystems that could release memory, and the kernel also
>> has plenty of non-filesystem memory it could release.
>>
>> This manifests as OSDs going offline during high load, with other
>> OSDs reporting that the OSD stopped responding to health checks. It
>> is especially prevalent during cache-tier flushing and large
>> backfills, which put very heavy load on the write buffers and thus
>> increase the probability of one of these events.
>>
>> In reality, the OSD is stuck in the kernel, trying to allocate
>> buffers to build a TCP packet in reply to a network message. As soon
>> as the dirty buffers are flushed (which can take a while), the OSD
>> recovers, but now has to deal with having been marked down in the
>> monitor maps.
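>>
>> For the curious, the path from a blocked send down into XFS reclaim
>> looks roughly like this (reconstructed from reading 4.x-era sources;
>> exact frames vary by kernel version):
>>
>> /*
>>  * tcp_sendmsg()
>>  *  -> sk_stream_alloc_skb()        allocate an skb for the reply
>>  *  -> __alloc_pages()              no free pages: direct reclaim
>>  *  -> shrink_slab()
>>  *  -> super_cache_scan()
>>  *  -> xfs_fs_free_cached_objects()
>>  *  -> xfs_reclaim_inodes_nr()
>>  *  -> xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr)
>>  *     ...which can block on a per-AG reclaim lock held by a flusher.
>>  */
>>
>> A thread stuck here sits in D state with frames like these visible in
>> /proc/<pid>/stack, which is one way to spot an affected machine.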
>>
>> The following SystemTap script changes the kernel behavior so that it
>> skips the lock-waiting:
>>
>> probe module("xfs").function("xfs_reclaim_inodes_ag").call {
>>  $flags = $flags & 2
>> }
>>
>> Save it to a file and run it with 'stap -v -g -d kernel
>> --suppress-time-limits <path-to-file>'. We've been running this for a
>> few weeks, and the issue is completely gone.
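>>
>> If you want to try it, running stap -l
>> 'module("xfs").function("xfs_reclaim_inodes_ag").call' first will
>> confirm that the probe point resolves on your kernel.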
>>
>> There was a write-up on the XFS mailing list a while ago about the
>> same issue ( http://www.spinics.net/lists/linux-xfs/msg01541.html ),
>> but unfortunately it didn't result in consensus on a patch. The
>> problem won't exist with BlueStore, so we consider the SystemTap
>> approach a workaround until we're ready to deploy BlueStore.
>>
>> - Thorvald