----- Original Message -----
From: "Matthew Ahrens" <[email protected]>
To: "illumos-zfs" <[email protected]>; "Steven Hartland"
<[email protected]>
Cc: "developer" <[email protected]>
Sent: Sunday, September 14, 2014 6:31 PM
Subject: Re: [zfs] ZFS Write Throttle Dirty Data Limit Interation with Free
Memory
On Sun, Sep 14, 2014 at 10:00 AM, Steven Hartland via illumos-zfs <
[email protected]> wrote:
We've been investigating a problem with stalls on FreeBSD when
using ZFS, and one of the current theories, which is producing some
promising results, centres on the new IO scheduler, specifically
on the fact that the dirty data limit is a static limit.
The stalls occur when memory is close to the low water mark around
where paging will be triggered. At this time if there is a burst of
write IO, such as a copy from a remote location, ZFS can rapidly
allocate memory until the dirty data limit is hit.
This rapid memory consumption exacerbates the low memory situation,
resulting in increased swapping and more stalls, to the point where
the machine can essentially become unusable for a good period
of time.
I will say it's not clear whether this only affects FreeBSD, given the
variations in how the VM interacts with ZFS.
Karl, one of the FreeBSD community members who has been suffering
from this issue in his production environments, has been experimenting
with recalculating zfs_dirty_data_max at the start of
dmu_tx_assign(..) to take free memory into account.
While this has produced good results in his environment, eliminating
the stalls entirely while keeping IO usage high, it's not clear whether
varying zfs_dirty_data_max could have undesired side effects.
Given both Adam and Matt read these lists I thought it would be an
ideal place to raise this issue and get expert feedback on this
problem and potential ways of addressing it.
So the questions:
1. Is this a FreeBSD-only issue, or could other implementations
suffer from a similar memory starvation situation due to rapid
consumption until the dirty data max is hit?
2. Should the dirty data max or its consumers be made aware of memory
availability, to ensure that swapping due to IO bursts is avoided?
This is probably the wrong solution.
Are you sure that this only happens when writing, and not when reading?
All arc buffer allocation (including for writing) should go through
arc_get_data_buf(), which will evict from the ARC to make room for the new
buffer if necessary, based on arc_evict_needed().
The load is a mixture of reads and writes, with the trigger in this test
being a large volume of writes over Samba by a backup process, so that
doesn't mean reads can never be a trigger for this.
We've been investigating ARC allocation quite a bit, and the ARC does
indeed get pushed back. Adjusting the ARC's target for free memory has
helped, but any significant adjustment there has been demonstrated to
cause other issues, such as the ARC being pushed back to its minimum for
a considerable amount of time, if not indefinitely, as the VM never sees
any pressure and hence doesn't scan INACT entries.
With regards to buffers being allocated by arc_get_data_buf(), I can't
see a path by which the ARC will prevent a new buffer from being
allocated, even when arc_evict_needed() is true.
If that's the case, can't we hit the ARC minimum and yet still claim new
buffers? If so, we can suddenly demand up to 10% of system memory, all of
which may require the VM to page before it can provide said memory.
Regards
Steve
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer