Hi Chip,

On Oct 28, 2014, at 7:42 AM, Schweiss, Chip via illumos-zfs 
<[email protected]> wrote:

> Andy,
> 
> Count me as interested.  
> 
> Pool import time is a real problem in the HA setups I maintain.  While the 
> setup may be HA, failover events still need careful planning because they take 
> upwards of 10 minutes, which is enough to cause application timeouts, 
> especially for web applications.

Is the long time at import or export? If the latter, then there are some 
workarounds for specific configurations that can apply.
 -- richard

> 
> Pool discovery with over 240 devices attached also typically causes service 
> timeouts at boot.
> 
> Thank you for your efforts on this.  Hopefully your work can make it through 
> to release.
> 
> By <= 32 cores, does that include hyper-threaded cores?  Obviously with the 
> new Haswell Xeons, even 32 full cores is now possible on 2-CPU systems.
> 
> I'm currently building a new system that will have 24 cores + 24 
> hyper-threaded cores, so there may be issues.  This system will be in 
> development mode until January; I'd be happy to test code for you if you need 
> a test bed.
> 
> -Chip
> 
> On Tue, Oct 28, 2014 at 5:07 AM, Andy Stormont via illumos-developer 
> <[email protected]> wrote:
> The import code in libzfs could also use some work.  If I remember correctly, 
> the code that examines disk slices is O(n * n).  Though it’s broken, it seems 
> to work okay on machines with <= 32 cores, but with more than that you’ll 
> likely run into the assert at the end of zpool_open_func, which fires when 
> a thread examining a slice realises another thread has already marked the 
> slice as not containing a pool after it started.
> 
> I had a go at rewriting the code earlier this year to fix those issues; 
> you can find it here:  http://cr.illumos.org/~webrev/andy_js/zpool/
> I’m not sure if this code works or even if it’s the latest version but if 
> there’s interest I can start looking at it again.
> 
> Andy.
> 
>> On 28 Oct 2014, at 08:59, Arne Jansen via illumos-developer 
>> <[email protected]> wrote:
>> 
>> Like this?
>> 
>> http://cr.illumos.org/~webrev/sensille/find_parallel_dp/
>> 
>> -Arne
>> 
>> On 10/16/2014 10:12 PM, Matthew Ahrens via illumos-developer wrote:
>>> (resend to cc a few more lists -- sorry)
>>> 
>>> I think the overall idea of this change is sound -- and that's a great
>>> performance improvement.
>>> 
>>> Can you change dmu_objset_find_parallel() to work like 
>>> dmu_objset_find_dp(), and
>>> then change zil_check_log_chain() and zil_claim() to work off of the
>>> dsl_dataset_t*, rather than the char*name?  Then I think the other threads 
>>> won't
>>> need to grab the namespace lock.
>>> 
>>> --matt
>>> 
>>> On Thu, Oct 16, 2014 at 12:05 PM, Arne Jansen via illumos-developer
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> 
>>>    In our setup zpool import takes a very long time (about 15 minutes) due
>>>    to a large number of filesystems (> 5000).
>>>    During this time, you can see the disks at only a few percent busy in
>>>    iostat, while the pool is at 100%, which indicates a single-threaded
>>>    import process.
>>>    What's causing this is the checking of the ZIL chains for each filesystem.
>>>    Each filesystem is visited twice, once to check the chain and once
>>>    to claim it. This is done sequentially.
>>>    I've created a proof-of-concept patch that switches this process to a
>>>    parallel enumeration using a taskq.
>>>    In my setup this speeds up the import process by a factor of 20,
>>>    bringing all disks (30 in my case) to nearly 100% busy.
>>>    There's only one problem: locking. spa_open is called with
>>>    spa_namespace_lock held. With this held, zil_check_log_chain and
>>>    zil_claim are called later on. These in turn call spa_open, which tries
>>>    to take spa_namespace_lock again. This problem is not new. The current
>>>    solution looks like this:
>>> 
>>>    /*
>>>     * As disgusting as this is, we need to support recursive calls to this
>>>     * function because dsl_dir_open() is called during spa_load(), and ends
>>>     * up calling spa_open() again.  The real fix is to figure out how to
>>>     * avoid dsl_dir_open() calling this in the first place.
>>>     */
>>>    if (mutex_owner(&spa_namespace_lock) != curthread) {
>>>            mutex_enter(&spa_namespace_lock);
>>>            locked = B_TRUE;
>>>    }
>>> 
>>>    This doesn't work anymore when I call zil_claim/zil_check_log_chain
>>>    through a taskq and thus from a different thread. My current hacky
>>>    solution is to pass a flag through the call chain to spa_open to signal
>>>    that the lock isn't needed. It works, but it doesn't look too nice...
>>> 
>>>    My question now is how to build a cleaner solution. The snippet above
>>>    is already ugly in itself, so it would be good to get rid of it
>>>    altogether. The functions involved in the call chains are all very
>>>    commonly used. Changing their signatures will probably result in a
>>>    rather bulky patch, so I'd prefer to find a small, unintrusive approach.
>>> 
>>>    The patch currently looks like this:
>>> 
>>>    http://cr.illumos.org/~webrev/sensille/find_parallel/
>>> 
>>>    Hopefully you can give me some hints on how to solve this recursive
>>>    locking riddle...
>>> 
>>>    -Arne
>>> 
>>> 
>>> 
>>>    -------------------------------------------
>>>    illumos-developer
>>>    Archives: https://www.listbox.com/member/archive/182179/=now
>>>    RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175174-cd73734d
>>>    Modify Your Subscription: https://www.listbox.com/member/?&id_secret=21175174-792643f6
>>>    Powered by Listbox: http://www.listbox.com
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
