>  Tridge found the (already noted) related bug on our system and conceded it
> was a design flaw. Apparently each new smbd process that starts, does a
> quick traversal of the tdb databases to clean out any stale entries, and on
> Solaris, these are taking too long. 

        I've found a bunch of fixed bugs on fcntl performance,
        implying it's been even slower in the past (:-))

>Ok - discussed this with Andrew last night. It seems that this is only
>a problem on Solaris. Solaris seems to have *serious* issues with fcntl
>locks with multiple processes contending for locks. No other system we
>run on seems to have this problem (they have their own problems :-).

        At the expense of not addressing the Sun side of the
        problem, might I suggest that validation operations
        shouldn't lock?  
        
        Throwing my mind into a past life with safety-critical  
        real-time, I opine that the check without locks will
        1) succeed in bounded time dependent on the number 
                of structures traversed & checked
        2) fail because the structures are invalid (in this case
                stale) in bounded time, at which point one
                chooses to take a lock and remove them.
        3) fail in bounded time because the structures were
                changed by a program using locking, and the 
                non-locked program is seeing changing data. 
                In this case we elect to try to take a lock,
                fail because it's already held, wait interminably
                for it to complete, get the lock, and
                a) find it's done and exit
                b) find it still needs to be done and do it.
        The third is interesting because the other threads or
        processes are delaying us some amount before we get to
        do any work.  This, you might imagine, is a problem when
        you try to demonstrate correctness within time limits (;-))
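
        A minimal sketch of the three cases above, not taken from
        the Samba or tdb code (which I haven't read). The names
        check_fn, clean_fn, toy_db and validate_then_clean are
        hypothetical, and the "database" is a toy counter standing
        in for the real structures. The check runs unlocked first;
        the lock is taken only if the check fails, and the check is
        repeated under the lock to tell case 3a from 3b.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical callback types for illustration; not part of tdb or Samba. */
typedef int (*check_fn)(void *ctx);   /* 1 if the data looks valid, 0 if not */
typedef void (*clean_fn)(void *ctx);  /* removes stale entries, lock held */

/* Toy stand-in for the database: just a count of stale entries. */
struct toy_db { int stale; };
static int toy_check(void *ctx) { return ((struct toy_db *)ctx)->stale == 0; }
static void toy_clean(void *ctx) { ((struct toy_db *)ctx)->stale = 0; }

/* Apply a whole-file fcntl lock operation (l_len == 0 means "to EOF"). */
static int set_lock(int fd, short type, int cmd)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, cmd, &fl);
}

/* Returns 0 if nothing needed doing, 1 if we cleaned up, -1 on lock error. */
static int validate_then_clean(int fd, check_fn check, clean_fn clean, void *ctx)
{
    int cleaned = 0;

    /* Case 1: the unlocked check succeeds, so no lock is ever taken. */
    if (check(ctx))
        return 0;

    /* Cases 2 and 3: the check failed, either because entries are stale
     * or because a locking writer changed the data under us. Only now do
     * we wait for the lock. F_SETLKW may also fail with EINTR; this
     * sketch treats that as an error rather than retrying. */
    if (set_lock(fd, F_WRLCK, F_SETLKW) != 0)
        return -1;

    /* Re-check under the lock: someone else may have finished the job
     * (case 3a), or it may still need doing (cases 2 and 3b). */
    if (!check(ctx)) {
        clean(ctx);
        cleaned = 1;
    }

    if (set_lock(fd, F_UNLCK, F_SETLK) != 0)
        return -1;
    return cleaned;
}
```

        With a toy_db holding stale entries, the first call to
        validate_then_clean(fd, toy_check, toy_clean, &db) takes
        the lock and cleans; a second call returns without
        touching fcntl at all.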

        I haven't looked at the code, but if it uses F_SETLKW
        you might want to do a trylock first, implemented via
        F_GETLK or F_SETLK, as this would allow subsequent
        processes to continue, knowing that someone's fixing
        the tdb, and that they can access it later using the
        normal locking regime.
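
        For what it's worth, a trylock along those lines might
        look like the sketch below. It assumes a single write
        lock over the whole file, which may not match how tdb
        actually lays out its locks; try_write_lock and
        release_lock are names I made up. F_SETLK returns at
        once instead of sleeping the way F_SETLKW does.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take an exclusive lock on the whole file without blocking.
 * Returns 1 if we got the lock, 0 if another process holds a
 * conflicting lock, and -1 on any other error. */
static int try_write_lock(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 1;

    /* POSIX allows either errno value when the lock is already held. */
    if (errno == EACCES || errno == EAGAIN)
        return 0;

    return -1;
}

/* Release the whole-file lock taken above. Returns 0 on success. */
static int release_lock(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

        A process that gets 0 back can assume someone else is
        already doing the cleanup, skip its own traversal, and
        carry on. Note that fcntl locks belong to the process,
        so a second attempt from the same process will succeed
        rather than report contention.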

> >
> >Dave CB - can you investigate this within Sun please. This is a *critical*
> >part of Samba, we may have to look into a solaris-specific workaround and
> >this would be bad.

        Bad is an understatement...
 
--dave
-- 
David Collier-Brown,           | Always do right. This will gratify 
Performance & Engineering      | some people and astonish the rest.
Americas Customer Engineering, |                      -- Mark Twain
(905) 415-2849                 | [EMAIL PROTECTED]
