Re: [gpfsug-discuss] CTDB woes

Orlando Richards Mon, 15 Apr 2013 02:55:00 -0700

On 12/04/13 19:44, Vic Cornell wrote:

Have you tried putting the ctdb files onto a separate gpfs filesystem?

No - but considered it. However, the only "live" CTDB file that sits on GPFS is the reclock file, which - I think - is only used as the heartbeat between nodes and for the recovery process. Now, there's mileage in insulating that, certainly, but I don't think that's what we're suffering from here.

On a positive note - we took the steps this morning to re-initialise the ctdb databases from current data, and things seem to be stable today so far.


Basically - shut down ctdb on all but one node. On all but that node, do:
mv /var/ctdb/ /var/ctdb.save.date

then start up ctdb on those nodes. Once they've come up, shut down ctdb on the last node, move /var/ctdb out the way, and restart. That brings them all up with freshly compacted databases.
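
A rough sketch of that sequence as a dry run, which only prints the commands rather than running them. The ssh access, the "service ctdb stop|start" commands, the node names and the date-stamp format are all assumptions for illustration, not details from our setup:

```shell
#!/bin/sh
# Dry run: print the rolling re-initialisation steps without executing them.
# Assumptions (not from the thread): nodes are reachable by ssh, ctdb is
# controlled with "service ctdb stop|start", node names are placeholders.

plan_reinit() {
    # $1 = node kept running during the first phase; remaining args = other nodes
    last=$1
    shift
    stamp=$(date +%Y%m%d)   # stands in for the "date" suffix in the mv above

    # Phase 1: stop ctdb on every node except $last ...
    for n in "$@"; do
        echo "ssh $n service ctdb stop"
    done
    # ... then move the old databases aside and start ctdb again on those nodes.
    for n in "$@"; do
        echo "ssh $n mv /var/ctdb /var/ctdb.save.$stamp"
        echo "ssh $n service ctdb start"
    done

    # Phase 2: once those nodes have come up, do the same on the last node.
    echo "# wait until the restarted nodes have come up before continuing"
    echo "ssh $last service ctdb stop"
    echo "ssh $last mv /var/ctdb /var/ctdb.save.$stamp"
    echo "ssh $last service ctdb start"
}

plan_reinit node1 node2 node3
```

Printing the plan first makes it easy to check the ordering (the last node is only touched after the others are back) before running anything by hand.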

Also, from the samba-technical mailing list came the advice to use a more recent ctdb - specifically, 1.2.61. I've got that built and ready to go (and a rebuilt samba compiled against it too), but if things prove to be stable after today's compacting, then we will probably leave it at that and not deploy this.

Interesting that 2.0 wasn't suggested for "stable", and that the current "dev" version is 2.1.


For reference, here's the start of the thread:
https://lists.samba.org/archive/samba-technical/2013-April/091525.html

--
Orlando.


On 12 Apr 2013, at 16:43, Orlando Richards <[email protected]> wrote:

On 12/04/13 15:43, Bob Cregan wrote:

Hi Orlando,
We use ctdb/samba for CIFS, and CNFS for NFS
(GPFS version 3.4.0-13). Current versions are

ctdb - 1.0.99
samba 3.5.15

Both compiled from source. We have about 300+ users normally.


We have suspicions that 3.6 has put additional "chatter" into the ctdb database 
stream, which has pushed us over the edge. Barry Evans has found that the clustered 
locking databases, in particular, prove to be a scalability/usability limit for ctdb.

We have had no issues with this setup apart from CNFS, which had 2 or 3
bad moments over the last year. These have gone away since we
fixed a bug with our 10G NIC drivers (emulex cards, kernel module
be2net) which led to occasional dropped packets for jumbo frames. There
have been no issues with samba/ctdb.

The only comment I can make is that during initial investigations into
an upgrade of samba to 3.6.x we discovered that the 3.6 code would not
compile against ctdb 1.0.99 (compilation requires the ctdb source)
with error messages like:

configure: checking whether cluster support is available
checking for ctdb.h... yes
checking for ctdb_private.h... yes
checking for CTDB_CONTROL_TRANS3_COMMIT declaration... yes
checking for CTDB_CONTROL_SCHEDULE_FOR_DELETION declaration... no
configure: error: "cluster support not available: support for SCHEDULE_FOR_DELETION control missing"
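
For anyone wanting to check a ctdb source tree before attempting the samba build, something along these lines might do. It is only a sketch: it assumes the control shows up as a plain identifier in a header under the tree's include/ directory, and the CTDB_SRC variable and default path are made up for the example:

```shell
#!/bin/sh
# Report whether a ctdb source tree declares a given control.
# Assumption: the declaration appears as a plain identifier in a header
# under the tree's include/ directory; the path layout is hypothetical.

has_ctdb_control() {
    # $1 = ctdb source directory, $2 = control name
    # -r recurse, -q quiet, -s suppress errors for missing paths
    grep -rqs -- "$2" "$1/include"
}

# CTDB_SRC is a hypothetical variable pointing at the unpacked ctdb source.
if has_ctdb_control "${CTDB_SRC:-./ctdb}" CTDB_CONTROL_SCHEDULE_FOR_DELETION; then
    echo "SCHEDULE_FOR_DELETION control found"
else
    echo "SCHEDULE_FOR_DELETION control missing"
fi
```

This is a plain text search, not the same test configure runs, so treat a positive result as a hint rather than a guarantee that the build will pass.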


What occurs to me is that this message seems to indicate that it is
possible to run a ctdb version that is incompatible with samba 3.6.
That would imply that an upgrade to a higher version of ctdb might
help; of course, it might not, and it could make backing out harder.


Certainly 1.0.114 builds fine - I've not tried 2.0, I'm too scared! The 
versioning in CTDB has proved hard for me to fathom...


A compile against ctdb 2.0 works fine. We will soon be running in this
upgrade, but I'm waiting to see what the samba  people say at the UG
meeting first!


It has to be said - the timing is good!
Cheers,
Orlando


Thanks

Bob


On 12 April 2013 13:37, Orlando Richards <[email protected]> wrote:

    Hi folks,

    We've long been using CTDB and Samba for our NAS service, servicing
    ~500 users. We've been suffering from some problems with the CTDB
    performance over the last few weeks, likely triggered either by an
    upgrade of samba from 3.5 to 3.6 (and enabling of SMB2 as a result),
    or possibly by additional users coming on with a new workload.

    We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again,
    from sernet). Before we roll back, we'd like to make sure we can't
    fix the problem and stick with Samba 3.6 (and we don't even know
    that a roll back would fix the issue).

    The symptoms are a complete freeze of the service for CIFS users for
    10-60 seconds, and on the servers a corresponding spawning of large
    numbers of CTDB processes, which seem to be created in a "big bang",
    and then do what they do and exit in the subsequent 10-60 seconds.

    We also serve up NFS from the same ctdb-managed frontends, and GPFS
    from the cluster - and these are both fine throughout.

    This was happening 5-10 times per hour, not at exact intervals
    though. When we added a third node to the CTDB cluster, it "got
    worse", and when we dropped the CTDB cluster down to a single node,
    everything started behaving fine - which is where we are now.

    So, I've got a bunch of questions!

      - does anyone know why ctdb would be spawning these processes, and
    if there's anything we can do to stop it needing to do it?
      - has anyone done any more general performance / config
    optimisation of CTDB?

    And - more generally - does anyone else actually use ctdb/samba/gpfs
    on the scale of ~500 users or higher? If so - how do you find it?


    --
                 --
        Dr Orlando Richards
       Information Services
    IT Infrastructure Division
            Unix Section
         Tel: 0131 650 4994

    The University of Edinburgh is a charitable body, registered in
    Scotland, with registration number SC005336.
    _________________________________________________
    gpfsug-discuss mailing list
    [email protected]
    http://gpfsug.org/mailman/listinfo/gpfsug-discuss




--

Bob Cregan

Senior Storage Systems Administrator

ACRC

Bristol University

Tel:     +44 (0) 117 331 4406

skype:  bobcregan

Mobile: +44 (0) 7712388129



