On Monday, October 13, 2014, Lionel Bouton <lionel+c...@bouton.name> wrote:

> Hi,
>
> # First a short description of our Ceph setup
>
> You can skip to the next section ("Main questions") to save time and
> come back to this one if you need more context.
>
> We are currently moving away from DRBD-based storage backed by RAID
> arrays to Ceph for some of our VMs. Our focus is on resiliency and
> capacity (one VM was outgrowing the largest RAID10 we had), not on
> maximum performance (at least not yet). Our Ceph OSDs are fairly
> unbalanced: 2 of them are on 2 older hosts, each with 4 disks in a
> hardware RAID10 configuration and no room left for new disks in the
> chassis. 12 additional OSDs are on 2 new systems with 6 disk drives
> dedicated to one OSD each (CPU and RAM configurations are nearly
> identical on the 4 hosts). All hosts are used for running VMs too, so we
> took some precautions to avoid too much interference: each host has CPU
> and RAM to spare for the OSDs. CPU usage bursts on occasion, but as we
> only have one or two VMs on each host they can't starve the OSDs, which
> have between 2 and 8 full-fledged cores (4 to 16 hardware threads)
> available to them depending on the current load. We have at least 4GB of
> free RAM per OSD on each host at all times (including room for at least
> a 4GB OS cache).
> To sum up, we have a total of 14 OSDs; the 2 largest ones, on RAID10,
> are clearly our current bottleneck. That said, until we have additional
> hardware they allow us to maintain availability even if 2 servers are
> down (default crushmap with the pool configured for 3 replicas on 3
> different hosts) and performance is acceptable (backfilling/scrubbing/...
> of pgs required some tuning though, and I'm eagerly waiting for 0.80.7
> to begin testing the new io priority tunables).
> Everything is based on SATA/SAS 7200rpm disk drives behind P410 RAID
> controllers (HP ProLiant systems) with battery-backed memory to help
> with write bursts.
>
> The OSDs are a mix of:
> - Btrfs on 3.17.0 kernels on individual disks, 450GB used on 2TB
> (according to recent lkml posts, 3.17.0 fixes a filesystem lockup we hit
> with earlier kernels that manifested itself under concurrent accesses to
> several Btrfs filesystems),
> - Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used on
> 3TB (no lockup on these yet, but they will migrate to 3.17.0 when we
> have enough experience with it).
> - XFS for a minority of individual disks (with a dedicated partition for
> the journal).
> Most of them share the same history (all were created at the same time);
> only two were created later (following Btrfs corruption and/or
> conversion to XFS) and are left out when comparing behaviours.
>
> All Btrfs volumes use these mount options:
> rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery
> All OSDs use a 5GB journal.
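> For reference, this corresponds to mounting each OSD filesystem roughly
> like the following (device name and mount point are only illustrative,
> not our actual layout):
>
>   mount -o rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery \
>       /dev/sdb /var/lib/ceph/osd/ceph-1
>
> and to something like "osd journal size = 5120" (MB) in ceph.conf.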
>
> We are slowly adding monitoring to the setup to see what the benefits of
> Btrfs are in our case (ceph osd perf, kernel io wait per device, osd CPU
> usage, ...). One long-term objective is to slowly raise performance,
> both by migrating to / adding more suitable hardware and by tuning the
> software side. Detailed monitoring should help us study the behaviour of
> isolated OSDs with different settings and warn us early if they cause
> performance problems, so we can take them out with next to no impact on
> the whole storage network (we are strong believers in slow, incremental
> and continuous change, and distributed storage with redundancy makes
> that easy to implement).
>
> # Main questions
>
> The system works well, but I just realised when restarting one of the 2
> large Btrfs OSDs that it was very slow to rejoin the network ("ceph osd
> set noout" was used for the restart). I stopped the OSD init after 5
> minutes to investigate what was going on and didn't find any obvious
> problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not
> able to starve the system by itself, ...). The next restarts took
> between 43s (nearly no concurrent disk access and warm caches after an
> earlier restart without unmounting the filesystem) and 3min 57s (one VM
> still on DRBD doing ~30 IO/s on the same volume and cold caches after a
> filesystem mount).
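> (For clarity, the restart procedure itself is nothing special, roughly
> the following, with an illustrative osd id and init script:
>   ceph osd set noout
>   /etc/init.d/ceph restart osd.1
>   ceph osd unset noout
> The startup times quoted above are for the restart step in the middle.)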
>
> It seems that the startup time is getting longer on the 2 large Btrfs
> filesystems (the other one gives similar results: 3min 48s on the first
> try, for example). I had noticed it was a bit slow a week ago, but not
> this slow (there was ~half as much data on them at the time). OSDs on
> individual disks don't exhibit this problem (with warm caches, init
> finishes in ~4s on the small Btrfs volumes and ~3s on the XFS volumes),
> but they are on dedicated disks with less data.
>
> With warm caches most of the time is spent between:
> "osd.<n> <osdmap> load_pgs"
> "osd.<n> <osdmap> load_pgs opened <m> pgs"
> log lines in /var/log/ceph/ceph-osd.<n>.log (m is ~650 for both OSDs).
> So it seems most of the time is spent opening pgs.
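> The time is simply the delta between the timestamps of these two lines;
> something like the following shows the most recent pair (the osd id is
> just an example):
>
>   grep load_pgs /var/log/ceph/ceph-osd.1.log | tail -n 2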
>
> What could explain such long startup times? Is the OSD init doing a lot
> of random disk accesses? Is it dependent on the volume of data or the
> history of the OSD (fragmentation?)? Does Btrfs on 3.12.21 perhaps have
> known performance problems or suboptimal autodefrag (on 3.17.0, with 1/3
> the data and a similar history of disk accesses, we see 1/10 the init
> time with the disks idle in both cases)?


Something like this is my guess; we've historically seen btrfs performance
rapidly degrade under our workloads. And I imagine that your single-disk
OSDs are only seeing 100 or so PGs each?
You could perhaps turn up OSD and FileStore debugging on one of your big
nodes and one of the little ones, do a restart, and compare the syscall
wait times between them to check.
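In ceph.conf terms that would be something like this on the relevant
hosts before the restart (just a sketch; adjust the levels as needed):

  [osd]
      debug osd = 20
      debug filestore = 20

and then comparing the timestamps around the slow part of startup in the
resulting ceph-osd logs.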
-Greg



>
> # Init snap destroy errors on Btrfs question
>
> During each init of a Btrfs-backed OSD we get this kind of error in the
> ceph-osd logs (they always come in pairs like this at the very beginning
> of the phase where the OSD opens the pgs):
>
> 2014-10-13 23:54:44.143039 7fd4267fc700  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
> ioctl SNAP_DESTROY got (2) No such file or directory
> 2014-10-13 23:54:44.143087 7fd4267fc700 -1
> filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
> 'snap_21161231' got (2) No such file or directory
> 2014-10-13 23:54:44.266149 7fd4267fc700  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint:
> ioctl SNAP_DESTROY got (2) No such file or directory
> 2014-10-13 23:54:44.266189 7fd4267fc700 -1
> filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap
> 'snap_21161268' got (2) No such file or directory
>
> I suppose it is harmless (at least these OSDs don't show any other
> error/warning and have been restarted and their filesystems remounted on
> numerous occasions), but I'd like to be sure: is it?
>
> Best regards,
>
> Lionel Bouton


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
