On 14/10/2014 18:51, Lionel Bouton wrote:
> On 14/10/2014 18:17, Gregory Farnum wrote:
>> On Monday, October 13, 2014, Lionel Bouton <lionel+c...@bouton.name> wrote:
>>
>>     [...]
>>
>>     What could explain such long startup times? Is the OSD init
>>     doing a lot of random disk accesses? Is it dependent on the
>>     volume of data or on the history of the OSD (fragmentation?)?
>>     Maybe Btrfs on 3.12.21 has known performance problems or
>>     suboptimal autodefrag (on 3.17.0 with 1/3 the data and a similar
>>     history of disk accesses we have 1/10 the init time when the
>>     disks are idle in both cases)?
>>
>>
>> Something like this is my guess; we've historically seen btrfs
>> performance rapidly degrade under our workloads. And I imagine that
>> your single-disk OSDs are only seeing 100 or so PGs each?
>
> Yes.

In fact it's ~200 rather than ~600 on the big nodes in the current
configuration, so it's in the same ballpark as your estimate.

>
>> You could perhaps turn up OSD and FileStore debugging on one of your
>> big nodes and one of the little ones and do a restart and compare the
>> syscall wait times between them to check.
>> -Greg
>>

In the logs the big OSD is osd.0. The small one is osd.2. I'll call them
BIG and SMALL in the following.
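
For reference, if someone wants to reproduce this, something along
these lines in ceph.conf should produce the same kind of output (the
exact levels are my choice; 10 is already enough for the messages
quoted below, and injectargs doesn't help here since the interesting
part is the startup itself):

[osd]
  # raise OSD / FileStore logging for the next restart
  debug osd = 10
  debug filestore = 10

then restart the OSD.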

Results are interesting. First, BIG takes ~8 s just to roll back to the
most recent Btrfs snapshot:

2014-10-14 19:04:56.290936 7f8b84516780 10
filestore(/var/lib/ceph/osd/ceph-0)  most recent snap from
<23107688,23107701> is 23107701
2014-10-14 19:04:56.290949 7f8b84516780 10
filestore(/var/lib/ceph/osd/ceph-0) mount rolling back to consistent
snap 23107701
2014-10-14 19:04:56.290955 7f8b84516780 10
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-0) rollback_to: to
'snap_23107701'
2014-10-14 19:05:04.379241 7f8b84516780  5
filestore(/var/lib/ceph/osd/ceph-0) mount op_seq is 23107701

This takes less than 1 s on SMALL:

2014-10-14 19:02:19.596158 7fd0a0d2b780 10
filestore(/var/lib/ceph/osd/ceph-2)  most recent snap from
<1602633,1602645> is 1602645 
2014-10-14 19:02:19.596176 7fd0a0d2b780 10
filestore(/var/lib/ceph/osd/ceph-2) mount rolling back to consistent
snap 1602645
2014-10-14 19:02:19.596182 7fd0a0d2b780 10
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-2) rollback_to: to
'snap_1602645' 
2014-10-14 19:02:20.311178 7fd0a0d2b780  5
filestore(/var/lib/ceph/osd/ceph-2) mount op_seq is 1602645

I assume this is mostly the time Btrfs itself takes to roll back to and
mount the snapshot.
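
If needed this could probably be checked outside of Ceph: with the OSD
stopped, timing a throwaway snapshot of the FileStore 'current'
subvolume should show whether Btrfs alone accounts for the delay (the
path is BIG's data dir; the snapshot name is arbitrary, just avoid the
snap_* pattern the OSD uses and delete it before restarting the OSD):

cd /var/lib/ceph/osd/ceph-0
# time creating and removing a snapshot of the 'current' subvolume
time btrfs subvolume snapshot current test_snap
time btrfs subvolume delete test_snap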

This behavior (slow Btrfs snapshot operations) seems to repeat itself:
the first checkpoint creation (which seems to involve a Btrfs snapshot)
takes more than 15 s on BIG:
2014-10-14 19:05:23.135960 7f8b73fff700 10
filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 15.503795,
interval was 15.866245
Less than 0.3 s on SMALL:
2014-10-14 19:02:21.889467 7fd094be4700 10
filestore(/var/lib/ceph/osd/ceph-2) sync_entry commit took 0.135268,
interval was 0.276440

Subsequent checkpoints on BIG are much faster though; here is the next one:
2014-10-14 19:05:28.426263 7f8b73fff700 10
filestore(/var/lib/ceph/osd/ceph-0) sync_entry commit took 0.969734,
interval was 0.979822
and they seem to converge towards ~0.25 s later on.

SMALL seems to converge towards ~0.07 s (it holds ~1/3 of the data, and
probably of the internal data structures, so if Btrfs snapshots are
supposed to be O(n) operations this could be expected).
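
For the record, the per-commit figures can be extracted from the OSD
logs with something like this (the log path is the default one, adjust
as needed):

# print every 'commit took' value logged by osd.0
sed -n 's/.*sync_entry commit took \([0-9.]*\),.*/\1/p' \
    /var/log/ceph/ceph-osd.0.log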

I couldn't find other significant differences: the different phases I
identified in the OSD init process took more time on BIG but never more
than ~3x compared to SMALL.

It seems most of the time is spent creating or accessing snapshots. My
current best guess is that Btrfs snapshot operations may have seen
significant speedups between 3.12.21 and 3.17.0, and that OSD init is
checkpoint- (i.e. snapshot-) intensive, which would account for most of
the slow startup.

Current plan: wait at least a week to study 3.17.0 behavior and upgrade
the 3.12.21 nodes to 3.17.0 if all goes well.

Best regards,

Lionel Bouton