On Mon, 2015-11-30 at 18:58 -0600, Mark Nelson wrote:
> Hi Laurent,
>
> Wow, that's excessive! I'd see if anyone else has any tricks first, but
> if nothing else helps, running an OSD through valgrind with massif will
> probably help pinpoint what's going on. Have you tweaked the recovery
> tunables at all?
> Oh, forgot to ask, any core dumps?
> Mark
Hi Mark,
The only options we've touched are:
"osd_max_backfills": "1",
"osd_recovery_max_active": "1",
"osd_recovery_op_priority": "1",
Plus all the "no*" flags shown in the ceph -s output below.
Do you have other options in mind that we could tweak?
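For reference, here are the same three values restated as a ceph.conf
fragment (nothing new, just the settings above; they can also be changed
at runtime with "ceph tell osd.* injectargs"):

```
[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```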
We have no core dump yet; Mehdi is trying a heap dump on some OSDs.
Looking at "ceph tell osd.X heap stats" on the OSDs, there is almost
nothing in the freelists: everything is "in use". Here is an OSD using
19 GB of RAM for a 2 TB disk:
25118 root 20 0 19.030g 0.014t 2568 S 38.2 46.4 8:43.55 /usr/bin/ceph-osd --cluster=ceph -i 2 -f
/dev/sdb1 1.9T 1.6T 271G 86% /var/lib/ceph/osd/ceph-2
root@g3:~# ceph tell osd.2 heap stats
osd.2 tcmalloc heap stats:------------------------------------------------
MALLOC: 18568498424 (17708.3 MiB) Bytes in use by application
MALLOC: + 189464576 ( 180.7 MiB) Bytes in page heap freelist
MALLOC: + 210782296 ( 201.0 MiB) Bytes in central cache freelist
MALLOC: + 4416048 ( 4.2 MiB) Bytes in transfer cache freelist
MALLOC: + 29157504 ( 27.8 MiB) Bytes in thread cache freelists
MALLOC: + 80457888 ( 76.7 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 19082776736 (18198.8 MiB) Actual memory used (physical + swap)
MALLOC: + 393216 ( 0.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 19083169952 (18199.1 MiB) Virtual address space used
MALLOC:
MALLOC: 1150671 Spans in use
MALLOC: 342 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.
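To put numbers on the "nothing in freelist" observation, the MALLOC lines
above can be parsed mechanically. A minimal sketch (parse_heap_stats is a
throwaway helper written for this email, fed the osd.2 figures quoted above):

```python
# Sketch: parse tcmalloc "heap stats" output and compare the bytes
# owned by the application against the bytes sitting in freelists.
import re

# The osd.2 numbers quoted above.
SAMPLE = """\
MALLOC:    18568498424 (17708.3 MiB) Bytes in use by application
MALLOC: +    189464576 (  180.7 MiB) Bytes in page heap freelist
MALLOC: +    210782296 (  201.0 MiB) Bytes in central cache freelist
MALLOC: +      4416048 (    4.2 MiB) Bytes in transfer cache freelist
MALLOC: +     29157504 (   27.8 MiB) Bytes in thread cache freelists
MALLOC: +     80457888 (   76.7 MiB) Bytes in malloc metadata
"""

def parse_heap_stats(text):
    """Map each 'Bytes in <category>' line to its byte count."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"MALLOC:\s*\+?\s*(\d+)\s*\([^)]*\)\s*Bytes in (.+)", line)
        if m:
            stats[m.group(2).strip()] = int(m.group(1))
    return stats

stats = parse_heap_stats(SAMPLE)
in_use = stats["use by application"]
freelists = sum(v for k, v in stats.items() if "freelist" in k)
print(f"application: {in_use / 2**20:.0f} MiB, "
      f"freelists: {freelists / 2**20:.0f} MiB "
      f"({100 * freelists / in_use:.1f}%)")
# → application: 17708 MiB, freelists: 414 MiB (2.3%)
```

So tcmalloc is holding back only about 2% of what the process has mapped;
the other ~17.7 GiB really is live application memory, which is why a
"heap release" would not help much here.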
The log since the latest restart of the 19 GB OSD is below, followed by
ceph -s and ceph osd tree (we are rarely below two OSDs down; it
oscillates between 2 and 10 down).
Sincerely,
Laurent
2015-12-01 08:32:13.928973 7f0554016900 0 ceph version 0.94.5-164-gbf9e1b6
(bf9e1b6307692fdc50465a64590d83e3d7015c9d), process ceph-osd, pid 25118
2015-12-01 08:32:13.939219 7f0554016900 0 filestore(/var/lib/ceph/osd/ceph-2)
backend xfs (magic 0x58465342)
2015-12-01 08:32:14.071146 7f0554016900 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl
is supported and appears to work
2015-12-01 08:32:14.071163 7f0554016900 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl
is disabled via 'filestore fiemap' config option
2015-12-01 08:32:14.186866 7f0554016900 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: syncfs(2)
syscall fully supported (by glibc and kernel)
2015-12-01 08:32:14.186996 7f0554016900 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature: extsize is
supported and kernel 3.19.0-32-generic >= 3.5
2015-12-01 08:32:15.649088 7f0554016900 0 filestore(/var/lib/ceph/osd/ceph-2)
mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-12-01 08:32:29.476440 7f0554016900 1 journal _open
/var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-12-01 08:32:29.540794 7f0554016900 1 journal _open
/var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-12-01 08:32:29.597359 7f0554016900 0 <cls> cls/hello/cls_hello.cc:271:
loading cls_hello
2015-12-01 08:32:29.623116 7f0554016900 0 osd.2 214588 crush map has features
2303210029056, adjusting msgr requires for clients
2015-12-01 08:32:29.623132 7f0554016900 0 osd.2 214588 crush map has features
2578087936000 was 8705, adjusting msgr requires for mons
2015-12-01 08:32:29.623138 7f0554016900 0 osd.2 214588 crush map has features
2578087936000, adjusting msgr requires for osds
2015-12-01 08:32:29.623158 7f0554016900 0 osd.2 214588 load_pgs
2015-12-01 08:34:44.563581 7f0554016900 0 osd.2 214588 load_pgs opened 1496 pgs
2015-12-01 08:34:44.564545 7f0546afe700 0 -- 192.168.99.253:6811/25118 >> :/0
pipe(0xad4e000 sd=27 :6811 s=0 pgs=0 cs=0 l=0 c=0xb6730000).accept failed to
getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.564600 7f05466bf700 0 -- 192.168.99.253:6811/25118 >> :/0
pipe(0x893c000 sd=32 :6811 s=0 pgs=0 cs=0 l=0 c=0xb6730420).accept failed to
getpeername (107) Transport endpoint is not connected
2015-12-01 08:34:44.577168 7f0554016900 -1 osd.2 214588 log_to_monitors
{default=true}
2015-12-01 08:34:44.579147 7f052b4ff700 0 osd.2 214588 ignoring osdmap until
we have initialized
2015-12-01 08:34:44.579810 7f052b4ff700 0 osd.2 214588 ignoring osdmap until
we have initialized
2015-12-01 08:34:48.054636 7f05469fd700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.243:6802/8218 pipe(0x6c82000 sd=30 :38619 s=1 pgs=0 cs=0 l=0
c=0xb6730b00).connect claims to be 192.168.99.243:6802/14163 not
192.168.99.243:6802/8218 - wrong node!
2015-12-01 08:34:48.054691 7f05469fd700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.243:6802/8218 pipe(0x6c82000 sd=30 :38619 s=1 pgs=0 cs=0 l=0
c=0xb6730b00).fault with nothing to send, going to standby
2015-12-01 08:34:48.056487 7f053b9fa700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.251:6802/31468 pipe(0x681a000 sd=32 :0 s=1 pgs=0 cs=0 l=0
c=0xb6730dc0).fault with nothing to send, going to standby
2015-12-01 08:34:49.148364 7f0544b56700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.238:6816/1031547 pipe(0x653d000 sd=37 :0 s=1 pgs=0 cs=0 l=0
c=0xb6732ec0).fault with nothing to send, going to standby
2015-12-01 08:34:49.149146 7f053c52f700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.245:6809/17836 pipe(0x8489000 sd=38 :56560 s=1 pgs=0 cs=0 l=0
c=0xb6733020).connect claims to be 192.168.99.245:6809/29285 not
192.168.99.245:6809/17836 - wrong node!
2015-12-01 08:34:49.149228 7f053c52f700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.245:6809/17836 pipe(0x8489000 sd=38 :56560 s=1 pgs=0 cs=0 l=0
c=0xb6733020).fault with nothing to send, going to standby
2015-12-01 08:34:49.151638 7f0546afe700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.251:6805/2025832 pipe(0x9397000 sd=39 :0 s=1 pgs=0 cs=0 l=0
c=0xb88fbfa0).fault with nothing to send, going to standby
2015-12-01 08:34:49.152571 7f05466bf700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.244:6807/17661 pipe(0x9b0a000 sd=40 :0 s=1 pgs=0 cs=0 l=0
c=0xb88fcc00).fault with nothing to send, going to standby
2015-12-01 08:34:49.153592 7f053c630700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.243:6811/2013633 pipe(0x6379000 sd=41 :59009 s=1 pgs=0 cs=0 l=0
c=0xb88fd860).connect claims to be 192.168.99.243:6811/14225 not
192.168.99.243:6811/2013633 - wrong node!
2015-12-01 08:34:49.153639 7f053c630700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.243:6811/2013633 pipe(0x6379000 sd=41 :59009 s=1 pgs=0 cs=0 l=0
c=0xb88fd860).fault with nothing to send, going to standby
2015-12-01 08:34:49.171878 7f053b8f9700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.238:6801/31478 pipe(0x55d7e000 sd=43 :0 s=1 pgs=0 cs=0 l=0
c=0xb895b080).fault with nothing to send, going to standby
2015-12-01 08:34:49.176018 7f053b5e8700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.237:6813/12186 pipe(0xb09f000 sd=48 :0 s=1 pgs=0 cs=0 l=0
c=0xb895d440).fault with nothing to send, going to standby
2015-12-01 08:34:49.183555 7f053b4e7700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.247:6803/1013467 pipe(0xafc7000 sd=49 :0 s=1 pgs=0 cs=0 l=0
c=0xb8981b20).fault with nothing to send, going to standby
2015-12-01 08:34:49.200305 7f053afcd700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.238:6809/1031512 pipe(0x994a000 sd=92 :0 s=1 pgs=0 cs=0 l=0
c=0xb891c940).fault with nothing to send, going to standby
2015-12-01 08:34:49.230606 7f053aecc700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.237:6805/1012083 pipe(0x57530000 sd=67 :0 s=1 pgs=0 cs=0 l=0
c=0xb8a2b700).fault with nothing to send, going to standby
2015-12-01 08:34:49.243292 7f053adcb700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.237:6809/12149 pipe(0x87ec000 sd=119 :0 s=1 pgs=0 cs=0 l=0
c=0xb8a29a20).fault with nothing to send, going to standby
2015-12-01 08:34:49.245789 7f053acca700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.247:6805/9657 pipe(0x7a27000 sd=145 :0 s=1 pgs=0 cs=0 l=0
c=0xb8a367e0).fault with nothing to send, going to standby
2015-12-01 08:34:49.253996 7f053abc9700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.237:6820/1012116 pipe(0xafc2000 sd=146 :0 s=1 pgs=0 cs=0 l=0
c=0xb898a6e0).fault with nothing to send, going to standby
2015-12-01 08:34:49.758783 7f053aac8700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.240:6801/20428 pipe(0x9595000 sd=100 :0 s=1 pgs=0 cs=0 l=0
c=0x272e3600).fault with nothing to send, going to standby
2015-12-01 08:34:59.347751 7f0554016900 0 osd.2 214588 done with init,
starting boot process
2015-12-01 08:36:23.792638 7f053b9fa700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.252:6808/9305 pipe(0xa18b000 sd=30 :45593 s=2 pgs=4644 cs=1 l=0
c=0x57699860).fault, initiating reconnect
2015-12-01 08:36:23.792864 7f05469fd700 0 -- 192.168.99.253:6811/25118 >>
192.168.99.252:6808/9305 pipe(0xa18b000 sd=30 :45593 s=1 pgs=4644 cs=2 l=0
c=0x57699860).fault
2015-12-01 08:36:23.795104 7f05138b0700 0 -- 192.168.99.253:0/25118 >>
192.168.99.252:6809/9305 pipe(0xb7b1d000 sd=133 :0 s=1 pgs=0 cs=0 l=1
c=0x38b475a0).fault
2015-12-01 08:36:23.795363 7f05137af700 0 -- 192.168.99.253:0/25118 >>
192.168.99.252:6810/9305 pipe(0x14a9f0000 sd=137 :0 s=1 pgs=0 cs=0 l=1
c=0x38b47de0).fault
2015-12-01 08:36:39.271805 7f054958e700 -1 osd.2 215024 heartbeat_check: no
reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01
08:36:18.990748 (cutoff 2015-12-01 08:36:19.271805)
2015-12-01 08:36:40.096838 7f051acde700 -1 osd.2 215025 heartbeat_check: no
reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01
08:36:18.990748 (cutoff 2015-12-01 08:36:20.096837)
2015-12-01 08:36:40.273052 7f054958e700 -1 osd.2 215025 heartbeat_check: no
reply from osd.1 since back 2015-12-01 08:36:18.990748 front 2015-12-01
08:36:18.990748 (cutoff 2015-12-01 08:36:20.273051)
# ceph -s
cluster 1fe74663-8dfa-486c-bb80-3bd94c90c967
health HEALTH_WARN
210 pgs backfill
7 pgs backfilling
408 pgs degraded
3831 pgs down
7901 pgs peering
5 pgs recovering
28 pgs recovery_wait
111 pgs stale
251 pgs stuck degraded
8620 pgs stuck inactive
3 pgs stuck stale
8975 pgs stuck unclean
218 pgs stuck undersized
366 pgs undersized
267 requests are blocked > 32 sec
recovery 195827/78807007 objects degraded (0.248%)
recovery 340087/78807007 objects misplaced (0.432%)
recovery 30993/12140933 unfound (0.255%)
1/49 in osds are down
noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,notieragent
flag(s) set
monmap e8: 3 mons at
{g1=192.168.99.251:6789/0,g2=192.168.99.252:6789/0,g3=192.168.99.253:6789/0}
election epoch 1374, quorum 0,1,2 g1,g2,g3
osdmap e215938: 50 osds: 48 up, 49 in; 1826 remapped pgs
flags
noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub,notieragent
pgmap v35383395: 11264 pgs, 6 pools, 39368 GB data, 11856 kobjects
66149 GB used, 30846 GB / 96995 GB avail
195827/78807007 objects degraded (0.248%)
340087/78807007 objects misplaced (0.432%)
30993/12140933 unfound (0.255%)
3675 down+peering
2868 peering
2282 active+clean
1169 remapped+peering
382 activating
179 activating+remapped
101 active+remapped+wait_backfill
95 down+remapped+peering
93 active+undersized+degraded+remapped+wait_backfill
88 activating+undersized+degraded+remapped
78 active+undersized+degraded
60 activating+undersized+degraded
56 stale+down+peering
21 stale+remapped+peering
20 active+recovery_wait+degraded
16 undersized+degraded+remapped+wait_backfill+peered
12 stale+peering
10 undersized+degraded+peered
9 activating+degraded
7 activating+degraded+remapped
6 stale+activating
5 stale+down+remapped+peering
4 active+recovering+degraded
3 active+recovery_wait+undersized+degraded+remapped
3 active+undersized+degraded+remapped+backfilling
3 stale+active+undersized+degraded
3 stale+activating+undersized+degraded
2 stale+activating+undersized+degraded+remapped
2 undersized+degraded+remapped+backfilling+peered
2 active+remapped+backfilling
2 active+recovery_wait+degraded+remapped
2 active+recovery_wait+undersized+degraded
1 recovering+undersized+degraded+peered
1 recovery_wait+undersized+degraded+remapped+peered
1 stale+activating+remapped
1 remapped
1 stale+active+clean
1 stale+undersized+degraded+peered
recovery io 357 MB/s, 94 objects/s
client io 5206 MB/s rd, 3488 MB/s wr, 78940 op/s
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-5 4.17094 root ssd
-6 0.40300 host g1-ssd
42 0.40300 osd.42 up 1.00000 1.00000
-7 0.43599 host g2-ssd
4 0.21399 osd.4 up 1.00000 1.00000
7 0.22198 osd.7 up 1.00000 1.00000
-8 0.22198 host g3-ssd
6 0.22198 osd.6 up 1.00000 1.00000
-10 0.40900 host n7-ssd
14 0.40900 osd.14 up 1.00000 1.00000
-11 0.20900 host stri-ssd
12 0.20900 osd.12 up 1.00000 1.00000
-14 0.20799 host g4-ssd
19 0.20799 osd.19 up 1.00000 1.00000
-16 0.20900 host g5-ssd
5 0.20900 osd.5 up 1.00000 1.00000
-17 0.20900 host g6-ssd
10 0.20900 osd.10 up 1.00000 1.00000
-19 0.43100 host g8-ssd
31 0.20900 osd.31 up 1.00000 1.00000
32 0.22198 osd.32 up 1.00000 1.00000
-21 0.53899 host g9-ssd
34 0.44899 osd.34 up 1.00000 1.00000
36 0.09000 osd.36 up 1.00000 1.00000
-23 0.09000 host g10-ssd
40 0.09000 osd.40 up 1.00000 1.00000
-26 0.40300 host g11-ssd
50 0.40300 osd.50 up 1.00000 1.00000
-27 0.40300 host g12-ssd
49 0.40300 osd.49 up 1.00000 1.00000
-1 94.10190 root default
-2 6.25699 host g1
0 3.57599 osd.0 up 1.00000 1.00000
20 2.68100 osd.20 up 1.00000 1.00000
-3 5.36099 host g2
1 3.57599 osd.1 up 1.00000 1.00000
21 1.78499 osd.21 up 1.00000 1.00000
-4 5.38899 host g3
16 3.57599 osd.16 up 1.00000 1.00000
2 1.81299 osd.2 up 1.00000 1.00000
-9 6.25099 host n7
8 2.68100 osd.8 up 1.00000 1.00000
27 1.78499 osd.27 up 1.00000 1.00000
29 1.78499 osd.29 up 1.00000 1.00000
-12 5.36200 host stri
15 2.68100 osd.15 up 1.00000 1.00000
13 2.68100 osd.13 up 1.00000 1.00000
-13 14.30399 host g4
17 3.57599 osd.17 up 1.00000 1.00000
23 3.57599 osd.23 up 1.00000 1.00000
22 3.57599 osd.22 down 0 1.00000
18 3.57599 osd.18 up 1.00000 1.00000
-15 8.04300 host g5
25 2.68100 osd.25 up 1.00000 1.00000
24 2.68100 osd.24 up 1.00000 1.00000
26 2.68100 osd.26 down 1.00000 1.00000
-18 6.25699 host g6
28 2.68100 osd.28 up 1.00000 1.00000
9 3.57599 osd.9 up 1.00000 1.00000
-20 7.18799 host g8
30 4.50699 osd.30 up 1.00000 1.00000
41 2.68100 osd.41 up 1.00000 1.00000
-22 7.15199 host g9
35 3.57599 osd.35 up 1.00000 1.00000
33 3.57599 osd.33 up 1.00000 1.00000
-24 8.93799 host g10
39 3.57599 osd.39 up 1.00000 1.00000
38 2.68100 osd.38 up 1.00000 1.00000
37 2.68100 osd.37 up 1.00000 1.00000
-25 5.43900 host g11
3 1.81299 osd.3 up 1.00000 1.00000
43 1.81299 osd.43 up 1.00000 1.00000
44 1.81299 osd.44 up 1.00000 1.00000
-28 8.16100 host g12
45 1.81299 osd.45 up 1.00000 1.00000
46 1.81299 osd.46 up 1.00000 1.00000
47 2.72198 osd.47 up 1.00000 1.00000
48 1.81299 osd.48 up 1.00000 1.00000
> On 11/30/2015 06:52 PM, Laurent GUERBY wrote:
> > Hi,
> >
> > We lost a disk today in our ceph cluster, so we added a new machine with
> > 4 disks to replace the capacity, and we activated the straw1 tunable too
> > (we also tried straw2 but we quickly backed out this change).
> >
> > During recovery, OSDs started crashing on all of our machines,
> > the issue being OSD RAM usage that goes very high, e.g.:
> >
> > 24078 root 20 0 27.784g 0.026t 10888 S 5.9 84.9 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> > /dev/sda1 2.7T 2.2T 514G 82% /var/lib/ceph/osd/ceph-41
> >
> > That's about 8 GB of resident RAM per TB of disk, way above
> > what we provisioned (~2-4 GB RAM/TB).
> >
> > We rebuilt 0.94.5 with the three memory-related commits below, but
> > it didn't change anything.
> >
> > Right now our cluster is unable to fully restart and recover with the
> > machines and RAM we have been working with for the past year.
> >
> > Any idea on what to look for?
> >
> > Thanks in advance,
> >
> > Sincerely,
> >
> > Laurent
> >
> > commit 296bec72649884447b59e785c345c53994df9e09
> > Author: xiexingguo <[email protected]>
> > Date: Mon Oct 26 18:38:01 2015 +0800
> >
> > FileStore: potential memory leak if _fgetattrs fails
> >
> > Memory leak happens if _fgetattrs encounters some error and simply
> > returns.
> > Fixes: #13597
> > Signed-off-by: xie xingguo <[email protected]>
> >
> > (cherry picked from commit ace7dd096b58a88e25ce16f011aed09269f2a2b4)
> >
> > commit 16aa14ab0208df568e64e2a4f7fe7692eaf6b469
> > Author: Xinze Chi <[email protected]>
> > Date: Sun Aug 2 18:36:40 2015 +0800
> >
> > bug fix: osd: do not cache unused buffer in attrs
> >
> > attrs only reference the origin bufferlist (decode from MOSDPGPush
> > or
> > ECSubReadReply message) whose size is much greater than attrs in
> > recovery.
> > If obc cache it (get_obc maybe cache the attr), this causes the
> > whole origin
> > bufferlist would not be free until obc is evicted from obc cache. So
> > rebuild
> > the bufferlist before cache it.
> >
> > Fixes: #12565
> > Signed-off-by: Ning Yao <[email protected]>
> > Signed-off-by: Xinze Chi <[email protected]>
> > (cherry picked from commit c5895d3fad9da0ab7f05f134c49e22795d5c61f3)
> >
> > commit 51ea1ca7f4a7763bfeb110957cd8a6f33b8a1422
> > Author: xiexingguo <[email protected]>
> > Date: Thu Oct 29 20:04:11 2015 +0800
> >
> > Objecter: pool_op callback may hang forever.
> >
> > pool_op callback may hang forever due to osdmap update during reply
> > handling.
> > Fixes: #13642
> > Signed-off-by: xie xingguo <[email protected]>
> >
> > (cherry picked from commit 00c6fa9e31975a935ed2bb33a099e2b4f02ad7f2)
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >