Re: [ceph-users] Mimic osd fails to start.
Thanks for all the help. For some bizarre reason I had an empty host inside the default root. Once I dumped a "fake" OSD into it, everything started working.

On Mon, Aug 20, 2018 at 7:36 PM Daznis wrote:
>
> Hello,
>
> Medic shows everything fine. The whole cluster is on the latest mimic
> version. It was updated to mimic when the stable version of mimic was
> released, and recently it was updated to "ceph version 13.2.1
> (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)". For some
> reason one mgr service is running, but it's not connected to the
> cluster.
>
> Versions output:
>
> {
>     "mon": {
>         "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 3
>     },
>     "mgr": {
>         "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 2
>     },
>     "osd": {
>         "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 47
>     },
>     "mds": {},
>     "overall": {
>         "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 52
>     }
> }
>
> Medic output:
>
> === Starting remote check session
> Version: 1.0.4    Cluster Name: "ceph"
> Total hosts: [10]
> OSDs: 5    MONs: 3    Clients: 0
> MDSs: 0    RGWs: 0    MGRs: 2
>
> -- managers --
> mon03
> mon02
> mon01
>
> -- osds --
> node03
> node02
> node01
> node05
> node04
>
> -- mons --
> mon01
> mon03
> mon02
>
> 107 passed, on 11 hosts
>
> On Mon, Aug 20, 2018 at 6:13 PM Alfredo Deza wrote:
> >
> > On Mon, Aug 20, 2018 at 10:23 AM, Daznis wrote:
> > > Hello,
> > >
> > > It appears that something is horribly wrong with the cluster itself. I
> > > can't create or add any new osds to it at all.
> >
> > Have you added new monitors? Or replaced monitors? I would check that
> > all your versions match; something seems to be expecting different
> > versions.
> >
> > The "Invalid argument" problem is a common thing we see when that happens.
> > > > Something that might help a bit here is if you run ceph-medic against > > your cluster: > > > > http://docs.ceph.com/ceph-medic/master/ > > > > > > > > > On Mon, Aug 20, 2018 at 11:04 AM Daznis wrote: > > >> > > >> Hello, > > >> > > >> > > >> Zapping the journal didn't help. I tried to create the journal after > > >> zapping it. Also failed. I'm not really sure why this happens. > > >> > > >> Looking at the monitor logs with 20/20 debug I'm seeing these errors: > > >> > > >> 2018-08-20 08:57:58.753 7f9d85934700 0 mon.mon02@1(peon) e4 > > >> handle_command mon_command({"prefix": "osd crush set-device-class", > > >> "class": "ssd", "ids": ["48"]} v 0) v1 > > >> 2018-08-20 08:57:58.753 7f9d85934700 20 is_capable service=osd > > >> command=osd crush set-device-class read write on cap allow profile osd > > >> 2018-08-20 08:57:58.753 7f9d85934700 20 allow so far , doing grant > > >> allow profile osd > > >> 2018-08-20 08:57:58.753 7f9d85934700 20 match > > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 > > >> _allowed_command capable > > >> 2018-08-20 08:57:58.753 7f9d85934700 0 log_channel(audit) log [INF] : > > >> from='osd.48 10.24.52.17:6800/153683' entity='osd.48' cmd=[{"prefix": > > >> "osd crush set-device-class", "class": "ssd", "ids": ["48"]}]: > > >> dispatch > > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).osd e46327 > > >> preprocess_query mon_command({"prefix": "osd crush set-device-class", > > >> "class": "ssd", "ids": ["48"]} v 0) v1 from osd.48 > > >> 10.24.52.17:6800/153683 > > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 > > >> forward_request 4 request mon_command({"prefix": "osd crush > > >> set-device-class", "class": "ssd", "ids": ["48"]} v 0)
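For anyone hitting the same symptom, the empty host bucket that caused the fix above can be spotted directly in the CRUSH hierarchy. A minimal sketch of the standard CLI checks (the bucket name "node06" is a hypothetical example, not from this thread):

```shell
# Print the CRUSH hierarchy; an empty host shows up as a host bucket
# with no osd.* children under it.
ceph osd tree
ceph osd crush tree

# Rather than parking a fake OSD in the stale bucket, the empty host
# (e.g. "node06") can simply be removed from the CRUSH map:
ceph osd crush remove node06
```

These commands need a reachable cluster with admin credentials, so they are shown here only as an ops sketch.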
Re: [ceph-users] Mimic osd fails to start.
Hello,

Medic shows everything fine. The whole cluster is on the latest mimic version. It was updated to mimic when the stable version of mimic was released, and recently it was updated to "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)". For some reason one mgr service is running, but it's not connected to the cluster.

Versions output:

{
    "mon": {
        "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 3
    },
    "mgr": {
        "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 2
    },
    "osd": {
        "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 47
    },
    "mds": {},
    "overall": {
        "ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)": 52
    }
}

Medic output:

=== Starting remote check session
Version: 1.0.4    Cluster Name: "ceph"
Total hosts: [10]
OSDs: 5    MONs: 3    Clients: 0
MDSs: 0    RGWs: 0    MGRs: 2

-- managers --
mon03
mon02
mon01

-- osds --
node03
node02
node01
node05
node04

-- mons --
mon01
mon03
mon02

107 passed, on 11 hosts

On Mon, Aug 20, 2018 at 6:13 PM Alfredo Deza wrote:
>
> On Mon, Aug 20, 2018 at 10:23 AM, Daznis wrote:
> > Hello,
> >
> > It appears that something is horribly wrong with the cluster itself. I
> > can't create or add any new osds to it at all.
>
> Have you added new monitors? Or replaced monitors? I would check that
> all your versions match; something seems to be expecting different
> versions.
>
> The "Invalid argument" problem is a common thing we see when that happens.
>
> Something that might help a bit here is if you run ceph-medic against
> your cluster:
>
> http://docs.ceph.com/ceph-medic/master/
>
> >
> > On Mon, Aug 20, 2018 at 11:04 AM Daznis wrote:
> >>
> >> Hello,
> >>
> >> Zapping the journal didn't help. I tried to create the journal after
> >> zapping it. Also failed. I'm not really sure why this happens.
> >> > >> Looking at the monitor logs with 20/20 debug I'm seeing these errors: > >> > >> 2018-08-20 08:57:58.753 7f9d85934700 0 mon.mon02@1(peon) e4 > >> handle_command mon_command({"prefix": "osd crush set-device-class", > >> "class": "ssd", "ids": ["48"]} v 0) v1 > >> 2018-08-20 08:57:58.753 7f9d85934700 20 is_capable service=osd > >> command=osd crush set-device-class read write on cap allow profile osd > >> 2018-08-20 08:57:58.753 7f9d85934700 20 allow so far , doing grant > >> allow profile osd > >> 2018-08-20 08:57:58.753 7f9d85934700 20 match > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 > >> _allowed_command capable > >> 2018-08-20 08:57:58.753 7f9d85934700 0 log_channel(audit) log [INF] : > >> from='osd.48 10.24.52.17:6800/153683' entity='osd.48' cmd=[{"prefix": > >> "osd crush set-device-class", "class": "ssd", "ids": ["48"]}]: > >> dispatch > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).osd e46327 > >> preprocess_query mon_command({"prefix": "osd crush set-device-class", > >> "class": "ssd", "ids": ["48"]} v 0) v1 from osd.48 > >> 10.24.52.17:6800/153683 > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 > >> forward_request 4 request mon_command({"prefix": "osd crush > >> set-device-class", "class": "ssd", "ids": ["48"]} v 0) v1 features > >> 4611087854031142907 > >> 2018-08-20 08:57:58.753 7f9d85934700 20 mon.mon02@1(peon) e4 > >> _ms_dispatch existing session 0x55b4ec482a80 for mon.1 > >> 10.24.52.11:6789/0 > >> 2018-08-20 08:57:58.753 7f9d85934700 20 mon.mon02@1(peon) e4 caps allow * > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).log > >> v10758065 preprocess_query log(1 entries from seq 4 at 2018-08-20 > >> 08:57:58.755306) v1 from mon.1 10.24.52.11:6789/0 > >> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).log > >> v10758065 preprocess_log log(1 entries from seq 4 at 2018-08-20 > >> 08:57:58.755306) v1 from mon.1 > >> 2018-08-20 08:57:58.753 7f9d859347
Re: [ceph-users] Mimic osd fails to start.
Hello,

It appears that something is horribly wrong with the cluster itself. I can't create or add any new osds to it at all.

On Mon, Aug 20, 2018 at 11:04 AM Daznis wrote:
>
> Hello,
>
> Zapping the journal didn't help. I tried to create the journal after
> zapping it. Also failed. I'm not really sure why this happens.
>
> Looking at the monitor logs with 20/20 debug I'm seeing these errors:
>
> 2018-08-20 08:57:58.753 7f9d85934700  0 mon.mon02@1(peon) e4 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["48"]} v 0) v1
> 2018-08-20 08:57:58.753 7f9d85934700 20 is_capable service=osd command=osd crush set-device-class read write on cap allow profile osd
> 2018-08-20 08:57:58.753 7f9d85934700 20 allow so far , doing grant allow profile osd
> 2018-08-20 08:57:58.753 7f9d85934700 20 match
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 _allowed_command capable
> 2018-08-20 08:57:58.753 7f9d85934700  0 log_channel(audit) log [INF] : from='osd.48 10.24.52.17:6800/153683' entity='osd.48' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["48"]}]: dispatch
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).osd e46327 preprocess_query mon_command({"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["48"]} v 0) v1 from osd.48 10.24.52.17:6800/153683
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 forward_request 4 request mon_command({"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["48"]} v 0) v1 features 4611087854031142907
> 2018-08-20 08:57:58.753 7f9d85934700 20 mon.mon02@1(peon) e4 _ms_dispatch existing session 0x55b4ec482a80 for mon.1 10.24.52.11:6789/0
> 2018-08-20 08:57:58.753 7f9d85934700 20 mon.mon02@1(peon) e4 caps allow *
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).log v10758065 preprocess_query log(1 entries from seq 4 at 2018-08-20 08:57:58.755306) v1 from mon.1 10.24.52.11:6789/0
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon).log v10758065 preprocess_log log(1 entries from seq 4 at 2018-08-20 08:57:58.755306) v1 from mon.1
> 2018-08-20 08:57:58.753 7f9d85934700 20 is_capable service=log command= write on cap allow *
> 2018-08-20 08:57:58.753 7f9d85934700 20 allow so far , doing grant allow *
> 2018-08-20 08:57:58.753 7f9d85934700 20 allow all
> 2018-08-20 08:57:58.753 7f9d85934700 10 mon.mon02@1(peon) e4 forward_request 5 request log(1 entries from seq 4 at 2018-08-20 08:57:58.755306) v1 features 4611087854031142907
> 2018-08-20 08:57:58.754 7f9d85934700 20 mon.mon02@1(peon) e4 _ms_dispatch existing session 0x55b4ec4828c0 for mon.0 10.24.52.10:6789/0
> 2018-08-20 08:57:58.754 7f9d85934700 20 mon.mon02@1(peon) e4 caps allow *
> 2018-08-20 08:57:58.754 7f9d85934700 20 is_capable service=mon command= read on cap allow *
> 2018-08-20 08:57:58.754 7f9d85934700 20 allow so far , doing grant allow *
> 2018-08-20 08:57:58.754 7f9d85934700 20 allow all
> 2018-08-20 08:57:58.754 7f9d85934700 20 is_capable service=mon command= exec on cap allow *
> 2018-08-20 08:57:58.754 7f9d85934700 20 allow so far , doing grant allow *
> 2018-08-20 08:57:58.754 7f9d85934700 20 allow all
> 2018-08-20 08:57:58.754 7f9d85934700 10 mon.mon02@1(peon) e4 handle_route mon_command_ack([{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["48"]}]=-22 (22) Invalid argument v46327) v1 to unknown.0 -
> 2018-08-20 08:57:58.785 7f9d85934700 10 mon.mon02@1(peon) e4 ms_handle_reset 0x55b4ecf4b200 10.24.52.17:6800/153683
> 2018-08-20 08:57:58.785 7f9d85934700 10 mon.mon02@1(peon) e4 reset/close on session osd.48 10.24.52.17:6800/153683
> 2018-08-20 08:57:58.785 7f9d85934700 10 mon.mon02@1(peon) e4 remove_session 0x55b4ecf86380 osd.48 10.24.52.17:6800/153683 features 0x3ffddff8ffa4fffb
> 2018-08-20 08:57:58.828 7f9d85934700 20 mon.mon02@1(peon) e4 _ms_dispatch existing session 0x55b4ec4828c0 for mon.0 10.24.52.10:6789/0
>
> On Sat, Aug 18, 2018 at 7:54 PM Daznis wrote:
> >
> > Hello,
> >
> > not sure about it. I assumed ceph-deploy would do it with the
> > "--zap-disk" flag defined. I will try it on Monday and report the
> > progress.
> >
> > On Sat, Aug 18, 2018 at 3:02 PM Alfredo Deza wrote:
> > >
> > > On Fri, Aug 17, 2018 at 7:05 PM, Daznis wrote:
> > > > Hello,
> > > >
[ceph-users] Mimic osd fails to start.
Hello,

I have replaced one of our failed OSD drives and recreated a new OSD with ceph-deploy, and it fails to start.

Command:
ceph-deploy --overwrite-conf osd create --filestore --zap-disk --data /dev/bcache0 --journal /dev/nvme0n1p13

Output of ceph-deploy:

[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (2.0.1): /usr/bin/ceph-deploy --overwrite-conf osd create --filestore --zap-disk --data /dev/bcache0 --journal /dev/nvme0n1p13
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  verbose         : False
[ceph_deploy.cli][INFO  ]  bluestore       : None
[ceph_deploy.cli][INFO  ]  cd_conf         :
[ceph_deploy.cli][INFO  ]  cluster         : ceph
[ceph_deploy.cli][INFO  ]  fs_type         : xfs
[ceph_deploy.cli][INFO  ]  block_wal       : None
[ceph_deploy.cli][INFO  ]  default_release : False
[ceph_deploy.cli][INFO  ]  username        : None
[ceph_deploy.cli][INFO  ]  journal         : /dev/nvme0n1p13
[ceph_deploy.cli][INFO  ]  subcommand      : create
[ceph_deploy.cli][INFO  ]  host            :
[ceph_deploy.cli][INFO  ]  filestore       : True
[ceph_deploy.cli][INFO  ]  func            :
[ceph_deploy.cli][INFO  ]  ceph_conf       : None
[ceph_deploy.cli][INFO  ]  zap_disk        : True
[ceph_deploy.cli][INFO  ]  data            : /dev/bcache0
[ceph_deploy.cli][INFO  ]  block_db        : None
[ceph_deploy.cli][INFO  ]  dmcrypt         : False
[ceph_deploy.cli][INFO  ]  overwrite_conf  : True
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet           : False
[ceph_deploy.cli][INFO  ]  debug           : False
[ceph_deploy.osd][DEBUG ] Creating OSD on cluster ceph with data device /dev/bcache0
[][DEBUG ] connected to host:
[][DEBUG ] detect platform information from remote host
[][DEBUG ] detect machine type
[][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.5.1804 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to
[][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[][DEBUG ] find the location of an executable
[ceph_deploy.osd][WARNIN] zapping is no longer supported when preparing
[][INFO  ] Running command: /usr/sbin/ceph-volume --cluster ceph lvm create --filestore --data /dev/bcache0 --journal /dev/nvme0n1p13
[][DEBUG ] Running command: /bin/ceph-authtool --gen-print-key
[][DEBUG ] Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a503ae5e-b5b9-40d7-b8b3-194f15e52082
[][DEBUG ] Running command: /usr/sbin/vgcreate --force --yes ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162 /dev/bcache0
[][DEBUG ] stdout: Physical volume "/dev/bcache0" successfully created.
[][DEBUG ] stdout: Volume group "ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162" successfully created
[][DEBUG ] Running command: /usr/sbin/lvcreate --yes -l 100%FREE -n osd-data-a503ae5e-b5b9-40d7-b8b3-194f15e52082 ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162
[][DEBUG ] stdout: Logical volume "osd-data-a503ae5e-b5b9-40d7-b8b3-194f15e52082" created.
[][DEBUG ] Running command: /bin/ceph-authtool --gen-print-key
[][DEBUG ] Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 /dev/ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162/osd-data-a503ae5e-b5b9-40d7-b8b3-194f15e52082
[][DEBUG ] stdout: meta-data=/dev/ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162/osd-data-a503ae5e-b5b9-40d7-b8b3-194f15e52082 isize=2048 agcount=4, agsize=244154112 blks
[][DEBUG ]          =              sectsz=512   attr=2, projid32bit=1
[][DEBUG ]          =              crc=1        finobt=0, sparse=0
[][DEBUG ] data     =              bsize=4096   blocks=976616448, imaxpct=5
[][DEBUG ]          =              sunit=0      swidth=0 blks
[][DEBUG ] naming   =version 2     bsize=4096   ascii-ci=0 ftype=1
[][DEBUG ] log      =internal log  bsize=4096   blocks=476863, version=2
[][DEBUG ]          =              sectsz=512   sunit=0 blks, lazy-count=1
[][DEBUG ] realtime =none          extsz=4096   blocks=0, rtextents=0
[][DEBUG ] Running command: /bin/mount -t xfs -o rw,noatime,inode64,noquota,nodiratime,logbufs=8,logbsize=256k,attr2 /dev/ceph-a1ffe5bb-6f06-49c6-8aec-e3eb3a311162/osd-data-a503ae5e-b5b9-40d7-b8b3-194f15e52082 /var/lib/ceph/osd/ceph-48
[][DEBUG ] Running command: /bin/chown -R ceph:ceph /dev/nvme0n1p13
[][DEBUG ] Running command: /bin/ln -s /dev/nvme0n1p13 /var/lib/ceph/osd/ceph-48/journal
[][DEBUG ] Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
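Since ceph-deploy warns that zapping is no longer supported during prepare, the zap and the create can also be run separately on the OSD host itself. A sketch of the direct ceph-volume calls (device paths taken from the log above; ceph-deploy invokes the same `lvm create` under the hood, per the log):

```shell
# Wipe the data device first; this destroys everything on /dev/bcache0.
ceph-volume lvm zap /dev/bcache0

# Then create the filestore OSD directly on the host:
ceph-volume --cluster ceph lvm create --filestore \
    --data /dev/bcache0 --journal /dev/nvme0n1p13
```

These commands must run as root on the OSD node with the bootstrap-osd keyring in place, so they are an ops sketch rather than something runnable in isolation.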
Re: [ceph-users] limited disk slots - should I ran OS on SD card ?
Hi,

We used PXE boot with an NFS server, but had some issues if the NFS server crapped out and dropped connections or needed a reboot for maintenance. If I remember correctly, it sometimes took out some of the rebooted servers. So we switched to PXE with livecd-based images. You basically create a livecd image, then boot it with a specially prepared initramfs image, and it uses a copy-on-write disk for basic storage. With mimic, OSDs are started automatically; you just need to feed in some basic settings for that server.

On Fri, Aug 17, 2018 at 11:31 AM Florian Florensa wrote:
>
> What about PXE booting the OSD servers? I am considering doing this
> sort of thing as it doesn't seem that complicated.
> A simple script could easily bring the osd back online using some lvm
> commands to bring the lvm back online and then some ceph-lvm activate
> command to fire the osd's back up.
>
> 2018-08-15 16:09 GMT+02:00 Götz Reinicke :
> > Hi,
> >
> >> Am 15.08.2018 um 15:11 schrieb Steven Vacaroaia :
> >>
> >> Thank you all
> >>
> >> Since all concerns were about reliability, I am assuming the performance
> >> impact of having the OS running on an SD card is minimal / negligible
> >
> > some time ago we had some Cisco Blades booting VMware ESXi from SD cards
> > and had no issue for months ... till after an update the blade was rebooted and
> > the SD failed ... and then another one on another server ... From my POV at that
> > time the "server" SDs were not close to as reliable as SSDs or rotating
> > disks. My experiences from some years ago.
> >
> >>
> >> In other words, an OSD server is not writing/reading from the Linux OS
> >> partitions too much (especially with logs at a minimum),
> >> so its performance is not dependent on what type of disk the OS resides on
> >
> > Regarding performance: What kind of SDs are supported? You can get some
> > "SDXC | UHS-II | U3 | Class 10 | V90" which can handle up to 260
> > MBytes/sec, like the "Angelbird Match Pack EVA1"; ok, they are Panasonic 4K Camera
> > certified (and we use them currently to record 4K video)
> >
> > https://www.angelbird.com/prod/match-pack-for-panasonic-eva1-1836/
> >
> > My 2 cents. Götz
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Reducing placement groups.
Hello,

I remember watching one of the ceph monthly videos on YouTube, and there was a talk saying that pg_num reduction would be available in mimic, but I can't find any info about it. Was this feature delayed?
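For reference, the feature being asked about did not ship in Mimic; PG merging (the ability to decrease pg_num on a pool) arrived later, in Nautilus. On a release that supports it, the reduction itself is a plain pool setting; a sketch with a hypothetical pool name:

```shell
# Decrease pg_num on a pool; the mons merge PGs gradually in the
# background (pool name "rbd" is just an example).
ceph osd pool set rbd pg_num 64

# On releases with merge support pgp_num follows automatically; on
# older releases it had to be set explicitly after pg_num changes:
ceph osd pool set rbd pgp_num 64
```

These commands require a live cluster of a suitable release, so treat this as an ops sketch rather than something applicable to the Mimic cluster in the question.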
[ceph-users] Strange OSD crash starts other osd flapping
Hello,

Yesterday I encountered a strange OSD crash which led to cluster flapping. I had to force the nodown flag on the cluster to stop the flapping.

The first OSD crashed with:

2018-08-02 17:23:23.275417 7f87ec8d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had timed out after 15
2018-08-02 17:23:23.275425 7f87ec8d7700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8805dff700' had timed out after 15
2018-08-02 17:25:38.902142 7f8829df0700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f8803dfb700' had suicide timed out after 150
2018-08-02 17:25:38.907199 7f8829df0700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f8829df0700 time 2018-08-02 17:25:38.902354
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55872911fb65]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x55872905e8f1]
 3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x55872905f14e]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x55872905f92c]
 5: (CephContextServiceThread::entry()+0x15b) [0x55872913790b]
 6: (()+0x7e25) [0x7f882dc71e25]
 7: (clone()+0x6d) [0x7f882c2f8bad]

Then other OSDs started restarting with messages like this:

2018-08-02 17:37:14.859272 7f4bd31fe700  0 osd.44 184343 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.00 seconds, shutting down
2018-08-02 17:37:14.870121 7f4bd31fe700  0 osd.44 184343 _committed_osd_maps shutdown OSD via async signal
2018-08-02 17:37:14.870159 7f4bb9618700 -1 osd.44 184343 *** Got signal Interrupt ***
2018-08-02 17:37:14.870167 7f4bb9618700  0 osd.44 184343 prepare_to_stop starting shutdown

There is a 10k-line event dump with the first OSD crash. I have looked through it and nothing strange stuck out to me. Any suggestions on what I should be looking for in it?

I have checked the nodes' dmesg and the switch port logs. No info on flapping ports or interfaces, and no errors at all from the disks.
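The nodown flag mentioned above is set and cleared with the standard CLI; a short sketch, since forgetting to clear it is a common pitfall (it masks genuinely dead OSDs):

```shell
# Stop the mons from marking flapping OSDs down while investigating:
ceph osd set nodown

# ... investigate / restart daemons ...

# Clear the flag afterwards, otherwise dead OSDs stay marked "up":
ceph osd unset nodown
```

Both commands need a reachable cluster, so this is an ops sketch only.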
[ceph-users] Ceph issue too many open files.
Hi,

About two weeks ago something strange started happening with one of the ceph clusters I'm managing. It's running ceph jewel 10.2.10 with a cache layer. Some OSDs started crashing with a "too many open files" error. Looking into the issue, I found that the OSD keeps a lot of links in /proc/self/fd, and once the 1 million limit is reached it crashes. I tried increasing the limit to 2 million, but the same thing happened. The problem is that /proc/self/fd is not being cleared, even though there are only about 900k inodes used inside the OSD drive.

Once the OSD is restarted and a scrub starts, I'm getting missing shard errors:

2018-07-15 18:32:26.554348 7f604ebd1700 -1 log_channel(cluster) log [ERR] : 6.58 shard 51 missing 6:1a3a2565:::rbd_data.314da9e52da0f2.d570:head

OSD crash log:

    -4> 2018-07-15 17:40:25.566804 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44) error (24) Too many open files not handled on operation 0x7f970e0274c0 (5142329351.0.0, or op 0, counting from 0)
    -3> 2018-07-15 17:40:25.566825 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44) unexpected error code
    -2> 2018-07-15 17:40:25.566829 7f97143fe700  0 filestore(/var/lib/ceph/osd/ceph-44) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "touch",
            "collection": "6.f0_head",
            "oid": "#-8:0f00:::temp_6.f0_0_55255967_2688:head#"
        },
        {
            "op_num": 1,
            "op_name": "write",
            "collection": "6.f0_head",
            "oid": "#-8:0f00:::temp_6.f0_0_55255967_2688:head#",
            "length": 65536,
            "offset": 0,
            "bufferlist length": 65536
        },
        {
            "op_num": 2,
            "op_name": "omap_setkeys",
            "collection": "6.f0_head",
            "oid": "#6:0f00head#",
            "attr_lens": {
                "_info": 925
            }
        }
    ]
}
    -1> 2018-07-15 17:40:25.566886 7f97143fe700 -1 dump_open_fds unable to open /proc/self/fd
     0> 2018-07-15 17:40:25.569564 7f97143fe700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f97143fe700 time 2018-07-15 17:40:25.566888
os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error")

Any insight on how to fix this issue is appreciated.

Regards,
Darius
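Before bumping the limit again, it can help to compare the process's actual RLIMIT_NOFILE values with the live descriptor count, the same /proc/self/fd directory the OSD failed to open. A small illustrative Python sketch (Linux-only; run it on the OSD host, substituting the OSD's pid to inspect that daemon instead of the script itself):

```python
import os
import resource

# Soft/hard limits on open file descriptors for this process; an OSD
# hitting "Too many open files" has exhausted its soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# On Linux, /proc/<pid>/fd has one entry per open descriptor, so
# counting entries shows how close a process is to the limit.
def open_fd_count(pid="self"):
    return len(os.listdir(f"/proc/{pid}/fd"))

print(f"currently open fds: {open_fd_count()}")
```

Watching that count grow without bound while the OSD runs would confirm a descriptor leak rather than a limit that is simply too low.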
Re: [ceph-users] Ceph cache tier removal.
Hello,

On Tue, Jan 10, 2017 at 11:11 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daznis
>> Sent: 09 January 2017 12:54
>> To: ceph-users <ceph-users@lists.ceph.com>
>> Subject: [ceph-users] Ceph cache tier removal.
>>
>> Hello,
>>
>> I'm running preliminary tests on cache tier removal on a live cluster, before
>> I try to do that on a production one. I'm trying to avoid
>> downtime, but from what I noticed it's either impossible or I'm doing
>> something wrong. My cluster is running CentOS 7.2 and ceph 0.94.9.
>>
>> Example 1:
>> I'm setting the cache layer to forward:
>> 1. ceph osd tier cache-mode test-cache forward
>> Then flushing the cache:
>> 1. rados -p test-cache cache-flush-evict-all
>> Then I'm getting stuck with some objects that can't be removed:
>>
>> rbd_header.29c3cdb2ae8944a
>> failed to evict /rbd_header.29c3cdb2ae8944a: (16) Device or resource busy
>> rbd_header.28c96316763845e
>> failed to evict /rbd_header.28c96316763845e: (16) Device or resource busy
>> error from cache-flush-evict-all: (1) Operation not permitted
>>
>
> These are probably the objects which have watchers attached. The current
> evict logic seems to be unable to evict these, hence the error. I'm not sure
> if anything can be done to work around this other than what you have tried,
> i.e. stopping the VM, which will remove the watcher.

You can move them from the cache pool once you remove the tier overlay. But I wasn't sure about the data consistency, so I ran a few tests to confirm. I spawned a few VMs that were just idling, a few that were writing small files to disk with a consistent crc, and a few that were writing larger files to disk with the sync option. I ran it multiple times; I don't remember the exact number, as I was really waiting for a crc mismatch or a general VM crash, but it was 20+ times.

You flush the cache a few times, until no new objects appear in it. Then do a flush followed by overlay removal. After about a minute the header files will unlock and you will be able to flush them down to cold storage. Once that was done, I ran a crc check on everything I was verifying. So I'm pretty confident that I will not lose any data while doing this on a live/production server. I will run a few more tests and decide what to do then. And if I do this on production I will report the progress. Maybe this will help others struggling with similar options.

>
>> I found a workaround for this. You can bypass these errors by
>> 1. running ceph osd tier remove-overlay test-pool, or
>> 2. turning off the VMs that are using them.
>>
>> For the second option: I can boot the VMs normally after recreating a new
>> overlay/cache tier. At this point everything is working fine,
>> but I'm trying to avoid downtime, as it takes almost 8h to start and check
>> everything to be in optimal condition.
>>
>> Now for the first part: I can remove the overlay and flush the cache layer,
>> and the VMs are running fine with it removed. Issues start after I
>> have re-added the cache layer to the cold pool and try to write/read from the
>> disk. For no apparent reason the VMs just freeze, and you
>> need to force stop/start all VMs to get them working again.
>
> Which pool are the VMs being pointed at, base or cache? I'm wondering if it's
> something to do with the pool id changing?

They were pointing to the base pool. After reading about it online, I found that I can add the tier with live machines. You just need to run these commands:

1. "ceph osd tier add cold-pool cache-pool --force-nonempty"
2. "ceph osd tier cache-mode cache-pool forward" <--- no other mode seems to work, only forward. Plus you need to wait a while for all the rbd_header objects to reappear in this pool before switching cache-mode, or the VMs will crash.
3. "ceph osd tier set-overlay cold-pool cache-pool" <--- after you run this, header objects should start appearing in it: rados -p cache-pool ls

>
>> From what I have read about it, all objects should leave the cache tier and
>> you shouldn't have to "force" removing the tier with objects still in it.
>>
>> Now onto the questions:
>>
>> 1. Is it normal for VPS to freeze while adding a cache layer/tier?
>> 2. Do VMs need to be offline to remove the caching layer?
>> 3. I have read somewhere that snapshots might interfere with cache tier
>>    clean up. Is it true?
>> 4. Are there some other ways to remove the caching tier on a live system?
>>
>> Regards,
>>
>> Darius

Regards,
Darius
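Collected from the steps in the message above, the live re-add sequence looks like this (pool names as used in the thread; step ordering and the wait for the header objects come straight from the poster's account):

```shell
# 1. Re-attach the cache pool even though it still holds objects:
ceph osd tier add cold-pool cache-pool --force-nonempty

# 2. Forward mode only; wait for the rbd_header objects to reappear
#    in the cache pool before proceeding, or the VMs will crash:
ceph osd tier cache-mode cache-pool forward

# 3. Point clients back at the cache pool:
ceph osd tier set-overlay cold-pool cache-pool

# Watch the header objects come back:
rados -p cache-pool ls
```

All of these require admin access to the cluster, so treat this as a consolidated ops sketch of the thread's procedure rather than a general recipe.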
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
Hello Nick,

Thank you for your help. We have contacted Red Hat for additional help, and they think this bug is related to the gmt bug in ceph version 94.7. I'm not really sure how that can be, as the cluster was using the 94.6/94.9 versions. After a month+ of slowly moving data, I'm now running the same versions of OS/software across the whole ceph cluster and need to recreate the cache layer to remove those missing hit set errors. The only solution so far was setting hit_set_count to 0 and removing the cache layer. I will update this ticket once I'm done recreating the cache layer, noting whether those errors are gone completely.

Regards,
Darius

On Fri, Nov 25, 2016 at 4:20 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> It might be worth trying to raise a ticket with those errors and say that you
> believe they occurred after PG splitting on the cache tier, and also include
> the asserts you originally posted.
>
>> -----Original Message-----
>> From: Daznis [mailto:daz...@gmail.com]
>> Sent: 25 November 2016 13:59
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: ceph-users <ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD.
>>
>> I think it's because of these errors:
>>
>> 2016-11-25 14:51:25.644495 7fb73eef8700 -1 log_channel(cluster) log [ERR] :
>> 14.28 deep-scrub stat mismatch, got 145/144 objects, 0/0
>> clones, 57/57 dirty, 0/0 omap, 54/53 hit_set_archive, 0/0 whiteouts,
>> 365399477/365399252 bytes, 51328/51103 hit_set_archive bytes.
>>
>> 2016-11-25 14:55:56.529405 7f89bae5a700 -1 log_channel(cluster) log [ERR] :
>> 13.dd deep-scrub stat mismatch, got 149/148 objects, 0/0
>> clones, 55/55 dirty, 0/0 omap, 63/61 hit_set_archive, 0/0 whiteouts,
>> 360765725/360765503 bytes, 55581/54097 hit_set_archive bytes.
>>
>> I have no clue why they appeared. The cluster was running fine for months, so
>> I have no logs on how it happened. I just enabled them
>> after "shit hit the fan".
>> >> >> On Fri, Nov 25, 2016 at 12:26 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> > Possibly, do you know the exact steps to reproduce? I'm guessing the PG >> > splitting was the cause, but whether this on its own would >> cause the problem or also needs the introduction of new OSD's at the same >> time, might make tracing the cause hard. >> > >> >> -Original Message- >> >> From: Daznis [mailto:daz...@gmail.com] >> >> Sent: 24 November 2016 19:44 >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> I will try it, but I wanna see if it stays stable for a few days. Not >> >> sure if I should report this bug or not. >> >> >> >> On Thu, Nov 24, 2016 at 6:05 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> > Can you add them with different ID's, it won't look pretty but might >> >> > get you out of this situation? >> >> > >> >> >> -Original Message- >> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On >> >> >> Behalf Of Daznis >> >> >> Sent: 24 November 2016 15:43 >> >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> >> >> Yes, unfortunately, it is. And the story still continues. I have >> >> >> noticed that only 4 OSD are doing this and zapping and readding it >> >> >> does not solve the issue. Removing them completely from the >> >> >> cluster solve that issue, but I can't reuse their ID's. If I add >> >> >> another >> >> one with the same ID it starts doing the same "funky" crashes. For now >> >> the cluster remains "stable" without the OSD's. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Wed, Nov 23, 2016 at 4:00 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> >> > I take it you have size =2 or min_size=1 or something like that for >> >> >> > the cache pool? 
1 OSD shouldn’t prevent PG's from >> recovering. >> >> >> > >> >> >> > Your best bet would be to see if the PG that is causing the >> >> >> > as
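For deep-scrub stat mismatches like the ones quoted in this thread, the usual first step is to re-scrub and, if the mismatch persists, repair the affected PGs. A hedged sketch (the PG IDs 14.28 and 13.dd come from the log excerpts above; only run repair once you understand why the stats diverged, since on 0.94.x it rewrites the primary's stat totals):

```shell
# Re-run a deep scrub on the affected PGs to confirm the mismatch persists
ceph pg deep-scrub 14.28
ceph pg deep-scrub 13.dd

# If the mismatch is reproducible, ask Ceph to repair the PGs,
# which recalculates and rewrites the stat totals
ceph pg repair 14.28
ceph pg repair 13.dd

# Verify the scrub errors have cleared
ceph health detail
```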
[ceph-users] Ceph cache tier removal.
Hello,

I'm running preliminary tests of cache tier removal on a live cluster before I try it on a production one. I'm trying to avoid downtime, but from what I've seen it's either impossible or I'm doing something wrong. My cluster is running CentOS 7.2 and Ceph 0.94.9.

Example 1: I set the cache layer to forward mode:

1. ceph osd tier cache-mode test-cache forward

Then flush the cache:

1. rados -p test-cache cache-flush-evict-all

Then I get stuck with some objects that can't be removed:

rbd_header.29c3cdb2ae8944a failed to evict /rbd_header.29c3cdb2ae8944a: (16) Device or resource busy
rbd_header.28c96316763845e failed to evict /rbd_header.28c96316763845e: (16) Device or resource busy
error from cache-flush-evict-all: (1) Operation not permitted

I found a workaround for this. You can bypass these errors by either:

1. running "ceph osd tier remove-overlay test-pool", or
2. turning off the VMs that are using those objects.

For the second option: I can boot the VMs normally after recreating a new overlay/cache tier. At that point everything works fine, but I'm trying to avoid downtime, as it takes almost 8 hours to start everything and check that it's all in optimal condition.

Now for the first option: I can remove the overlay and flush the cache layer, and the VMs run fine with it removed. Issues start after I have re-added the cache layer to the cold pool and try to write/read from the disk. For no apparent reason the VMs just freeze, and you need to force stop/start all of them before they work again. From what I have read, all objects should leave the cache tier, and you shouldn't have to "force" remove the tier while it still holds objects.

Now onto the questions:

1. Is it normal for VMs to freeze while adding a cache layer/tier?
2. Do VMs need to be offline to remove the caching layer?
3. I have read somewhere that snapshots might interfere with cache tier clean-up. Is it true?
4. Are there other ways to remove the caching tier on a live system?
Regards, Darius ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
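For reference, the full forward-mode removal sequence being tested above looks roughly like this (pool names test-cache/test-pool are the ones from this test; newer Ceph releases also require --yes-i-really-mean-it on the cache-mode change, and the flush may need retrying while clients still hold watchers on rbd_header objects):

```shell
# 1. Stop new writes from landing in the cache tier
ceph osd tier cache-mode test-cache forward

# 2. Flush and evict everything the tier will let go of
rados -p test-cache cache-flush-evict-all

# 3. Objects with active watchers (rbd_header.*) may refuse to evict
#    with EBUSY; removing the overlay first, or re-running the flush
#    after the clients drop their watch, works around this
ceph osd tier remove-overlay test-pool

# 4. Finally detach the (now empty) cache tier from the base pool
ceph osd tier remove test-pool test-cache
```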
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
I think it's because of these errors: 2016-11-25 14:51:25.644495 7fb73eef8700 -1 log_channel(cluster) log [ERR] : 14.28 deep-scrub stat mismatch, got 145/144 objects, 0/0 clones, 57/57 dirty, 0/0 omap, 54/53 hit_set_archive, 0/0 whiteouts, 365399477/365399252 bytes,51328/51103 hit_set_archive bytes. 2016-11-25 14:55:56.529405 7f89bae5a700 -1 log_channel(cluster) log [ERR] : 13.dd deep-scrub stat mismatch, got 149/148 objects, 0/0 clones, 55/55 dirty, 0/0 omap, 63/61 hit_set_archive, 0/0 whiteouts, 360765725/360765503 bytes,55581/54097 hit_set_archive bytes. I have no clue why they appeared. The cluster was running fine for months so I have no logs on how it happened. I just enabled them after "shit hit the fan". On Fri, Nov 25, 2016 at 12:26 PM, Nick Fisk <n...@fisk.me.uk> wrote: > Possibly, do you know the exact steps to reproduce? I'm guessing the PG > splitting was the cause, but whether this on its own would cause the problem > or also needs the introduction of new OSD's at the same time, might make > tracing the cause hard. > >> -Original Message- >> From: Daznis [mailto:daz...@gmail.com] >> Sent: 24 November 2016 19:44 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> I will try it, but I wanna see if it stays stable for a few days. Not sure >> if I should report this bug or not. >> >> On Thu, Nov 24, 2016 at 6:05 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> > Can you add them with different ID's, it won't look pretty but might get >> > you out of this situation? >> > >> >> -Original Message- >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> >> Of Daznis >> >> Sent: 24 November 2016 15:43 >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> Yes, unfortunately, it is. And the story still continues. 
I have >> >> noticed that only 4 OSD are doing this and zapping and readding it >> >> does not solve the issue. Removing them completely from the cluster solve >> >> that issue, but I can't reuse their ID's. If I add another >> one with the same ID it starts doing the same "funky" crashes. For now the >> cluster remains "stable" without the OSD's. >> >> >> >> >> >> >> >> >> >> On Wed, Nov 23, 2016 at 4:00 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> > I take it you have size =2 or min_size=1 or something like that for the >> >> > cache pool? 1 OSD shouldn’t prevent PG's from recovering. >> >> > >> >> > Your best bet would be to see if the PG that is causing the assert >> >> > can be removed and let the OSD start up. If you are lucky, the PG >> >> causing the problems might not be one which also has unfound objects, >> >> otherwise you are likely have to get heavily involved in recovering >> >> objects with the object store tool. >> >> > >> >> >> -Original Message- >> >> >> From: Daznis [mailto:daz...@gmail.com] >> >> >> Sent: 23 November 2016 13:56 >> >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> >> >> No, it's still missing some PGs and objects and can't recover as >> >> >> it's blocked by that OSD. I can boot the OSD up by removing all >> >> >> the PG related files from current directory, but that doesn't >> >> >> solve the missing objects problem. Not really sure if I can move >> >> >> the object >> >> back to their place manually, but I will try it. >> >> >> >> >> >> On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> >> > Sorry, I'm afraid I'm out of ideas about that one, that error >> >> >> > doesn't mean very much to me. The code suggests the OSD is >> >> >> > trying to >> >> >> get an attr from the disk/filesystem, but for some reason it >> >> >> doesn't like that. 
You could maybe whack the debug logging for OSD >> >> >> and filestore up to max
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
I will try it, but I wanna see if it stays stable for a few days. Not sure if I should report this bug or not. On Thu, Nov 24, 2016 at 6:05 PM, Nick Fisk <n...@fisk.me.uk> wrote: > Can you add them with different ID's, it won't look pretty but might get you > out of this situation? > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Daznis >> Sent: 24 November 2016 15:43 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> Yes, unfortunately, it is. And the story still continues. I have noticed >> that only 4 OSD are doing this and zapping and readding it does >> not solve the issue. Removing them completely from the cluster solve that >> issue, but I can't reuse their ID's. If I add another one with >> the same ID it starts doing the same "funky" crashes. For now the cluster >> remains "stable" without the OSD's. >> >> >> >> >> On Wed, Nov 23, 2016 at 4:00 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> > I take it you have size =2 or min_size=1 or something like that for the >> > cache pool? 1 OSD shouldn’t prevent PG's from recovering. >> > >> > Your best bet would be to see if the PG that is causing the assert can be >> > removed and let the OSD start up. If you are lucky, the PG >> causing the problems might not be one which also has unfound objects, >> otherwise you are likely have to get heavily involved in >> recovering objects with the object store tool. >> > >> >> -Original Message- >> >> From: Daznis [mailto:daz...@gmail.com] >> >> Sent: 23 November 2016 13:56 >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> No, it's still missing some PGs and objects and can't recover as it's >> >> blocked by that OSD. 
I can boot the OSD up by removing all the PG >> >> related files from current directory, but that doesn't solve the missing >> >> objects problem. Not really sure if I can move the object >> back to their place manually, but I will try it. >> >> >> >> On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> > Sorry, I'm afraid I'm out of ideas about that one, that error >> >> > doesn't mean very much to me. The code suggests the OSD is trying >> >> > to >> >> get an attr from the disk/filesystem, but for some reason it doesn't >> >> like that. You could maybe whack the debug logging for OSD and >> >> filestore up to max and try and see what PG/file is accessed just before >> >> the crash, but I'm not sure what the fix would be, even if >> you manage to locate the dodgy PG. >> >> > >> >> > Does the cluster have all PG's recovered now? Unless anyone else >> >> > can comment, you might be best removing/wiping and then re- >> >> adding the OSD. >> >> > >> >> >> -Original Message- >> >> >> From: Daznis [mailto:daz...@gmail.com] >> >> >> Sent: 23 November 2016 12:55 >> >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> >> >> Thank you. That helped quite a lot. Now I'm just stuck with one OSD >> >> >> crashing with: >> >> >> >> >> >> osd/PG.cc: In function 'static int >> >> >> PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, >> >> >> ceph::bufferlist*)' thread 7f36bbdd6880 time >> >> >> 2016-11-23 13:42:43.27 >> >> >> 8539 >> >> >> osd/PG.cc: 2911: FAILED assert(r > 0) >> >> >> >> >> >> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) >> >> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> >> >> const*)+0x85) [0xbde2c5] >> >> >> 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, >> >> >> ceph::buffer::list*)+0x8ba) [0x7cf4da] >> >> >> 3: (OSD::load_pgs()+0x9ef) [0x6bd31f] >> >> >&g
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
Yes, unfortunately, it is. And the story still continues. I have noticed that only 4 OSD are doing this and zapping and readding it does not solve the issue. Removing them completely from the cluster solve that issue, but I can't reuse their ID's. If I add another one with the same ID it starts doing the same "funky" crashes. For now the cluster remains "stable" without the OSD's. On Wed, Nov 23, 2016 at 4:00 PM, Nick Fisk <n...@fisk.me.uk> wrote: > I take it you have size =2 or min_size=1 or something like that for the cache > pool? 1 OSD shouldn’t prevent PG's from recovering. > > Your best bet would be to see if the PG that is causing the assert can be > removed and let the OSD start up. If you are lucky, the PG causing the > problems might not be one which also has unfound objects, otherwise you are > likely have to get heavily involved in recovering objects with the object > store tool. > >> -Original Message- >> From: Daznis [mailto:daz...@gmail.com] >> Sent: 23 November 2016 13:56 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> No, it's still missing some PGs and objects and can't recover as it's >> blocked by that OSD. I can boot the OSD up by removing all the PG >> related files from current directory, but that doesn't solve the missing >> objects problem. Not really sure if I can move the object back to >> their place manually, but I will try it. >> >> On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> > Sorry, I'm afraid I'm out of ideas about that one, that error doesn't mean >> > very much to me. The code suggests the OSD is trying to >> get an attr from the disk/filesystem, but for some reason it doesn't like >> that. 
You could maybe whack the debug logging for OSD and >> filestore up to max and try and see what PG/file is accessed just before the >> crash, but I'm not sure what the fix would be, even if you >> manage to locate the dodgy PG. >> > >> > Does the cluster have all PG's recovered now? Unless anyone else can >> > comment, you might be best removing/wiping and then re- >> adding the OSD. >> > >> >> -Original Message- >> >> From: Daznis [mailto:daz...@gmail.com] >> >> Sent: 23 November 2016 12:55 >> >> To: Nick Fisk <n...@fisk.me.uk> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> Thank you. That helped quite a lot. Now I'm just stuck with one OSD >> >> crashing with: >> >> >> >> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, >> >> spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time >> >> 2016-11-23 13:42:43.27 >> >> 8539 >> >> osd/PG.cc: 2911: FAILED assert(r > 0) >> >> >> >> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) >> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> >> const*)+0x85) [0xbde2c5] >> >> 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, >> >> ceph::buffer::list*)+0x8ba) [0x7cf4da] >> >> 3: (OSD::load_pgs()+0x9ef) [0x6bd31f] >> >> 4: (OSD::init()+0x181a) [0x6c0e8a] >> >> 5: (main()+0x29dd) [0x6484bd] >> >> 6: (__libc_start_main()+0xf5) [0x7f36b916bb15] >> >> 7: /usr/bin/ceph-osd() [0x661ea9] >> >> >> >> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> >> -Original Message- >> >> >> From: Daznis [mailto:daz...@gmail.com] >> >> >> Sent: 23 November 2016 10:17 >> >> >> To: n...@fisk.me.uk >> >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> >> >> Hi, >> >> >> >> >> >> >> >> >> Looks like one of my colleagues increased the PG number before it >> >> >> finished. 
I was flushing the whole cache tier and it's currently >> >> >> stuck on ~80 GB of data, because of the OSD crashes. I will look >> >> >> into the hitset counts and check what can be done. Will provide an >> >> update if I find anything or fix the issue. >> >> > >> >
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
No, it's still missing some PGs and objects and can't recover as it's blocked by that OSD. I can boot the OSD up by removing all the PG related files from current directory, but that doesn't solve the missing objects problem. Not really sure if I can move the object back to their place manually, but I will try it. On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk <n...@fisk.me.uk> wrote: > Sorry, I'm afraid I'm out of ideas about that one, that error doesn't mean > very much to me. The code suggests the OSD is trying to get an attr from the > disk/filesystem, but for some reason it doesn't like that. You could maybe > whack the debug logging for OSD and filestore up to max and try and see what > PG/file is accessed just before the crash, but I'm not sure what the fix > would be, even if you manage to locate the dodgy PG. > > Does the cluster have all PG's recovered now? Unless anyone else can comment, > you might be best removing/wiping and then re-adding the OSD. > >> -Original Message- >> From: Daznis [mailto:daz...@gmail.com] >> Sent: 23 November 2016 12:55 >> To: Nick Fisk <n...@fisk.me.uk> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> Thank you. That helped quite a lot. 
Now I'm just stuck with one OSD crashing >> with: >> >> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, >> epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time >> 2016-11-23 13:42:43.27 >> 8539 >> osd/PG.cc: 2911: FAILED assert(r > 0) >> >> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x85) [0xbde2c5] >> 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, >> ceph::buffer::list*)+0x8ba) [0x7cf4da] >> 3: (OSD::load_pgs()+0x9ef) [0x6bd31f] >> 4: (OSD::init()+0x181a) [0x6c0e8a] >> 5: (main()+0x29dd) [0x6484bd] >> 6: (__libc_start_main()+0xf5) [0x7f36b916bb15] >> 7: /usr/bin/ceph-osd() [0x661ea9] >> >> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> -Original Message- >> >> From: Daznis [mailto:daz...@gmail.com] >> >> Sent: 23 November 2016 10:17 >> >> To: n...@fisk.me.uk >> >> Cc: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> Hi, >> >> >> >> >> >> Looks like one of my colleagues increased the PG number before it >> >> finished. I was flushing the whole cache tier and it's currently >> >> stuck on ~80 GB of data, because of the OSD crashes. I will look into the >> >> hitset counts and check what can be done. Will provide an >> update if I find anything or fix the issue. >> > >> > So I'm guessing when the PG split, the stats/hit_sets are not how the OSD >> > is expecting them to be and causes the crash. I would >> expect this has been caused by the PG splitting rather than introducing >> extra OSD's. If you manage to get things stable by bumping up >> the hitset count, then you probably want to try and do a scrub to try and >> clean up the stats, which may then stop this happening when >> the hitset comes round to being trimmed again. 
>> > >> >> >> >> >> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> >> > Hi Daznis, >> >> > >> >> > I'm not sure how much help I can be, but I will try my best. >> >> > >> >> > I think the post-split stats error is probably benign, although I >> >> > think this suggests you also increased the number of PG's in your >> >> > cache pool? If so did you do this before or after you added the >> >> extra OSD's? This may have been the cause. >> >> > >> >> > On to the actual assert, this looks like it's part of the code >> >> > which trims the tiering hit set's. I don't understand why its >> >> > crashing out, but it must be related to an invalid or missing >> >> > hitset I would >> >> imagine. >> >> > >> >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L >> >> > 104 >> >> > 85 >> >> > >> >> > The only thing I could think of from looking at in the code is that >> >> > the function loops through all hitsets that a
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
Thank you. That helped quite a lot. Now I'm just stuck with one OSD crashing with: osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time 2016-11-23 13:42:43.27 8539 osd/PG.cc: 2911: FAILED assert(r > 0) ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5] 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x8ba) [0x7cf4da] 3: (OSD::load_pgs()+0x9ef) [0x6bd31f] 4: (OSD::init()+0x181a) [0x6c0e8a] 5: (main()+0x29dd) [0x6484bd] 6: (__libc_start_main()+0xf5) [0x7f36b916bb15] 7: /usr/bin/ceph-osd() [0x661ea9] On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> -Original Message- >> From: Daznis [mailto:daz...@gmail.com] >> Sent: 23 November 2016 10:17 >> To: n...@fisk.me.uk >> Cc: ceph-users <ceph-users@lists.ceph.com> >> Subject: Re: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> Hi, >> >> >> Looks like one of my colleagues increased the PG number before it finished. >> I was flushing the whole cache tier and it's currently stuck >> on ~80 GB of data, because of the OSD crashes. I will look into the hitset >> counts and check what can be done. Will provide an update if >> I find anything or fix the issue. > > So I'm guessing when the PG split, the stats/hit_sets are not how the OSD is > expecting them to be and causes the crash. I would expect this has been > caused by the PG splitting rather than introducing extra OSD's. If you manage > to get things stable by bumping up the hitset count, then you probably want > to try and do a scrub to try and clean up the stats, which may then stop this > happening when the hitset comes round to being trimmed again. > >> >> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <n...@fisk.me.uk> wrote: >> > Hi Daznis, >> > >> > I'm not sure how much help I can be, but I will try my best. 
>> > >> > I think the post-split stats error is probably benign, although I >> > think this suggests you also increased the number of PG's in your cache >> > pool? If so did you do this before or after you added the >> extra OSD's? This may have been the cause. >> > >> > On to the actual assert, this looks like it's part of the code which >> > trims the tiering hit set's. I don't understand why its crashing out, but >> > it must be related to an invalid or missing hitset I would >> imagine. >> > >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L104 >> > 85 >> > >> > The only thing I could think of from looking at in the code is that >> > the function loops through all hitsets that are above the max number >> > (hit_set_count). I wonder if setting this number higher would >> mean it won't try and trim any hitsets and let things recover? >> > >> > DISCLAIMER >> > This is a hunch, it might not work or could possibly even make things >> > worse. Otherwise wait for someone who has a better idea to comment. >> > >> > Nick >> > >> > >> > >> >> -Original Message- >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf >> >> Of Daznis >> >> Sent: 23 November 2016 05:57 >> >> To: ceph-users <ceph-users@lists.ceph.com> >> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> >> >> Hello, >> >> >> >> >> >> The story goes like this. >> >> I have added another 3 drives to the caching layer. OSDs were added >> >> to crush map one by one after each successful rebalance. When >> > I >> >> added the last OSD and went away for about an hour I noticed that >> >> it's still not finished rebalancing. Further investigation >> > showed me >> >> that it one of the older cache SSD was restarting like crazy before >> >> full boot. So I shut it down and waited for a rebalance >> > without that >> >> OSD. Less than an hour later I had another 2 OSD restarting like >> >> crazy. 
I tried running scrubs on the PG's logs asked me to, but >> > that did >> >> not help. I'm currently stuck with " 8 scrub errors" and a complete dead >> >> cluster. >> >> >> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) >> >> stats; must scrub befor
Re: [ceph-users] Ceph strange issue after adding a cache OSD.
Hi, Looks like one of my colleagues increased the PG number before it finished. I was flushing the whole cache tier and it's currently stuck on ~80 GB of data, because of the OSD crashes. I will look into the hitset counts and check what can be done. Will provide an update if I find anything or fix the issue. On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <n...@fisk.me.uk> wrote: > Hi Daznis, > > I'm not sure how much help I can be, but I will try my best. > > I think the post-split stats error is probably benign, although I think this > suggests you also increased the number of PG's in your > cache pool? If so did you do this before or after you added the extra OSD's? > This may have been the cause. > > On to the actual assert, this looks like it's part of the code which trims > the tiering hit set's. I don't understand why its > crashing out, but it must be related to an invalid or missing hitset I would > imagine. > > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485 > > The only thing I could think of from looking at in the code is that the > function loops through all hitsets that are above the max > number (hit_set_count). I wonder if setting this number higher would mean it > won't try and trim any hitsets and let things recover? > > DISCLAIMER > This is a hunch, it might not work or could possibly even make things worse. > Otherwise wait for someone who has a better idea to > comment. > > Nick > > > >> -Original Message- >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of >> Daznis >> Sent: 23 November 2016 05:57 >> To: ceph-users <ceph-users@lists.ceph.com> >> Subject: [ceph-users] Ceph strange issue after adding a cache OSD. >> >> Hello, >> >> >> The story goes like this. >> I have added another 3 drives to the caching layer. OSDs were added to crush >> map one by one after each successful rebalance. 
When > I >> added the last OSD and went away for about an hour I noticed that it's still >> not finished rebalancing. Further investigation > showed me >> that it one of the older cache SSD was restarting like crazy before full >> boot. So I shut it down and waited for a rebalance > without that >> OSD. Less than an hour later I had another 2 OSD restarting like crazy. I >> tried running scrubs on the PG's logs asked me to, but > that did >> not help. I'm currently stuck with " 8 scrub errors" and a complete dead >> cluster. >> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; >> must scrub before tier agent can activate >> >> >> I need help with OSD from crashing. Crash log: >> 0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1 >> osd/ReplicatedPG.cc: In function 'void >> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' >> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067 >> osd/ReplicatedPG.cc: 10521: FAILED assert(obc) >> >> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90) >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char >> const*)+0x85) [0xbde2c5] >> 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned >> int)+0x75f) [0x87e89f] >> 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb] >> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a11aa] >> 5: (ReplicatedPG::do_request(std::tr1::shared_ptr&, >> ThreadPool::TPHandle&)+0x68a) [0x83c37a] >> 6: (OSD::dequeue_op(boost::intrusive_ptr, >> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69af05] >> 7: (OSD::ShardedOpWQ::_process(unsigned int, >> ceph::heartbeat_handle_d*)+0x333) [0x69b473] >> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) >> [0xbcd9cf] >> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00] >> 10: (()+0x7dc5) [0x7f93b9df4dc5] >> 11: (clone()+0x6d) [0x7f93b88d5ced] >> NOTE: a copy of the executable, or `objdump -rdS ` is needed to >> interpret this. 
>> >> >> I have tried looking with full debug enabled, but those logs didn't help me >> much. I have tried to evict the cache layer, but some >> objects are stuck and can't be removed. Any suggestions would be greatly >> appreciated. >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
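Nick's hunch above (stop hit_set_trim() from firing by raising the hit set limit) can be tried with something like the following. The pool name and the new count are placeholders; raising hit_set_count only postpones trimming to let the cluster recover, it does not fix the underlying stats:

```shell
# Inspect the current hit set configuration on the cache pool
ceph osd pool get test-cache hit_set_count
ceph osd pool get test-cache hit_set_period

# Raise hit_set_count so no hitsets are above the limit that
# hit_set_trim() would try to trim (placeholder value)
ceph osd pool set test-cache hit_set_count 32

# Once the OSDs stay up, scrub the PGs flagged with
# invalid (post-split) stats, e.g. the one from the log
ceph pg scrub 15.8d
```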
[ceph-users] Ceph strange issue after adding a cache OSD.
Hello,

The story goes like this. I added another 3 drives to the caching layer. The OSDs were added to the crush map one by one, after each successful rebalance. When I added the last OSD and came back about an hour later, I noticed it still hadn't finished rebalancing. Further investigation showed that one of the older cache SSDs was restarting like crazy before it could fully boot. So I shut it down and waited for a rebalance without that OSD. Less than an hour later I had another 2 OSDs restarting like crazy. I tried running scrubs on the PGs the logs asked me to, but that did not help. I'm currently stuck with "8 scrub errors" and a completely dead cluster.

log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; must scrub before tier agent can activate

I need help to stop the OSD from crashing. Crash log:

0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
osd/ReplicatedPG.cc: 10521: FAILED assert(obc)

ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5]
2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e89f]
3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
4: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a) [0x8a11aa]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr&, ThreadPool::TPHandle&)+0x68a) [0x83c37a]
6: (OSD::dequeue_op(boost::intrusive_ptr, std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405) [0x69af05]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69b473]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
10: (()+0x7dc5) [0x7f93b9df4dc5]
11: (clone()+0x6d) [0x7f93b88d5ced]
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.

I have tried looking with full debug enabled, but those logs didn't help me much. I have tried to evict the cache layer, but some objects are stuck and can't be removed. Any suggestions would be greatly appreciated.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
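For the objects that are stuck and refuse to evict, it can help to check whether they still have active watchers (a running VM or mapped RBD image keeps a watch on its rbd_header object). A sketch, using an object name quoted elsewhere in this thread:

```shell
# Identify which objects remain in the cache pool
rados -p test-cache ls

# rbd_header objects that refuse to evict with EBUSY usually still
# have an active watcher; list it to find the offending client
rados -p test-cache listwatchers rbd_header.29c3cdb2ae8944a
```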
Re: [ceph-users] I/O freeze while a single node is down.
Yes that one has +2 OSD's on it.

root default {
        id -1           # do not change unnecessarily
        # weight 116.480
        alg straw
        hash 0  # rjenkins1
        item OSD-1 weight 36.400
        item OSD-2 weight 36.400
        item OSD-3 weight 43.680
}

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

On Tue, Sep 13, 2016 at 1:51 PM, Sean Redmond <sean.redmo...@gmail.com> wrote:
> Hi,
>
> The host that is taken down has 12 disks in it?
>
> Have a look at the down PG's '18 pgs down' - I suspect this will be what is
> causing the I/O freeze.
>
> Is your crush map set up correctly to split data over different hosts?
>
> Thanks
>
> On Tue, Sep 13, 2016 at 11:45 AM, Daznis <daz...@gmail.com> wrote:
>>
>> No, no errors about that. I have set noout before it happened, but it
>> still started recovery. I have added
>> nobackfill,norebalance,norecover,noscrub,nodeep-scrub once I noticed
>> it started doing crazy stuff. So recovery I/O stopped but the cluster
>> can't read any info. Only writes to cache layer.
>> >> cluster cdca2074-4c91-4047-a607-faebcbc1ee17 >> health HEALTH_WARN >> 2225 pgs degraded >> 18 pgs down >> 18 pgs peering >> 89 pgs stale >> 2225 pgs stuck degraded >> 18 pgs stuck inactive >> 89 pgs stuck stale >> 2257 pgs stuck unclean >> 2225 pgs stuck undersized >> 2225 pgs undersized >> recovery 4180820/11837906 objects degraded (35.317%) >> recovery 24016/11837906 objects misplaced (0.203%) >> 12/39 in osds are down >> noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub >> flag(s) set >> monmap e9: 7 mons at {} >> election epoch 170, quorum 0,1,2,3,4,5,6 >> osdmap e40290: 40 osds: 27 up, 39 in; 14 remapped pgs >> flags >> noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub >> pgmap v39326300: 4096 pgs, 4 pools, 21455 GB data, 5780 kobjects >> 42407 GB used, 75772 GB / 115 TB avail >> 4180820/11837906 objects degraded (35.317%) >> 24016/11837906 objects misplaced (0.203%) >> 2136 active+undersized+degraded >> 1837 active+clean >> 89 stale+active+undersized+degraded >> 18 down+peering >> 14 active+remapped >>2 active+clean+scrubbing+deep >> client io 0 B/s rd, 9509 kB/s wr, 3469 op/s >> >> On Tue, Sep 13, 2016 at 1:34 PM, M Ranga Swami Reddy >> <swamire...@gmail.com> wrote: >> > Please check if any osd is nearfull ERR. Can you please share the ceph >> > -s >> > o/p? >> > >> > Thanks >> > Swami >> > >> > On Tue, Sep 13, 2016 at 3:54 PM, Daznis <daz...@gmail.com> wrote: >> >> >> >> Hello, >> >> >> >> >> >> I have encountered a strange I/O freeze while rebooting one OSD node >> >> for maintenance purpose. It was one of the 3 Nodes in the entire >> >> cluster. Before this rebooting or shutting down and entire node just >> >> slowed down the ceph, but not completely froze it. 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
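Sean's question about splitting data over hosts gets at the crux: with `step chooseleaf firstn 0 type host` and only three host buckets under the root, a size-3 pool keeps one replica per host, so a single host outage leaves every PG undersized. A minimal Python sketch of that placement idea (host and OSD names are hypothetical, and this is a deliberately toy model of CRUSH, not the real algorithm):

```python
# Toy model of "step chooseleaf firstn 0 type host": pick at most one
# OSD per distinct up host. Host/OSD names here are hypothetical.
hosts = {
    "OSD-1": ["osd.0", "osd.1"],
    "OSD-2": ["osd.2", "osd.3"],
    "OSD-3": ["osd.4", "osd.5"],
}
pool_size = 3  # replicas per PG; at most one per host under this rule

def acting_set(up_hosts, size):
    """Mimic chooseleaf-by-host: one OSD from each up host, up to `size`."""
    return [hosts[h][0] for h in up_hosts][:size]

full = acting_set(["OSD-1", "OSD-2", "OSD-3"], pool_size)  # all hosts up
degraded = acting_set(["OSD-1", "OSD-2"], pool_size)       # OSD-3 rebooting

print(len(full), "replicas normally;", len(degraded), "with one host down")
```

With min_size 1, as in the rule above, such PGs should stay active (merely degraded); PGs going down+peering instead would be consistent with Sean's suspicion that some data was not actually split across hosts.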
Re: [ceph-users] I/O freeze while a single node is down.
No, no errors about that. I had set noout before it happened, but it
still started recovery. I added
nobackfill,norebalance,norecover,noscrub,nodeep-scrub once I noticed
it started doing crazy stuff. So recovery I/O stopped, but the cluster
can't read any data. Only writes to the cache layer go through.

    cluster cdca2074-4c91-4047-a607-faebcbc1ee17
     health HEALTH_WARN
            2225 pgs degraded
            18 pgs down
            18 pgs peering
            89 pgs stale
            2225 pgs stuck degraded
            18 pgs stuck inactive
            89 pgs stuck stale
            2257 pgs stuck unclean
            2225 pgs stuck undersized
            2225 pgs undersized
            recovery 4180820/11837906 objects degraded (35.317%)
            recovery 24016/11837906 objects misplaced (0.203%)
            12/39 in osds are down
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
     monmap e9: 7 mons at {}
            election epoch 170, quorum 0,1,2,3,4,5,6
     osdmap e40290: 40 osds: 27 up, 39 in; 14 remapped pgs
            flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub
      pgmap v39326300: 4096 pgs, 4 pools, 21455 GB data, 5780 kobjects
            42407 GB used, 75772 GB / 115 TB avail
            4180820/11837906 objects degraded (35.317%)
            24016/11837906 objects misplaced (0.203%)
                2136 active+undersized+degraded
                1837 active+clean
                  89 stale+active+undersized+degraded
                  18 down+peering
                  14 active+remapped
                   2 active+clean+scrubbing+deep
  client io 0 B/s rd, 9509 kB/s wr, 3469 op/s

On Tue, Sep 13, 2016 at 1:34 PM, M Ranga Swami Reddy
<swamire...@gmail.com> wrote:
> Please check if any OSD is in nearfull ERR state. Can you please
> share the ceph -s output?
>
> Thanks
> Swami
>
> On Tue, Sep 13, 2016 at 3:54 PM, Daznis <daz...@gmail.com> wrote:
>>
>> Hello,
>>
>> I have encountered a strange I/O freeze while rebooting one OSD node
>> for maintenance. It was one of the 3 nodes in the entire cluster.
>> Before this, rebooting or shutting down an entire node just slowed
>> Ceph down, but did not completely freeze it.
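The recovery percentages in the status output are simple ratios of the object counts in the pgmap line; a quick sanity check using the numbers reported above:

```python
# Verify the recovery percentages from the `ceph -s` output above:
# degraded/misplaced counts are reported against the total object count.
degraded_objects = 4_180_820
misplaced_objects = 24_016
total_objects = 11_837_906

degraded_pct = 100 * degraded_objects / total_objects
misplaced_pct = 100 * misplaced_objects / total_objects

print(f"{degraded_pct:.3f}% degraded")    # matches the reported 35.317%
print(f"{misplaced_pct:.3f}% misplaced")  # matches the reported 0.203%
```

With 12 of 39 in OSDs down (roughly a third of the cluster) and size-3 replication across three hosts, a degraded figure near one third of all objects is exactly what one would expect here.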
[ceph-users] I/O freeze while a single node is down.
Hello,

I have encountered a strange I/O freeze while rebooting one OSD node
for maintenance. It was one of the 3 nodes in the entire cluster.
Before this, rebooting or shutting down an entire node just slowed
Ceph down, but did not completely freeze it.