[ceph-users] Ceph mount error and mds laggy
Hi Team,

We are having an issue mounting CephFS: it is throwing "mount error 5 = Input/output error", and the MDS ceph-zstorage1 is reported as laggy. Kindly help us with this issue.

    cluster a8c92ae6-6842-4fa2-bfc9-8cdefd28df5c
     health HEALTH_WARN
            mds ceph-zstorage1 is laggy
            mds0: Client 192.168.106.109 failing to respond to capability release
            mds0: Client 192.168.107.102 failing to respond to cache pressure
            mds0: Client 192.168.107.242 failing to respond to cache pressure
            mds0: Client 192.168.106.109 failing to respond to cache pressure
            mds0: Client 192.168.106.145 failing to respond to cache pressure
            mds0: Client ceph-zclient1.zoholabs.com failing to respond to cache pressure
     monmap e1: 3 mons at {ceph-zadmin=192.168.107.155:6789/0,ceph-zmonitor=192.168.107.247:6789/0,ceph-zmonitor1=192.168.107.246:6789/0}
            election epoch 16, quorum 0,1,2 ceph-zadmin,ceph-zmonitor1,ceph-zmonitor
     mdsmap e336584: 1/1/1 up {0=ceph-zstorage1=up:active(laggy or crashed)}
     osdmap e4892: 3 osds: 3 up, 3 in
      pgmap v19533586: 384 pgs, 3 pools, 989 GB data, 13738 kobjects
            2177 GB used, 4326 GB / 6503 GB avail
                 384 active+clean

While mounting:

    mount -t ceph 192.168.107.155:6789,192.168.107.247:6789,192.168.107.246:6789:/ /home/sas/cide/ -o name=admin,secretfile=/etc/ceph/admin.secret
    mount error 5 = Input/output error

Regards,
Prabu GJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Jewel (10.2.7) osd suicide timeout while deep-scrub
Yes, you can set it on the one node. That configuration is for an entirely internal system and can mismatch across OSDs without trouble.

On Tue, Aug 15, 2017 at 4:25 PM Andreas Calminder <andreas.calmin...@klarna.com> wrote:
> Thanks, I'll try and do that. Since I'm running a cluster with
> multiple nodes, do I have to set this in ceph.conf on all nodes or
> does it suffice with just the node with that particular osd?
>
> [earlier quoted messages trimmed]
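For reference, a minimal sketch of the change Greg describes. The section name and value below are taken from the thread (osd.34, 300 seconds); since the option is "unchangeable" at runtime, it has to go into ceph.conf and the OSD must be restarted to pick it up:

```ini
; /etc/ceph/ceph.conf on the node hosting osd.34 only -- per Greg's note,
; this option is internal and may safely differ across OSDs
[osd.34]
osd_op_thread_suicide_timeout = 300
```

After editing, restart that one OSD daemon so the value is applied at boot.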
Re: [ceph-users] Jewel (10.2.7) osd suicide timeout while deep-scrub
Thanks, I'll try and do that. Since I'm running a cluster with multiple nodes, do I have to set this in ceph.conf on all nodes or does it suffice with just the node with that particular osd?

On 15 August 2017 at 22:51, Gregory Farnum wrote:
> You can set that option in ceph.conf. It's "unchangeable" because it's used
> to initialize some other structures at boot so you can't edit it live.
>
> [earlier quoted message trimmed]
Re: [ceph-users] v12.1.4 Luminous (RC) released
On Tue, Aug 15, 2017 at 2:05 PM, Abhishek wrote:
> This is the fifth release candidate for Luminous, the next long term
> stable release. We’ve had to do this release as there was a bug in
> the previous RC, which affected upgrades to Luminous.[1]

In particular, this will fix things for those of you who upgraded from Jewel or a previous RC and saw OSDs crash instantly on boot. We had an oversight in dealing with another bug. (Standard disclaimer: this was a logic error that resulted in no data changes. There were no durability implications — not that that helps much when you can't read your data out again.)

Sorry guys!
-Greg

> [rest of the announcement trimmed]
Re: [ceph-users] cluster unavailable for 20 mins when downed server was reintroduced
Sounds like you've got a few different things happening here.

On Tue, Aug 15, 2017 at 4:23 AM Sean Purdy wrote:
> Luminous 12.1.1 rc1
>
> Hi,
>
> I have a three node cluster with 6 OSDs and 1 mon per node.
>
> I had to turn off one node for rack reasons. While the node was down, the
> cluster was still running and accepting files via radosgw. However, when I
> turned the machine back on, radosgw uploads stopped working and things like
> "ceph status" started timing out. It took 20 minutes for "ceph status" to
> be OK.
>
> In the recent past I've rebooted one or other node and the cluster kept
> working, and when the machine came back, the OSDs and monitor rejoined the
> cluster and things went on as usual.
>
> The machine was off for 21 hours or so.
>
> Any idea what might be happening, and how to mitigate the effects of this
> next time a machine has to be down for any length of time?
>
> "ceph status" said:
>
> 2017-08-15 11:28:29.835943 7fdf2d74b700 0 monclient(hunting): authenticate timed out after 300
> 2017-08-15 11:28:29.835993 7fdf2d74b700 0 librados: client.admin authentication error (110) Connection timed out

That just means the client couldn't connect to an in-quorum monitor. It should have tried them all in sequence though — did you check if you had *any* functioning quorum?

> monitor log said things like this before everything came together:
>
> 2017-08-15 11:23:07.180123 7f11c0fcc700 0 -- 172.16.0.43:0/2471 >> 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER

This one's odd. We did get one report of seeing something like that, but I tend to think it's a clock sync issue.

> but "ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok quorum_status" did work.

This monitor node was detected but not yet in quorum.

> OSDs had 15 minutes of
>
> ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No such file or directory

And that would appear to be something happening underneath Ceph, wherein your data wasn't actually all the way mounted or something? Anyway, it should have survived that transition without any noticeable impact (unless you are running so close to capacity that merely getting the downed node up-to-date overwhelmed your disks/cpu). But without some basic information about what the cluster as a whole was doing I couldn't speculate.
-Greg
[ceph-users] v12.1.4 Luminous (RC) released
This is the fifth release candidate for Luminous, the next long term stable release. We’ve had to do this release as there was a bug in the previous RC, which affected upgrades to Luminous.[1]

Please note that this is still a *release candidate* and not the final release; we're expecting the final Luminous release in a week's time. Meanwhile, testing and feedback is very much welcome.

Ceph Luminous (v12.2.0) will be the foundation for the next long-term stable release series. There have been major changes since Kraken (v11.2.z) and Jewel (v10.2.z), and the upgrade process is non-trivial. Please read these release notes carefully.

Full details and changelog at http://ceph.com/releases/v12-1-4-luminous-rc-released/

Notable Changes from 12.1.3
---
* core: Wip 20985 divergent handling luminous (issue#20985, pr#17001, Greg Farnum)
* qa/tasks/thrashosds-health.yaml: ignore MON_DOWN (issue#20910, pr#17003, Sage Weil)
* crush, mon: fix weight set vs crush device classes (issue#20939, Sage Weil)

Getting Ceph

* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-12.1.4.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* For ceph-deploy, see http://docs.ceph.com/docs/master/install/install-ceph-deploy
* Release sha1: a5f84b37668fc8e03165aaf5cbb380c78e4deba4

[1]: http://tracker.ceph.com/issues/20985

Best Regards
Abhishek
Re: [ceph-users] Jewel (10.2.7) osd suicide timeout while deep-scrub
On Tue, Aug 15, 2017 at 7:03 AM Andreas Calminder <andreas.calmin...@klarna.com> wrote:
> [original message trimmed]
>
> The cluster is left with osd.34 flapping. Is there any way to let the
> deep-scrub finish and get out of the infinite deep-scrub loop?

You can set that option in ceph.conf. It's "unchangeable" because it's used to initialize some other structures at boot so you can't edit it live.

> Regards,
> Andreas
[ceph-users] error: cluster_uuid file exists with value
Hi,

After adding a new monitor to the cluster I'm getting a strange error:

    vdicnode02/store.db/MANIFEST-86 succeeded,manifest_file_number is 86, next_file_number is 88, last_sequence is 8, log_number is 0,prev_log_number is 0,max_column_family is 0
    2017-08-15 22:00:58.832599 7f6791187e40 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/rocksdb/db/version_set.cc:2867] Column family [default] (ID 0), log number is 85
    2017-08-15 22:00:58.832699 7f6791187e40 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1502827258832681, "job": 1, "event": "recovery_started", "log_files": [87]}
    2017-08-15 22:00:58.832726 7f6791187e40 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/rocksdb/db/db_impl_open.cc:482] Recovering log #87 mode 2
    2017-08-15 22:00:58.832887 7f6791187e40 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/rocksdb/db/version_set.cc:2395] Creating manifest 89
    2017-08-15 22:00:58.850503 7f6791187e40 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1502827258850484, "job": 1, "event": "recovery_finished"}
    2017-08-15 22:00:58.852552 7f6791187e40 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/rocksdb/db/db_impl_open.cc:1063] DB pointer 0x7f679c4bc000
    2017-08-15 22:00:58.853155 7f6791187e40 0 starting mon.vdicnode02 rank 1 at public addr 192.168.100.102:6789/0 at bind addr 192.168.100.102:6789/0 mon_data /var/lib/ceph/mon/ceph-vdicnode02 fsid 61881df3-1365-4139-a586-92b5eca9cf18
    2017-08-15 22:00:58.853329 7f6791187e40 0 starting mon.vdicnode02 rank 1 at 192.168.100.102:6789/0 mon_data /var/lib/ceph/mon/ceph-vdicnode02 fsid 61881df3-1365-4139-a586-92b5eca9cf18
    2017-08-15 22:00:58.853685 7f6791187e40 1 mon.vdicnode02@-1(probing) e1 preinit fsid 61881df3-1365-4139-a586-92b5eca9cf18
    *2017-08-15 22:00:58.853759 7f6791187e40 -1 mon.vdicnode02@-1(probing) e1 error: cluster_uuid file exists with value d6b54a37-1cbe-483a-94c0-703e072aa6fd, != our uuid 61881df3-1365-4139-a586-92b5eca9cf18*
    2017-08-15 22:00:58.853821 7f6791187e40 -1 failed to initialize

Has anybody experienced the same issue or been able to fix it?

[root@vdicnode02 ceph]# cat /etc/ceph/ceph.conf
[global]
fsid = 61881df3-1365-4139-a586-92b5eca9cf18
public_network = 192.168.100.0/24
cluster_network = 192.168.100.0/24
mon_initial_members = vdicnode01,vdicnode02
mon_host = 192.168.100.101,192.168.100.102
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 2
rbd_default_format = 2
rbd_cache = false

[mon.vdicnode01]
host = vdicnode01
addr = 192.168.100.101:6789

[mon.vdicnode02]
host = vdicnode02
addr = 192.168.100.102:6789

Thanks a lot
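The failing line in the log above is the monitor comparing the `cluster_uuid` file left over in its data directory against the fsid it was configured with. A rough, hypothetical sketch of that comparison (this is not the actual monitor code; the file name and error wording are taken from the log message):

```c
#include <stdio.h>
#include <string.h>

/* Compare the uuid stored in the mon data dir ("cluster_uuid") with the
 * fsid from ceph.conf. A mismatch means the data directory belongs to a
 * different cluster and must not be reused. Returns 0 on OK, -1 on refusal. */
static int check_cluster_uuid(const char *path, const char *our_fsid) {
    char stored[64] = {0};
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;                            /* no file yet: first start, OK */
    if (fgets(stored, sizeof(stored), f) == NULL)
        stored[0] = '\0';
    fclose(f);
    stored[strcspn(stored, "\r\n")] = '\0';  /* trim trailing newline */
    if (strcmp(stored, our_fsid) != 0) {
        fprintf(stderr,
                "error: cluster_uuid file exists with value %s, != our uuid %s\n",
                stored, our_fsid);
        return -1;                           /* refuse to initialize */
    }
    return 0;
}
```

In the posted case the stored uuid (d6b54a37-...) differs from the fsid in ceph.conf (61881df3-...), so the monitor refuses to start. The usual fix is to wipe and re-create that monitor's data directory rather than edit the file by hand.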
Re: [ceph-users] Jewel (10.2.7) osd suicide timeout while deep-scrub
I am not sure, but perhaps setting nodown/noout could help it finish?
- Mehmet

On 15 August 2017 at 16:01:57 MESZ, Andreas Calminder wrote:
> [original message trimmed]
Re: [ceph-users] Two mons
Thanks a lot Greg, nice to hear!

2017-08-15 21:17 GMT+02:00 Gregory Farnum:
> Just a note: even numbers of monitors mean that it's easier to lose
> quorums, but they cannot create a classic split brain (in which two sets of
> monitors both think they are in charge). We work very hard to avoid that
> situation ever arising in RADOS. :)
> -Greg
>
> [earlier quoted messages trimmed]
Re: [ceph-users] Two mons
On Tue, Aug 15, 2017 at 10:28 AM David Turner wrote:
> There is nothing that will stop you from having an even number of mons
> (including 2). You just run the chance of getting into a split brain
> scenario.

Just a note: even numbers of monitors mean that it's easier to lose quorums, but they cannot create a classic split brain (in which two sets of monitors both think they are in charge). We work very hard to avoid that situation ever arising in RADOS. :)
-Greg

> [rest of the quoted messages trimmed]
Re: [ceph-users] Two mons
Hi David,

Thanks a lot for your quick response...

*What are you doing that only allows you to add one at a time?*

I'm trying to create a script for adding/removing a mon in my environment --> I want to execute it from a simple web page...

Thanks a lot!

2017-08-15 19:26 GMT+02:00 David Turner:
> [quoted messages trimmed]
Re: [ceph-users] Two mons
There is nothing that will stop you from having an even number of mons (including 2). You just run the chance of getting into a split brain scenario. As long as you aren't planning to stay in that scenario, I don't see a problem with it. I have 3 mons in my home cluster and I've had to remove one before, leaving me with 2 for a few hours while I re-provisioned the third, and nothing funky happened.

Most ways to deploy a cluster allow you to create the cluster with 3+ mons at the same time (initial_mons). What are you doing that only allows you to add one at a time?

On Tue, Aug 15, 2017 at 12:22 PM Oscar Segarra wrote:
> [original message trimmed]
[ceph-users] Two mons
Hi,

I'd like to test and script the process of adding monitors, adding them one by one to the ceph infrastructure.

Is it possible to have two mons running on two servers (one mon each)? --> I can assume that mon quorum won't be reached until both servers are up.

Is this right?

I have not been able to find any documentation about the behaviour of the system with just two monitors (or an even number of them).

Thanks a lot.
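For what it's worth, the quorum rule behind the answers in this thread is plain majority voting; a small sketch of the arithmetic (nothing Ceph-specific, just integer math) shows why two mons add no failure tolerance over one:

```c
/* Monitors form a quorum when a strict majority of them are up. */
static int quorum_needed(int mons) { return mons / 2 + 1; }

/* How many monitors may fail while the rest can still form a quorum.
 * e.g. 1 mon -> need 1, tolerate 0;  2 -> need 2, tolerate 0;
 *      3 -> need 2, tolerate 1;      5 -> need 3, tolerate 2. */
static int tolerated_failures(int mons) { return mons - quorum_needed(mons); }
```

With 2 mons a majority is 2, so, as Oscar assumed, quorum is only reached when both servers are up: the same failure tolerance (zero) as a single mon.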
[ceph-users] Atomic object replacement with libradosstriper
Hello, Ceph users,

I would like to use RADOS as an object storage (I have written about it to this list a while ago), and I would like to use libradosstriper with C, as has been suggested to me here.

My question is: when writing an object, is it possible to do it so that either the old version as a whole or the new version as a whole is visible to readers at all times? Also, when creating a new object, only the fully written new object should be visible. Is it possible to do this with libradosstriper?

With a POSIX filesystem, one would do write(tmpfile)+fsync()+rename() to achieve similar results.

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server. --pboddie at LWN <
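I can't speak for libradosstriper itself (as far as I know its writes are striped across several RADOS objects, so whole-object atomicity would need an extra versioning or locking layer on top), but for reference, the POSIX pattern Yenya mentions, written out as a minimal sketch:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace `path` with `data`: readers see either the old
 * content or the new content in full, never a partial write. */
static int atomic_replace(const char *path, const char *data, size_t len) {
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||  /* write the new version */
        fsync(fd) != 0) {                        /* flush before rename */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);  /* atomic swap into place */
}
```

The atomicity comes entirely from rename(), which POSIX guarantees replaces the target in a single step; the fsync() just ensures the new bytes are durable before the swap.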
Re: [ceph-users] ceph Cluster attempt to access beyond end of device
Hi Hauke,

It's possibly the XFS issue as discussed in the previous thread. I also saw this issue in some JBOD setups, running with RHEL 7.3.

Sincerely,
Yuan

On Tue, Aug 15, 2017 at 7:38 PM, Hauke Homburg wrote:
> Hello,
>
> I found some errors in the cluster with dmesg -T:
>
> attempt to access beyond end of device
>
> I found the following post:
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html
>
> Is this a problem with the size of the filesystem itself or "only" a
> driver bug? I ask because we have 8 HDDs in each node with a hardware
> RAID 6 running. In this RAID we have the XFS partition.
>
> Also we have one big filesystem in 1 OSD in each server instead of 1
> filesystem per HDD at 8 HDDs in each server.
>
> greetings
>
> Hauke
>
> --
> www.w3-creative.de
> www.westchat.de
[ceph-users] Jewel (10.2.7) osd suicide timeout while deep-scrub
Hi,

I got hit with osd suicide timeouts while deep-scrub runs on a specific pg. There's a RH article (https://access.redhat.com/solutions/2127471) suggesting changing 'osd_scrub_thread_suicide_timeout' from 60s to a higher value. Problem is, the article is for Hammer, and osd_scrub_thread_suicide_timeout doesn't exist when running

    ceph daemon osd.34 config show

and the default timeout (60s) suggested in the article doesn't really match the suicide timeout in the logs:

    2017-08-15 15:39:37.512216 7fb293137700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb231adf700' had suicide timed out after 150
    2017-08-15 15:39:37.518543 7fb293137700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fb293137700 time 2017-08-15 15:39:37.512230
    common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

The suicide timeout (150) does match osd_op_thread_suicide_timeout; however, when I try changing this I get:

    ceph daemon osd.34 config set osd_op_thread_suicide_timeout 300
    {
        "success": "osd_op_thread_suicide_timeout = '300' (unchangeable) "
    }

And the deep scrub will suicide timeout after 150 seconds, just like before.

The cluster is left with osd.34 flapping. Is there any way to let the deep-scrub finish and get out of the infinite deep-scrub loop?

Regards,
Andreas
Re: [ceph-users] Luminous OSD startup errors
On 2017-08-15 15:38, Andras Pataki wrote: Thanks for the quick response and the pointer. The dev build fixed the issue. Andras On 08/15/2017 09:19 AM, Jason Dillaman wrote: I believe this is a known issue [1] and that there will potentially be a new 12.1.4 RC released because of it. The tracker ticket has a link to a set of development packages that should resolve the issue in the meantime. We've just started building packages for 12.1.4, so we should be able to get this out of the door soon Best, Abhishek [1] http://tracker.ceph.com/issues/20985 On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki wrote: After upgrading to the latest Luminous RC (12.1.3), all our OSD's are crashing with the following assert: 0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set >*, bool) [with missing_type = pg_missing_set; std::ostringstream = std::basic_ostringstream]' thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: 1301: FAILED assert(force_rebuild_missing) ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55d0f2be3b50] 2: (void PGLog::read_log_and_missing >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set&, bool, std::basic_ostringstream, std::allocator >&, bool, bool*, DoutPrefixProvider const*, std::setstd::less, std::allocator >*, 
bool)+0x773) [0x55d0f276f013] 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) [0x55d0f272739b] 4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea] 5: (OSD::init()+0x2179) [0x55d0f268c319] 6: (main()+0x2def) [0x55d0f2591ccf] 7: (__libc_start_main()+0xf5) [0x7f9b727d6b35] 8: (()+0x4ac006) [0x55d0f2630006] Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in read_log_missing) was: if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; has_divergent_priors = true; debug_verify_stored_missing = false; to if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; assert(force_rebuild_missing); debug_verify_stored_missing = false; and it seems like force_rebuild_missing is not being set. This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3. So it seems something didn't happen correctly during the upgrade. Any ideas how to fix it? Andras ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Luminous OSD startup errors
Thanks for the quick response and the pointer. The dev build fixed the issue. Andras On 08/15/2017 09:19 AM, Jason Dillaman wrote: I believe this is a known issue [1] and that there will potentially be a new 12.1.4 RC released because of it. The tracker ticket has a link to a set of development packages that should resolve the issue in the meantime. [1] http://tracker.ceph.com/issues/20985 On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki wrote: After upgrading to the latest Luminous RC (12.1.3), all our OSD's are crashing with the following assert: 0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set >*, bool) [with missing_type = pg_missing_set; std::ostringstream = std::basic_ostringstream]' thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: 1301: FAILED assert(force_rebuild_missing) ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55d0f2be3b50] 2: (void PGLog::read_log_and_missing >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set&, bool, std::basic_ostringstream, std::allocator >&, bool, bool*, DoutPrefixProvider const*, std::set, std::allocator >*, bool)+0x773) [0x55d0f276f013] 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) [0x55d0f272739b] 4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea] 5: 
(OSD::init()+0x2179) [0x55d0f268c319] 6: (main()+0x2def) [0x55d0f2591ccf] 7: (__libc_start_main()+0xf5) [0x7f9b727d6b35] 8: (()+0x4ac006) [0x55d0f2630006] Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in read_log_missing) was: if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; has_divergent_priors = true; debug_verify_stored_missing = false; to if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; assert(force_rebuild_missing); debug_verify_stored_missing = false; and it seems like force_rebuild_missing is not being set. This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3. So it seems something didn't happen correctly during the upgrade. Any ideas how to fix it? Andras ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version support object-map feature from rbd kernel client
I don't think so. I tested with kernel-4.10.17-1-pve, which is the Proxmox 5 kernel, and that one didn't have object-map support; I had to disable the feature on the rbd image in order for the krbd module to deal with it and not complain about features.

Thanks

On Tue, Aug 15, 2017 at 9:25 AM, David Turner wrote:
> I thought that object-map, introduced with Jewel, was included with the
> 4.9 kernel and every kernel since then.
>
> On Tue, Aug 15, 2017, 7:26 AM Shinobu Kinjo wrote:
>
>> It would be much better to explain why as of today, object-map feature
>> is not supported by the kernel client, or document it.
>>
>> On Tue, Aug 15, 2017 at 8:08 PM, Ilya Dryomov wrote:
>> > On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah wrote:
>> >> Hi All,
>> >>
>> >> I have search everywhere for some sort of table that show kernel version to
>> >> what rbd image features supported and didnt find any.
>> >>
>> >> basically I am looking at latest kernels from kernel.org , and i am thinking
>> >> of upgrading to 4.12 since it is stable but i want to make sure i can get
>> >> rbd images with object-map features working with rbd.ko
>> >>
>> >> if anyone know please let me know what kernel version i have to upgrade to
>> >> to get that feature supported by kernel client
>> >
>> > As of today, object-map feature is not supported by the kernel client.
>> >
>> > Thanks,
>> >
>> > Ilya
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version support object-map feature from rbd kernel client
I believe you are thinking of the "exclusive-lock" feature which has been supported since kernel v4.9. The latest kernel only supports layering, exclusive-lock, and data-pool features. There is also support for tolerating the striping feature when it's (erroneously) enabled on an image but doesn't actually use fancy striping (i.e. it works when the stripe unit is the object size and the stripe count is one). On Tue, Aug 15, 2017 at 9:25 AM, David Turner wrote: > I thought that object-map, introduced with Jewel, was included with the 4.9 > kernel and every kernel since then. > > > On Tue, Aug 15, 2017, 7:26 AM Shinobu Kinjo wrote: >> >> It would be much better to explain why as of today, object-map feature >> is not supported by the kernel client, or document it. >> >> On Tue, Aug 15, 2017 at 8:08 PM, Ilya Dryomov wrote: >> > On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah >> > wrote: >> >> Hi All, >> >> >> >> I have search everywhere for some sort of table that show kernel >> >> version to >> >> what rbd image features supported and didnt find any. >> >> >> >> basically I am looking at latest kernels from kernel.org , and i am >> >> thinking >> >> of upgrading to 4.12 since it is stable but i want to make sure i can >> >> get >> >> rbd images with object-map features working with rbd.ko >> >> >> >> if anyone know please let me know what kernel version i have to upgrade >> >> to >> >> to get that feature supported by kernel client >> > >> > As of today, object-map feature is not supported by the kernel client. 
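Since an image's "features" field is a bitmask, one way to see what is enabled is to decode it by hand. A minimal sketch, assuming the standard librbd bit values and a hypothetical features value (get the real one from `rbd info <pool>/<image> --format json` on your cluster):

```shell
# Decode an RBD image's "features" bitmask by hand.
# 61 is a hypothetical example value, not taken from this thread.
# Bit values as defined by librbd: layering=1, striping=2, exclusive-lock=4,
# object-map=8, fast-diff=16, deep-flatten=32, journaling=64.
features=61
names=""
[ $((features & 1))  -ne 0 ] && names="$names layering"
[ $((features & 2))  -ne 0 ] && names="$names striping"
[ $((features & 4))  -ne 0 ] && names="$names exclusive-lock"
[ $((features & 8))  -ne 0 ] && names="$names object-map"
[ $((features & 16)) -ne 0 ] && names="$names fast-diff"
[ $((features & 32)) -ne 0 ] && names="$names deep-flatten"
[ $((features & 64)) -ne 0 ] && names="$names journaling"
echo "enabled:$names"
# prints: enabled: layering exclusive-lock object-map fast-diff deep-flatten
```

To make such an image mappable with krbd, the features Ilya lists as unsupported can be turned off on the image, e.g. `rbd feature disable <pool>/<image> object-map fast-diff deep-flatten` (fast-diff depends on object-map, so they have to be disabled together).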
>> > >> > Thanks, >> > >> > Ilya >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version support object-map feature from rbd kernel client
I thought that object-map, introduced with Jewel, was included with the 4.9 kernel and every kernel since then. On Tue, Aug 15, 2017, 7:26 AM Shinobu Kinjo wrote: > It would be much better to explain why as of today, object-map feature > is not supported by the kernel client, or document it. > > On Tue, Aug 15, 2017 at 8:08 PM, Ilya Dryomov wrote: > > On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah > wrote: > >> Hi All, > >> > >> I have search everywhere for some sort of table that show kernel > version to > >> what rbd image features supported and didnt find any. > >> > >> basically I am looking at latest kernels from kernel.org , and i am > thinking > >> of upgrading to 4.12 since it is stable but i want to make sure i can > get > >> rbd images with object-map features working with rbd.ko > >> > >> if anyone know please let me know what kernel version i have to upgrade > to > >> to get that feature supported by kernel client > > > > As of today, object-map feature is not supported by the kernel client. > > > > Thanks, > > > > Ilya > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph Cluster attempt to access beyond end of device
The error found in that thread, IIRC, is that the block size of the disk does not match the block size of the FS, so it tries to access the rest of a block past the end of the disk. I also remember that the error didn't cause any problems.

Why RAID 6? While rebuilding the array after a dead drive, your cluster would see worse degraded performance than if you had individual OSDs and simply lost a drive. I suppose you avoid the cluster ever seeing degraded objects/PGs, so if that is what your use case needs, it makes sense. From a Ceph architecture standpoint, it doesn't.

On Tue, Aug 15, 2017, 5:39 AM Hauke Homburg wrote:
> Hello,
>
>
> I found some errors in the cluster with dmesg -T:
>
> attempt to access beyond end of device
>
> I found the following post:
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html
>
> Is this a problem with the size of the filesystem itself, or "only"
> a driver bug? I ask because in each node we have 8 HDDs running on a
> hardware RAID 6. On this RAID we have the XFS partition.
>
> So we have one big filesystem in 1 OSD in each server, instead of 1
> filesystem per HDD across the 8 HDDs in each server.
>
> greetings
>
> Hauke
>
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
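For what it's worth, the block-size mismatch described above can be checked directly on the node; a minimal sketch, where the device and mountpoint names are placeholders for your RAID device and the OSD's XFS mount:

```shell
# Compare the device's sector sizes with what the filesystem was formatted
# with. /dev/sda and /var/lib/ceph/osd/ceph-0 are placeholders.
blockdev --getss --getpbsz /dev/sda                      # logical / physical sector size
xfs_info /var/lib/ceph/osd/ceph-0 | grep -E 'sectsz|bsize'   # sizes mkfs.xfs used
```

If the filesystem's sector size (`sectsz`) is larger than the device reports, accesses at the tail of the device can run past its end, which matches the dmesg message quoted in this thread.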
Re: [ceph-users] Luminous OSD startup errors
I believe this is a known issue [1] and that there will potentially be a new 12.1.4 RC released because of it. The tracker ticket has a link to a set of development packages that should resolve the issue in the meantime. [1] http://tracker.ceph.com/issues/20985 On Tue, Aug 15, 2017 at 9:08 AM, Andras Pataki wrote: > After upgrading to the latest Luminous RC (12.1.3), all our OSD's are > crashing with the following assert: > > 0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: > In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, > coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, > bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, > std::set >*, bool) [with missing_type = > pg_missing_set; std::ostringstream = std::basic_ostringstream]' > thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367 > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: > 1301: FAILED assert(force_rebuild_missing) > > ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous > (rc) > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char > const*)+0x110) [0x55d0f2be3b50] > 2: (void PGLog::read_log_and_missing >(ObjectStore*, > coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, > pg_missing_set&, bool, std::basic_ostringstream std::char_traits, std::allocator >&, bool, bool*, > DoutPrefixProvider const*, std::set, > std::allocator >*, bool)+0x773) [0x55d0f276f013] > 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) > [0x55d0f272739b] > 4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea] > 5: (OSD::init()+0x2179) [0x55d0f268c319] > 6: (main()+0x2def) [0x55d0f2591ccf] 
> 7: (__libc_start_main()+0xf5) [0x7f9b727d6b35] > 8: (()+0x4ac006) [0x55d0f2630006] > > Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in > read_log_missing) was: > > if (p->key() == "divergent_priors") { > ::decode(divergent_priors, bp); > ldpp_dout(dpp, 20) << "read_log_and_missing " << > divergent_priors.size() > << " divergent_priors" << dendl; > has_divergent_priors = true; > debug_verify_stored_missing = false; > > to > > if (p->key() == "divergent_priors") { > ::decode(divergent_priors, bp); > ldpp_dout(dpp, 20) << "read_log_and_missing " << > divergent_priors.size() > << " divergent_priors" << dendl; > assert(force_rebuild_missing); > debug_verify_stored_missing = false; > > and it seems like force_rebuild_missing is not being set. > > This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3. > So it seems something didn't happen correctly during the upgrade. Any ideas > how to fix it? > > Andras > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > -- Jason ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Luminous OSD startup errors
After upgrading to the latest Luminous RC (12.1.3), all our OSD's are crashing with the following assert: 0> 2017-08-15 08:28:49.479238 7f9b7615cd00 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, bool, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set >*, bool) [with missing_type = pg_missing_set; std::ostringstream = std::basic_ostringstream]' thread 7f9b7615cd00 time 2017-08-15 08:28:49.477367 */home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.3/rpm/el7/BUILD/ceph-12.1.3/src/osd/PGLog.h: 1301: FAILED assert(force_rebuild_missing)* ceph version 12.1.3 (c56d9c07b342c08419bbc18dcf2a4c5fae62b9cf) luminous (rc) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x55d0f2be3b50] 2: (void PGLog::read_log_and_missing >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set&, bool, std::basic_ostringstream, std::allocator >&, bool, bool*, DoutPrefixProvider const*, std::set, std::allocator >*, bool)+0x773) [0x55d0f276f013] 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x52b) [0x55d0f272739b] 4: (OSD::load_pgs()+0x97a) [0x55d0f2673dea] 5: (OSD::init()+0x2179) [0x55d0f268c319] 6: (main()+0x2def) [0x55d0f2591ccf] 7: (__libc_start_main()+0xf5) [0x7f9b727d6b35] 8: (()+0x4ac006) [0x55d0f2630006] Looking at the code in PGLog.h, the change from 12.1.2 to 12.1.3 (in read_log_missing) was: if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; has_divergent_priors = true; 
debug_verify_stored_missing = false; to if (p->key() == "divergent_priors") { ::decode(divergent_priors, bp); ldpp_dout(dpp, 20) << "read_log_and_missing " << divergent_priors.size() << " divergent_priors" << dendl; assert(force_rebuild_missing); debug_verify_stored_missing = false; and it seems like force_rebuild_missing is not being set. This cluster was upgraded from Jewel to 12.1.1, then 12.1.2 and now 12.1.3. So it seems something didn't happen correctly during the upgrade. Any ideas how to fix it? Andras ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version support object-map feature from rbd kernel client
It would be much better to explain why as of today, object-map feature is not supported by the kernel client, or document it. On Tue, Aug 15, 2017 at 8:08 PM, Ilya Dryomov wrote: > On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah wrote: >> Hi All, >> >> I have search everywhere for some sort of table that show kernel version to >> what rbd image features supported and didnt find any. >> >> basically I am looking at latest kernels from kernel.org , and i am thinking >> of upgrading to 4.12 since it is stable but i want to make sure i can get >> rbd images with object-map features working with rbd.ko >> >> if anyone know please let me know what kernel version i have to upgrade to >> to get that feature supported by kernel client > > As of today, object-map feature is not supported by the kernel client. > > Thanks, > > Ilya > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cluster unavailable for 20 mins when downed server was reintroduced
Luminous 12.1.1 rc1

Hi,

I have a three node cluster with 6 OSDs and 1 mon per node. I had to turn off one node for rack reasons. While the node was down, the cluster was still running and accepting files via radosgw. However, when I turned the machine back on, radosgw uploads stopped working and things like "ceph status" started timing out. It took 20 minutes for "ceph status" to be OK.

In the recent past I've rebooted one or other node and the cluster kept working, and when the machine came back, the OSDs and monitor rejoined the cluster and things went on as usual. The machine was off for 21 hours or so.

Any idea what might be happening, and how to mitigate the effects of this next time a machine has to be down for any length of time?

"ceph status" said:

2017-08-15 11:28:29.835943 7fdf2d74b700 0 monclient(hunting): authenticate timed out after 300
2017-08-15 11:28:29.835993 7fdf2d74b700 0 librados: client.admin authentication error (110) Connection timed out

The monitor log said things like this before everything came together:

2017-08-15 11:23:07.180123 7f11c0fcc700 0 -- 172.16.0.43:0/2471 >> 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER

but "ceph --admin-daemon /var/run/ceph/ceph-mon.xxx.asok quorum_status" did work. This monitor node was detected but not yet in quorum.

The OSDs had 15 minutes of

ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No such file or directory

before becoming available.

Advice welcome. Thanks,

Sean Purdy

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
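Next time this happens, the mon admin socket (which, as noted above, keeps working while `ceph status` hangs) can be used to watch the node rejoin; a sketch, assuming the default socket path and that the mon id matches the short hostname:

```shell
# Ask the local monitor directly over its admin socket; this bypasses the
# normal client path that was timing out above.
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status

# BADAUTHORIZER from a node that was down for many hours is often a stale
# cephx ticket or clock-skew symptom, so checking monitor time sync is cheap:
ceph time-sync-status
```

Comparing `mon_status` output across the three nodes shows whether the returning mon is stuck probing/synchronizing rather than in quorum.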
Re: [ceph-users] which kernel version support object-map feature from rbd kernel client
On Tue, Aug 15, 2017 at 11:34 AM, moftah moftah wrote: > Hi All, > > I have search everywhere for some sort of table that show kernel version to > what rbd image features supported and didnt find any. > > basically I am looking at latest kernels from kernel.org , and i am thinking > of upgrading to 4.12 since it is stable but i want to make sure i can get > rbd images with object-map features working with rbd.ko > > if anyone know please let me know what kernel version i have to upgrade to > to get that feature supported by kernel client As of today, object-map feature is not supported by the kernel client. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Jewel -> Luminous on Debian 9.1
Dajka Tamás writes: > Dear All, > > > > I'm trying to upgrade our env. from Jewel to the latest RC. Packages are > installed (latest 12.1.3), but I'm unable to install the mgr. I've tried the > following (nodes in cluster are from 03-05, 03 is the admin node): > > > > root@stornode03:/etc/ceph# ceph-deploy -v mgr create stornode03 stornode04 > stornode05 > > [ceph_deploy.conf][DEBUG ] found configuration file at: > /root/.cephdeploy.conf > > [ceph_deploy.cli][INFO ] Invoked (1.5.38): /usr/bin/ceph-deploy -v mgr > create stornode03 stornode04 stornode05 > > [ceph_deploy.cli][INFO ] ceph-deploy options: > > [ceph_deploy.cli][INFO ] username : None > > [ceph_deploy.cli][INFO ] verbose : True > > [ceph_deploy.cli][INFO ] mgr : [('stornode03', > 'stornode03'), ('stornode04', 'stornode04'), ('stornode05', 'stornode05')] > > [ceph_deploy.cli][INFO ] overwrite_conf: False > > [ceph_deploy.cli][INFO ] subcommand: create > > [ceph_deploy.cli][INFO ] quiet : False > > [ceph_deploy.cli][INFO ] cd_conf : > > > [ceph_deploy.cli][INFO ] cluster : ceph > > [ceph_deploy.cli][INFO ] func : 0x7f07b31712a8> > > [ceph_deploy.cli][INFO ] ceph_conf : None > > [ceph_deploy.cli][INFO ] default_release : False > > [ceph_deploy.mgr][DEBUG ] Deploying mgr, cluster ceph hosts > stornode03:stornode03 stornode04:stornode04 stornode05:stornode05 > > [ceph_deploy][ERROR ] RuntimeError: bootstrap-mgr keyring not found; run > 'gatherkeys' > > > > root@stornode03:/etc/ceph# ceph-deploy -v gatherkeys stornode03 stornode04 > stornode05 > > [ceph_deploy.conf][DEBUG ] found configuration file at: > /root/.cephdeploy.conf > > [ceph_deploy.cli][INFO ] Invoked (1.5.38): /usr/bin/ceph-deploy -v > gatherkeys stornode03 stornode04 stornode05 > > [ceph_deploy.cli][INFO ] ceph-deploy options: > > [ceph_deploy.cli][INFO ] username : None > > [ceph_deploy.cli][INFO ] verbose : True > > [ceph_deploy.cli][INFO ] overwrite_conf: False > > [ceph_deploy.cli][INFO ] quiet : False > > [ceph_deploy.cli][INFO ] cd_conf : > 
> > [ceph_deploy.cli][INFO ] cluster : ceph > > [ceph_deploy.cli][INFO ] mon : ['stornode03', > 'stornode04', 'stornode05'] > > [ceph_deploy.cli][INFO ] func : gatherkeys at 0x7fac1c8d0aa0> > > [ceph_deploy.cli][INFO ] ceph_conf : None > > [ceph_deploy.cli][INFO ] default_release : False > > [ceph_deploy.gatherkeys][INFO ] Storing keys in temp directory > /tmp/tmpQCCwSb > > [stornode03][DEBUG ] connected to host: stornode03 > > [stornode03][DEBUG ] detect platform information from remote host > > [ceph_deploy.gatherkeys][INFO ] Destroy temp directory /tmp/tmpQCCwSb > > [ceph_deploy][ERROR ] UnsupportedPlatform: Platform is not supported: debian > 9.1 > > > > root@stornode03:/etc/ceph# > This seems to be fixed in ceph-deploy via https://github.com/ceph/ceph-deploy/pull/447, can you try ceph-deploy from master -- Abhishek ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
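Following the suggestion above, ceph-deploy from master can be installed with pip straight from the project's GitHub repository; a sketch (assuming the referenced fix has merged to master):

```shell
# Replace the packaged ceph-deploy with the current master branch, which
# carries the Debian 9 platform fix referenced above (ceph-deploy PR #447).
pip install --upgrade git+https://github.com/ceph/ceph-deploy.git@master
```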
[ceph-users] ceph Cluster attempt to access beyond end of device
Hello,

I found some errors in the cluster with dmesg -T:

attempt to access beyond end of device

I found the following post:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html

Is this a problem with the size of the filesystem itself, or "only" a driver bug? I ask because in each node we have 8 HDDs running on a hardware RAID 6. On this RAID we have the XFS partition.

So we have one big filesystem in 1 OSD in each server, instead of 1 filesystem per HDD across the 8 HDDs in each server.

greetings

Hauke

--
www.w3-creative.de

www.westchat.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] which kernel version support object-map feature from rbd kernel client
Hi All,

I have searched everywhere for some sort of table that shows which kernel version supports which rbd image features, and didn't find any.

Basically I am looking at the latest kernels from kernel.org, and I am thinking of upgrading to 4.12 since it is stable, but I want to make sure I can get rbd images with the object-map feature working with rbd.ko.

If anyone knows, please let me know what kernel version I have to upgrade to in order to get that feature supported by the kernel client.

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com