Wow. Why would that take so long? I think you are correct that it's only used for metadata; we could just add a config value to disable it.
-Sam
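
A rough sketch of the config-gated approach, not the actual FileStore code. The `StoreConfig` struct, the `collect_device_uuid` option name, the metadata keys, and the `DeviceLookup` callback are all hypothetical; the callback stands in for the blkid-based lookup so the example builds without libblkid.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical config struct; the option name is invented for illustration.
struct StoreConfig {
  bool collect_device_uuid = true;
};

// Stand-in for the (possibly slow) device lookup. Returns 0 and fills *dev
// on success, or a negative errno-style code on failure.
using DeviceLookup =
    std::function<int(const std::string& uuid, std::string* dev)>;

// Fill the metadata map, skipping the device lookup entirely when the
// option is disabled. The key names here are hypothetical.
void collect_metadata(const StoreConfig& conf,
                      const std::string& fs_uuid,
                      const DeviceLookup& lookup,
                      std::map<std::string, std::string>* pm) {
  assert(pm != nullptr);
  (*pm)["backend"] = "filestore";

  if (!conf.collect_device_uuid) {
    return;  // disabled: the slow lookup is never called
  }

  std::string dev;
  if (lookup && lookup(fs_uuid, &dev) == 0) {
    (*pm)["backend_device"] = dev;
  }
  // On lookup failure the key is simply omitted; the value is informational.
}
```

With the option off, the lookup callback is never invoked, so a hang inside it cannot delay startup. With it on, a failed lookup only means the device key is missing from the metadata.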
On Wed, Sep 23, 2015 at 3:48 PM, Somnath Roy <[email protected]> wrote:
> Sam/Sage,
> I debugged it down and found out that the
> get_device_by_uuid->blkid_find_dev_with_tag() call within
> FileStore::collect_metadata() is hanging for ~3 mins before returning an
> EINVAL. I saw this portion is newly added after hammer.
> Commenting it out resolves the issue. BTW, I saw this value is stored as
> metadata but not used anywhere; am I missing anything?
> Here are my Linux details:
>
> root@emsnode5:~/wip-write-path-optimization/src# uname -a
> Linux emsnode5 3.16.0-38-generic #52~14.04.1-Ubuntu SMP Fri May 8 09:43:57
> UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> root@emsnode5:~/wip-write-path-optimization/src# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 14.04.2 LTS
> Release: 14.04
> Codename: trusty
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, September 16, 2015 2:20 PM
> To: 'Gregory Farnum'
> Cc: 'ceph-devel'
> Subject: RE: Very slow recovery/peering with latest master
>
> Sage/Greg,
>
> Yeah, as we expected, it is probably not happening because of the recovery
> settings. I reverted them in my ceph.conf, but I am still seeing this
> problem.
>
> Some observations:
> ----------------------
>
> 1. First of all, I don't think it is something related to my environment. I
> recreated the cluster with Hammer and this problem is not there.
>
> 2. I have enabled the messenger/monclient log (couldn't attach it here) in
> one of the OSDs and found the monitor is taking a long time to detect the
> up OSDs. If you see the log, I started the OSD at 2015-09-16
> 16:13:07.042463, but there is no communication (only getting KEEP_ALIVE)
> till 2015-09-16 16:16:07.180482, so, 3 mins!!
>
> 3. During this period, I saw monclient trying to communicate with the
> monitor but probably not able to. It is sending osd_boot only at
> 2015-09-16 16:16:07.180482.
>
> 2015-09-16 16:16:07.180450 7f65377fe700 10 monclient: _send_mon_message to
> mon.a at 10.60.194.10:6789/0
> 2015-09-16 16:16:07.180482 7f65377fe700 1 -- 10.60.194.10:6820/20102 -->
> 10.60.194.10:6789/0 -- osd_boot(osd.10 booted 0 features 72057594037927935
> v45) v6 -- ?+0 0x7f6523c19100 con 0x7f6542045680
> 2015-09-16 16:16:07.180496 7f65377fe700 20 -- 10.60.194.10:6820/20102
> submit_message osd_boot(osd.10 booted 0 features 72057594037927935 v45) v6
> remote, 10.60.194.10:6789/0, have pipe.
>
> 4. BTW, the OSD down scenario is detected very quickly (ceph -w output);
> the problem is during coming up, I guess.
>
> So, is something related to mon communication getting slower?
> Let me know if more verbose logging is required and how I should share the
> log.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:[email protected]]
> Sent: Wednesday, September 16, 2015 11:35 AM
> To: Somnath Roy
> Cc: ceph-devel
> Subject: Re: Very slow recovery/peering with latest master
>
> On Tue, Sep 15, 2015 at 8:04 PM, Somnath Roy <[email protected]> wrote:
>> Hi,
>> I am seeing very slow recovery when I am adding OSDs with the latest
>> master. Also, if I just restart all the OSDs (no IO is going on in the
>> cluster), the cluster takes a significant amount of time to reach the
>> active+clean state (and even to detect all the up OSDs).
>>
>> I saw the recovery/backfill default parameters are now changed (to lower
>> values); this probably explains the recovery scenario, but will it affect
>> the peering time during OSD startup as well?
>
> I don't think these values should impact peering time, but you could
> configure them back to the old defaults and see if it changes.
> -Greg
