On Wed, Aug 16, 2017 at 4:04 AM Sean Purdy <[email protected]> wrote:
> On Tue, 15 Aug 2017, Gregory Farnum said:
> > On Tue, Aug 15, 2017 at 4:23 AM Sean Purdy <[email protected]> wrote:
> > > I have a three node cluster with 6 OSDs and 1 mon per node.
> > >
> > > I had to turn off one node for rack reasons. While the node was down, the
> > > cluster was still running and accepting files via radosgw. However, when I
> > > turned the machine back on, radosgw uploads stopped working and things like
> > > "ceph status" started timing out. It took 20 minutes for "ceph status" to
> > > be OK.
> > >
> > > 2017-08-15 11:28:29.835943 7fdf2d74b700  0 monclient(hunting): authenticate timed out after 300
> > > 2017-08-15 11:28:29.835993 7fdf2d74b700  0 librados: client.admin authentication error (110) Connection timed out
> >
> > That just means the client couldn't connect to an in-quorum monitor. It
> > should have tried them all in sequence though — did you check if you had
> > *any* functioning quorum?
>
> There was a functioning quorum - I checked with "ceph --admin-daemon
> /var/run/ceph/ceph-mon.xxx.asok quorum_status". Well - I interpreted the
> output as functioning. There was a nominated leader.

Did you try running "ceph -s" from more than one location? If you had a
functioning quorum, that should have worked, and any live clients should
have been able to keep working.

> > > 2017-08-15 11:23:07.180123 7f11c0fcc700  0 -- 172.16.0.43:0/2471 >>
> > > 172.16.0.45:6812/1904 conn(0x556eeaf4d000 :-1
> > > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
> > > l=0).handle_connect_reply connect got BADAUTHORIZER
> >
> > This one's odd. We did get one report of seeing something like that, but I
> > tend to think it's a clock sync issue.
>
> I saw some messages about clock sync, but ntpq -p looked OK on each
> server. Will investigate further.
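For reference, "there was a nominated leader" is only half the test; a mon can report a leader while too few peers are in quorum to serve clients. A minimal sketch of reading the quorum_status JSON (the "quorum" list of mon ranks and "quorum_leader_name" are real fields in the admin socket output; the sample values below are illustrative, not from this cluster):

```python
import json

def quorum_ok(raw_json, total_mons=3):
    """True if a leader is elected and a majority of mons are in quorum."""
    status = json.loads(raw_json)
    in_quorum = status.get("quorum", [])          # list of mon ranks in quorum
    leader = status.get("quorum_leader_name", "")  # empty string if no leader
    return bool(leader) and len(in_quorum) > total_mons // 2

# Trimmed, illustrative quorum_status output (hypothetical mon name):
sample = '{"quorum": [0, 1, 2], "quorum_leader_name": "store01"}'
print(quorum_ok(sample))  # True: leader elected, 3 of 3 mons in quorum
```

With one node down you'd expect `"quorum": [0, 1]` (or similar) and still get True; a single rank in the list means no usable quorum.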
>      remote           refid      st t  when poll reach   delay   offset  jitter
> ==============================================================================
> +172.16.0.16     129.250.35.250   3 u   847 1024  377    0.289    1.103   0.376
> +172.16.0.18     80.82.244.120    3 u    93 1024  377    0.397   -0.653   1.040
> *172.16.0.19     158.43.128.33    2 u   279 1024  377    0.244    0.262   0.158

> > > ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-9: (2) No
> > > such file or directory
> >
> > And that would appear to be something happening underneath Ceph, wherein
> > your data wasn't actually all the way mounted or something?
>
> It's the machine mounting the disks at boot time - udev or ceph-osd.target
> keeps retrying until eventually the disk/OSD is mounted. Or eventually it
> gives up. Do the OSDs need a monitor quorum at startup? It kept
> restarting OSDs for 20 mins.

I think they'll keep trying to connect, but they may eventually time out; or
if they get a sufficiently mean response (such as BADAUTHORIZER) they may
shut down on their own.

> Timing went like this:
>
> 11:22 node boot
> 11:22 ceph-mon starts, recovers logs, compaction, first BADAUTHORIZER message
> 11:22 starting disk activation for 18 partitions (3 per bluestore)
> 11:23 mgr on other node can't find secret_id
> 11:43 bluefs mount succeeded on OSDs, ceph-osds go live
> 11:45 last BADAUTHORIZER message in monitor log
> 11:45 this host calls and wins a monitor election, mon_down health check clears
> 11:45 mgr happy

The timing there on the mounting (how does it take 20 minutes?!?!?) and
everything working again certainly is suspicious. It's not the direct cause
of the issue, but there may be something else going on which is causing both
of them.

All in all, I'm confused. The monitor being on ext4 can't influence this in
any way I can imagine.
-Greg

> > Anyway, it should have survived that transition without any noticeable
> > impact (unless you are running so close to capacity that merely getting the
> > downed node up-to-date overwhelmed your disks/cpu). But without some basic
> > information about what the cluster as a whole was doing I couldn't
> > speculate.
>
> This is a brand new 3 node cluster: Dell R720s running Debian 9, with 2x
> SSD for OS and ceph-mon plus 6x 2TB SATA for ceph-osd using bluestore, per
> node, running radosgw as the object store layer. The only activity is a
> single-threaded test job uploading millions of small files over S3. There
> are about 5.5 million test objects so far (plus 3x replication). This job
> was fine while the machine was down, and stalled when the machine booted.
>
> Looking at activity graphs at the time, there didn't seem to be a network
> bottleneck, CPU issue, or disk throughput bottleneck. But I'll look a bit
> closer.
>
> ceph-mon is on an ext4 filesystem though. Perhaps I should move this to
> xfs? Bluestore is xfs+bluestore.
>
> I presume it's a monitor issue somehow.
>
> > -Greg
>
> Thanks for your input.
>
> Sean
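As a sanity check on the ntpq output quoted earlier in the thread: ntpq -p reports offset in milliseconds, while Ceph's monitor clock drift tolerance (mon_clock_drift_allowed) defaults to 0.05 s, i.e. 50 ms. A back-of-envelope comparison, assuming that default:

```python
# mon_clock_drift_allowed defaults to 0.05 s (50 ms) in Ceph;
# ntpq -p reports offsets in milliseconds.
MON_CLOCK_DRIFT_ALLOWED_MS = 50.0

def clocks_within_tolerance(offsets_ms):
    """True if every reported NTP offset is inside Ceph's drift allowance."""
    return all(abs(o) <= MON_CLOCK_DRIFT_ALLOWED_MS for o in offsets_ms)

# Offsets from the ntpq -p output in this thread (ms):
print(clocks_within_tolerance([1.103, -0.653, 0.262]))  # True: well under 50 ms
```

By this measure the three servers' clocks were roughly two orders of magnitude inside the tolerance, which supports the view that BADAUTHORIZER here was probably not simple clock skew.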
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
