Yes -- the crashed server also mounted CephFS as a client, and it likely had
active writes to the file when it crashed.

I have the max file size set to 17,592,186,044,416 bytes (16 TiB) -- but this
file was only about 5.8TB.
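
For reference, that limit is a per-filesystem CephFS setting -- something along
these lines is how it gets set (a sketch, not my exact history; the fs name
"cephfs" is just a placeholder for whatever yours is called):

# max_file_size is per-filesystem and given in bytes; 17592186044416 = 16 TiB
ceph fs set cephfs max_file_size 17592186044416
# confirm the value currently in effect
ceph fs get cephfs | grep max_file_size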

The likely reason for the crash? The file was used as a fileio backstore for
LIO, exported as an FC LUN that I had connected to an ESXi server and mapped
via RDM to a guest, in which I had a dd if=/dev/zero of=/dev/sdb bs=1M
count=6000000000000 running (for several hours).
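
Roughly how that chain was wired together on the LIO side (a sketch from
memory, not the exact commands -- the backstore name, file path, and target
WWPN below are placeholders):

# the ~6TB file already lived on the CephFS mount; wrap it in a fileio backstore
targetcli /backstores/fileio create name=cephfile file_or_dev=/mnt/cephfs/bigfile
# export it as a LUN under the existing qla2xxx FC target (initiator ACLs omitted)
targetcli /qla2xxx/<target_wwpn>/luns create /backstores/fileio/cephfile
# persist the configuration
targetcli saveconfig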


I think that was breaking at least three "don't do this" rules with Ceph. Once
it moves into production, the pieces will be separated.

On Wed, May 24, 2017 at 4:55 PM, Gregory Farnum <[email protected]> wrote:

> On Wed, May 24, 2017 at 3:15 AM, John Spray <[email protected]> wrote:
> > On Tue, May 23, 2017 at 11:41 PM, Daniel K <[email protected]> wrote:
> >> I have a 20 OSD cluster -- "my first ceph cluster" -- that has another 400
> >> OSDs en route.
> >>
> >> I was "beating up" on the cluster, and had been writing to a 6TB file in
> >> CephFS for several hours, during which I changed the crushmap to better
> >> match my environment, generating a bunch of recovery IO. After about
> 5.8TB
> >> written, one of the OSD(which is also a MON..soon to be rectivied) hosts
> >> crashed that hat 5 OSDs on it, and after rebooting, I have this in ceph
> -s:
> >> (The degraded/misplaced warnings are likely because the cluster hasn't
> >> completed rebalancing after I changed the crushmap I assume)
> >>
> >
> > Having a quarter of your OSDs down while simultaneously rebalancing
> > after editing your CRUSH map is brutal for a Ceph cluster, and
> > I would expect it to impact your client IO severely.
> >
> > I see that you've got 112MB/s of recovery going on, which may or may
> > not be saturating some links depending on whether you're using 1gig or
> > 10gig networking.
> >
> >> 2017-05-23 18:33:13.775924 7ff9d3230700 -1 WARNING: the following
> >> dangerous and experimental features are enabled: bluestore
> >> 2017-05-23 18:33:13.781732 7ff9d3230700 -1 WARNING: the following
> >> dangerous and experimental features are enabled: bluestore
> >>     cluster e92e20ca-0fe6-4012-86cc-aa51e0466661
> >>      health HEALTH_WARN
> >>             440 pgs backfill_wait
> >>             7 pgs backfilling
> >>             85 pgs degraded
> >>             5 pgs recovery_wait
> >>             85 pgs stuck degraded
> >>             452 pgs stuck unclean
> >>             77 pgs stuck undersized
> >>             77 pgs undersized
> >>             recovery 196526/3554278 objects degraded (5.529%)
> >>             recovery 1690392/3554278 objects misplaced (47.559%)
> >>             mds0: 1 slow requests are blocked > 30 sec
> >>      monmap e4: 3 mons at
> >> {stor-vm1=10.0.15.51:6789/0,stor-vm2=10.0.15.52:6789/0,
> >> stor-vm3=10.0.15.53:6789/0}
> >>             election epoch 136, quorum 0,1,2 stor-vm1,stor-vm2,stor-vm3
> >>       fsmap e21: 1/1/1 up {0=stor-vm4=up:active}
> >>         mgr active: stor-vm1 standbys: stor-vm2
> >>      osdmap e4655: 20 osds: 20 up, 20 in; 450 remapped pgs
> >>             flags sortbitwise,require_jewel_osds,require_kraken_osds
> >>       pgmap v192589: 1428 pgs, 5 pools, 5379 GB data, 1345 kobjects
> >>             11041 GB used, 16901 GB / 27943 GB avail
> >>             196526/3554278 objects degraded (5.529%)
> >>             1690392/3554278 objects misplaced (47.559%)
> >>                  975 active+clean
> >>                  364 active+remapped+backfill_wait
> >>                   76 active+undersized+degraded+remapped+backfill_wait
> >>                    3 active+recovery_wait+degraded+remapped
> >>                    3 active+remapped+backfilling
> >>                    3 active+degraded+remapped+backfilling
> >>                    2 active+recovery_wait+degraded
> >>                    1 active+clean+scrubbing+deep
> >>                    1 active+undersized+degraded+remapped+backfilling
> >> recovery io 112 MB/s, 28 objects/s
> >>
> >>
> >> Seems related to the "corrupted rbd filesystems since jewel" thread.
> >>
> >>
> >> log entries on the MDS server:
> >>
> >> 2017-05-23 18:27:12.966218 7f95ed6c0700  0 log_channel(cluster) log [WRN] :
> >> slow request 243.113407 seconds old, received at 2017-05-23 18:23:09.852729:
> >> client_request(client.204100:5 getattr pAsLsXsFs #100000003ec 2017-05-23
> >> 17:48:23.770852 RETRY=2 caller_uid=0, caller_gid=0{}) currently failed to
> >> rdlock, waiting
> >>
> >>
> >> output of ceph daemon mds.stor-vm4 objecter_requests (changes each time
> >> I run it):
> >
> > If that changes each time you run it then it means the OSD requests
> > from the MDS are happening.
> >
> > However, it's possible that you have multiple clients and one of them
> > is stuck trying to write something back (to a PG that is not accepting
> > the write (yet?)), and thereby preventing the MDS from granting a lock
> > for another client.
>
> Given that the MDS is running a stat on the object, and given the server
> crash you mentioned, I'm assuming it's probing the file size. Daniel, was
> that crashed server also mounting CephFS as a client? (Or did you have
> another client that went away?)
>
> (Also note that if I'm interpreting it correctly, that object is the
> 0x3efb9f=4127647th in the file, or about 17TB worth. That doesn't look
> like a journal ino to me so it must be a very large user file?)
> -Greg
>
> >
> > What clients (+versions) are involved, what's the workload, what
> > versions of Ceph?
> >
> > John
> >
> >> root@stor-vm4:/var/log/ceph# ceph daemon mds.stor-vm4 objecter_requests
> >> {
> >>     "ops": [
> >>         {
> >>             "tid": 66700,
> >>             "pg": "1.60e95c32",
> >>             "osd": 4,
> >>             "object_id": "100000003ec.003efb9f",
> >>             "object_locator": "@1",
> >>             "target_object_id": "100000003ec.003efb9f",
> >>             "target_object_locator": "@1",
> >>             "paused": 0,
> >>             "used_replica": 0,
> >>             "precalc_pgid": 0,
> >>             "last_sent": "1.47461e+06s",
> >>             "attempts": 1,
> >>             "snapid": "head",
> >>             "snap_context": "0=[]",
> >>             "mtime": "1969-12-31 19:00:00.000000s",
> >>             "osd_ops": [
> >>                 "stat"
> >>             ]
> >>         }
> >>     ],
> >>     "linger_ops": [],
> >>     "pool_ops": [],
> >>     "pool_stat_ops": [],
> >>     "statfs_ops": [],
> >>     "command_ops": []
> >> }
> >>
> >>
> >> I've tried restarting the mds daemon ( systemctl stop ceph-mds\*.service
> >> ceph-mds.target &&  systemctl start ceph-mds\*.service ceph-mds.target )
> >>
> >>
> >>
> >> IO to the file that was being accessed when the host crashed is blocked.
> >>
> >>
> >> Suggestions?
> >>
> >>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
