Hi,

you could try reducing "osd map message max". Some code paths end up
returning -EIO (kernel: libceph: mon1 *** io error) when an incoming message
exceeds the limits in include/linux/ceph/libceph.h:
CEPH_MSG_MAX_{FRONT,MIDDLE,DATA}_LEN.
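
Rough arithmetic from the mon log quoted below, to show why the limit
matters (the 16 MB figure is from my reading of the kernel header - please
verify against your source tree): the client is subscribed at osdmap 1254102
while the mon is at e1254390, and the mon packs up to "osd map message max"
incrementals into one MOSDMap reply ("build_incremental [1254102..1254141]"
is a batch of 40, which matches the Luminous default, IIRC).  Roughly half
of those incrementals are ~1.2 MB each, so:

    ~20-40 incrementals/message * ~1.2 MB/incremental  =  ~24-48 MB
    CEPH_MSG_MAX_FRONT_LEN                              =   16 MB

i.e. a single reply from the mon can be far larger than what the kernel
messenger will accept, and the session gets torn down with -EIO.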

This "worked for us" - YMMV.
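
Something along these lines is what I mean - the value itself is
illustrative rather than scientific; pick one that keeps
(osd map message max) * (your typical incremental size) comfortably under
the kernel limit.  I believe newer Ceph releases default to 10 for much the
same reason:

    # ceph.conf on the mons
    [global]
        osd map message max = 10

    # or injected at runtime:
    ceph tell mon.* injectargs '--osd_map_message_max=10'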

-KJ

On Tue, Jan 15, 2019 at 6:14 AM Andras Pataki <[email protected]>
wrote:

> An update on our cephfs kernel client troubles.  After doing some heavier
> testing with the newer 4.19.13 kernel, it seems that it also gets into a bad
> state where it can't connect to the monitors (all back-end processes are on
> 12.2.8):
>
> Jan 15 08:49:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Jan 15 08:49:01 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> established
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
> Jan 15 08:49:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> lost, hunting for new mon
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Jan 15 08:49:04 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Jan 15 08:49:06 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Jan 15 08:49:07 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> ... repeating forever ...
>
> # uname -r
> 4.19.13
>
> and on the mon node (10.128.150.10) at log level 20, I see that it is
> building/encoding a lot of maps (10.128.36.18 is the client in question):
> 2019-01-15 08:49:01.355017 7fffee40c700 10 mon.cephmon00@0(leader) e40
> _ms_dispatch new session 0x555562dc6c00 MonSession(client.36800361
> 10.128.36.18:0/2911716500 is open , features 0x27018fb86aa42ada (jewel))
> features 0x27018fb86aa42ada
> 2019-01-15 08:49:01.355021 7fffee40c700 20 mon.cephmon00@0(leader) e40
> caps
> 2019-01-15 08:49:01.355026 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v58457 preprocess_query auth(proto 0 34 bytes epoch 0) from client.36800361
> 10.128.36.18:0/2911716500
> <snip>
> 2019-01-15 08:49:01.355817 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1254390 check_osdmap_sub 0x555565373340 next 1254102 (onetime)
> 2019-01-15 08:49:01.355819 7fffee40c700  5 mon.cephmon00@0(leader).osd
> e1254390 send_incremental [1254102..1254390] to client.36800361
> 10.128.36.18:0/2911716500
> 2019-01-15 08:49:01.355821 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1254390 build_incremental [1254102..1254141] with features 27018fb86aa42ada
> 2019-01-15 08:49:01.364859 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254141 with features 504412504116439552
> 2019-01-15 08:49:01.372131 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incremental    inc 1254141 1237271 bytes
> 2019-01-15 08:49:01.372180 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254140 with features 504412504116439552
> 2019-01-15 08:49:01.372187 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incremental    inc 1254140 260 bytes
> 2019-01-15 08:49:01.380981 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254139 with features 504412504116439552
> 2019-01-15 08:49:01.387983 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incremental    inc 1254139 1237351 bytes
> 2019-01-15 08:49:01.388043 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254138 with features 504412504116439552
> 2019-01-15 08:49:01.388049 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incremental    inc 1254138 232 bytes
> 2019-01-15 08:49:01.396781 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254137 with features 504412504116439552
> <snip> ... a lot more of similar messages
> 2019-01-15 08:49:04.210936 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 reencode_incremental_map 1254382 with features 504412504116439552
> 2019-01-15 08:49:04.211032 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1254390 build_incremental    inc 1254382 232 bytes
> 2019-01-15 08:49:04.211066 7fffee40c700 10 mon.cephmon00@0(leader) e40
> ms_handle_reset 0x55556450f800 10.128.36.18:0/2911716500
> 2019-01-15 08:49:04.211070 7fffee40c700 10 mon.cephmon00@0(leader) e40
> reset/close on session client.36800361 10.128.36.18:0/2911716500
> 2019-01-15 08:49:04.211073 7fffee40c700 10 mon.cephmon00@0(leader) e40
> remove_session 0x555562dc6c00 client.36800361 10.128.36.18:0/2911716500
> features 0x27018fb86aa42ada
>
> It looks like the client disconnects (either because it waited too long, or
> because of some protocol error?).  Any hints on why so many maps need to be
> re-encoded (to jewel), or on how to improve this behavior, would be much
> appreciated.  We would really like to use the kernel client instead of fuse,
> but this seems to be a stumbling block.
>
> Thanks,
>
> Andras
>
>
> On 1/3/19 6:49 AM, Andras Pataki wrote:
>
> I wonder if anyone could offer any insight on the issue below, regarding
> the CentOS 7.6 kernel cephfs client connecting to a Luminous cluster.  I
> have since tried a much newer 4.19.13 kernel, which did not show the same
> issue (but unfortunately for various reasons unrelated to ceph, we can't go
> to such a new kernel).
>
> Am I reading it right that the monitor somehow thinks this kernel is old
> and needs to prepare special maps in some older format for it, and that
> this takes so long that the kernel just gives up (or perhaps hits some
> other communication protocol error)?  It seems like each of these mon
> communication sessions only lasts about half a second.  Then the client
> reconnects to another mon, gets the same result, and so on.  Any way around
> this?
>
> Andras
>
>
> On 12/26/18 7:55 PM, Andras Pataki wrote:
>
> We've been using ceph-fuse with a pretty good stability record (against
> the Luminous 12.2.8 back end).  Unfortunately ceph-fuse has extremely poor
> small file performance (understandably), so we've been testing the kernel
> client.  The latest RedHat kernel 3.10.0-957.1.3.el7.x86_64 seems to work
> pretty well, as long as the cluster is running in a completely clean
> state.  However, it seems that as soon as there is something happening to
> the cluster, the kernel client crashes pretty badly.
>
> Today's example: I reweighted some OSDs to balance the disk usage a bit
> (set nobackfill, reweight the OSDs, check the new hypothetical space usage,
> then unset nobackfill).  As soon as the reweighting procedure started, the
> kernel client went into an infinite loop, unsuccessfully trying to connect
> to the mons:
>
> Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
> Dec 26 19:28:53 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> lost, hunting for new mon
> Dec 26 19:28:53 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Dec 26 19:28:58 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> Dec 26 19:28:58 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Dec 26 19:28:59 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Dec 26 19:28:59 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> Dec 26 19:28:59 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> established
> Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
> Dec 26 19:29:00 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> lost, hunting for new mon
> Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Dec 26 19:29:00 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> Dec 26 19:29:00 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> established
> Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 io error
> Dec 26 19:29:01 mon5 kernel: libceph: mon0 10.128.150.10:6789 session
> lost, hunting for new mon
> Dec 26 19:29:01 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> established
> Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 io error
> Dec 26 19:29:02 mon5 kernel: libceph: mon2 10.128.150.12:6789 session
> lost, hunting for new mon
> Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> established
> Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 io error
> Dec 26 19:29:02 mon5 kernel: libceph: mon1 10.128.150.11:6789 session
> lost, hunting for new mon
> ... etc ...
>
> seemingly never recovering.  The cluster is healthy, all other clients are
> successfully doing I/O:
>
> [root@cephmon00 ceph]# ceph -s
>   cluster:
>     id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
>     health: HEALTH_WARN
>             noout flag(s) set
>             1 backfillfull osd(s)
>             4 pool(s) backfillfull
>             239119058/12419244975 objects misplaced (1.925%)
>
>   services:
>     mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>     mgr: cephmon00(active)
>     mds: cephfs-1/1/1 up  {0=cephmds00=up:active}, 1 up:standby
>     osd: 3534 osds: 3534 up, 3534 in; 5040 remapped pgs
>          flags noout
>
>   data:
>     pools:   5 pools, 50688 pgs
>     objects: 2.51G objects, 7.77PiB
>     usage:   19.1PiB used, 5.90PiB / 25.0PiB avail
>     pgs:     239119058/12419244975 objects misplaced (1.925%)
>              45639 active+clean
>              2914  active+remapped+backfilling
>              2126  active+remapped+backfill_wait
>              9     active+clean+scrubbing+deep
>
>   io:
>     client:   10.3MiB/s rd, 2.23GiB/s wr, 32op/s rd, 3.77kop/s wr
>     recovery: 114GiB/s, 33.12kobjects/s
>
>
> The client machine in question is also otherwise healthy (not out of
> memory, etc.).  I've checked the osd blacklist; there is nothing on it.
> There is nothing suspicious in the mon logs regarding this client either.
> With one of the mons set to log level 20, here is the relevant section from
> when this client tries to connect:
>
> 2018-12-26 19:51:46.203264 7fffeb406700 10 mon.cephmon00@0(leader) e40
> ms_verify_authorizer 10.128.36.18:0/3882984371 client protocol 0
> 2018-12-26 19:51:46.206645 7fffee40c700 10 mon.cephmon00@0(leader) e40
> _ms_dispatch new session 0x5555722bb980 MonSession(client.36398604
> 10.128.36.18:0/3882984371 is open , features 0x27018fb86aa42ada (jewel))
> features 0x27018fb86aa42ada
> 2018-12-26 19:51:46.206651 7fffee40c700 20 mon.cephmon00@0(leader) e40
> caps
> 2018-12-26 19:51:46.206657 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v57602 preprocess_query auth(proto 0 34 bytes epoch 0) from client.36398604
> 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.206660 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v57602 prep_auth() blob_size=34
> 2018-12-26 19:51:46.206692 7fffee40c700  2 mon.cephmon00@0(leader) e40
> send_reply 0x555562fee5a0 0x55556f398b40 auth_reply(proto 2 0 (0) Success)
> v1
> 2018-12-26 19:51:46.206867 7fffee40c700 20 mon.cephmon00@0(leader) e40
> _ms_dispatch existing session 0x5555722bb980 for client.36398604
> 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.206870 7fffee40c700 20 mon.cephmon00@0(leader) e40
> caps
> 2018-12-26 19:51:46.206877 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v57602 preprocess_query auth(proto 2 128 bytes epoch 0) from
> client.36398604 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.206882 7fffee40c700 10 mon.cephmon00@0(leader).auth
> v57602 prep_auth() blob_size=128
> 2018-12-26 19:51:46.206979 7fffee40c700  2 mon.cephmon00@0(leader) e40
> send_reply 0x555562fee5a0 0x55571f9c4d80 auth_reply(proto 2 0 (0) Success)
> v1
> 2018-12-26 19:51:46.207178 7fffee40c700 20 mon.cephmon00@0(leader) e40
> _ms_dispatch existing session 0x5555722bb980 for client.36398604
> 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.207182 7fffee40c700 20 mon.cephmon00@0(leader) e40
> caps allow *
> 2018-12-26 19:51:46.207184 7fffee40c700 20 is_capable service=mon command=
> read on cap allow *
> 2018-12-26 19:51:46.207185 7fffee40c700 20  allow so far , doing grant
> allow *
> 2018-12-26 19:51:46.207186 7fffee40c700 20  allow all
> 2018-12-26 19:51:46.207188 7fffee40c700 10 mon.cephmon00@0(leader) e40
> handle_subscribe mon_subscribe({mdsmap=72103+,monmap=41+,osdmap=1207891}) v2
> 2018-12-26 19:51:46.207194 7fffee40c700 10 mon.cephmon00@0(leader) e40
> handle_subscribe: MDS sub 'mdsmap'
> 2018-12-26 19:51:46.207195 7fffee40c700 20 is_capable service=mds command=
> read on cap allow *
> 2018-12-26 19:51:46.207197 7fffee40c700 20  allow so far , doing grant
> allow *
> 2018-12-26 19:51:46.207198 7fffee40c700 20  allow all
> 2018-12-26 19:51:46.207200 7fffee40c700 20 mon.cephmon00@0(leader).mds
> e72102 check_sub: mdsmap
> 2018-12-26 19:51:46.207204 7fffee40c700 10 mon.cephmon00@0(leader).monmap
> v40 check_sub monmap next 41 have 40
> 2018-12-26 19:51:46.207207 7fffee40c700 20 is_capable service=osd command=
> read on cap allow *
> 2018-12-26 19:51:46.207208 7fffee40c700 20  allow so far , doing grant
> allow *
> 2018-12-26 19:51:46.207209 7fffee40c700 20  allow all
> 2018-12-26 19:51:46.207211 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1208017 check_osdmap_sub 0x55592f513d00 next 1207891 (onetime)
> 2018-12-26 19:51:46.207213 7fffee40c700  5 mon.cephmon00@0(leader).osd
> e1208017 send_incremental [1207891..1208017] to client.36398604
> 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.207217 7fffee40c700 10 mon.cephmon00@0(leader).osd
> e1208017 build_incremental [1207891..1207930] with features 27018fb86aa42ada
> 2018-12-26 19:51:46.220019 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1208017 reencode_incremental_map 1207930 with features 504412504116439552
> 2018-12-26 19:51:46.230217 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1208017 build_incremental    inc 1207930 1146701 bytes
> 2018-12-26 19:51:46.230349 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1208017 reencode_incremental_map 1207929 with features 504412504116439552
> 2018-12-26 19:51:46.232523 7fffee40c700 20 mon.cephmon00@0(leader).osd
> e1208017 build_incremental    inc 1207929 175613 bytes
> ... a lot more of reencode_incremental stuff ...
> 2018-12-26 19:51:46.745394 7fffee40c700 10 mon.cephmon00@0(leader) e40
> ms_handle_reset 0x5555637cf800 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.745395 7ffff0c11700 10 mon.cephmon00@0(leader).log
> v79246823 encode_full log v 79246823
> 2018-12-26 19:51:46.745469 7ffff0c11700 10 mon.cephmon00@0(leader).log
> v79246823 encode_pending v79246824
> 2018-12-26 19:51:46.745763 7fffee40c700 10 mon.cephmon00@0(leader) e40
> reset/close on session client.36398604 10.128.36.18:0/3882984371
> 2018-12-26 19:51:46.745769 7fffee40c700 10 mon.cephmon00@0(leader) e40
> remove_session 0x5555722bb980 client.36398604 10.128.36.18:0/3882984371
> features 0x27018fb86aa42ada
>
> Any pointers to what to do here?
>
> Andras
>


-- 
Kjetil Joergensen <[email protected]>
SRE, Medallia Inc
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
