TL;DR: my cluster is working now. Details and further problems below:

Jan Kasprzak wrote:
: I did
: 
: ceph tell mon.* config set mon_sync_max_payload_size 4096
: ceph config set mon mon_sync_max_payload_size 4096
: 
: and added "mon_sync_max_payload_size = 4096" into the [global] section
: of the mon host to be (re-)added, and ran
: 
: systemctl restart ceph-mon@mon1.service
: 
: on that host. But it did not help - mon1 did not join the cluster.

        I let it settle overnight, and apparently about four hours
after I did the above (leaving ceph-mon running without touching anything
further), the newly configured mon successfully joined the cluster.

        So I upgraded mon1 to Quincy, then went on and tried to upgrade
another mon, named mon3. I also had to remove it from the cluster and
reinitialize its data directory (apparently the Quincy mon cannot handle my
leveldb store and just crashed, so I mkfs'd a new rocksdb data directory).
Mon3 got registered to the cluster successfully about 10-20 seconds after
the start, but it could not join the quorum. There were the following
lines in the log file, repeated every second or so:

2022-09-09T09:01:52.611+0200 7f8ecc968700  0 log_channel(cluster) log [INF] : mon.mon3 calling monitor election
2022-09-09T09:01:52.611+0200 7f8ecc968700  1 paxos.2).electionLogic(62993) init, last seen epoch 62993, mid-election, bumping
2022-09-09T09:01:52.661+0200 7f8ecc968700  1 mon.mon3@2(electing) e33 collect_metadata md127:  no unique device id for md127: fallback method has no model nor serial'

(/dev/md127 is my root filesystem, RAID-1)
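
For anyone who wants to quantify an election loop like the one above, here is
a minimal sketch that counts "calling monitor election" lines per minute in a
mon log. The function name elections_per_minute is made up for illustration,
and the log path in the usage comment is only the usual default location, so
treat both as assumptions. It relies on the timestamp format shown in the
lines above, where the first 16 characters are the date plus hour and minute.

```shell
# Hypothetical helper: print how many election calls were logged per minute.
# $1 is the path to a ceph-mon log file with one log entry per line.
elections_per_minute() {
    grep 'calling monitor election' "$1" \
        | cut -c1-16 \
        | sort \
        | uniq -c
}

# Usage on the mon host (path is an assumption, adjust to your setup):
#   elections_per_minute /var/log/ceph/ceph-mon.mon3.log
```

A steady count of several dozen per minute would match the "repeated every
second or so" pattern; a count that drops to zero suggests the elections have
settled.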

When I stopped ceph-mon@mon3.service, ceph -s reported many slow ops
on the other two mons, mon1 and mon2, for a while. After another mkfs
and several restarts, mon3 entered the quorum successfully. I don't know
what made the difference.
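
Since a mon can be registered in the cluster without being in the quorum, it
may help to check quorum membership explicitly instead of waiting and
watching ceph -s. A rough sketch, assuming the JSON printed by
"ceph quorum_status --format json" contains a "quorum_names" array (it does
on the releases I know of, but verify on yours). The helper name
mon_in_quorum is hypothetical; it only parses JSON from stdin and does not
call ceph itself.

```shell
# Hypothetical helper: exit 0 if the mon named $1 is listed in the
# "quorum_names" array of the JSON document read from stdin.
mon_in_quorum() {
    # Strip all whitespace so compact and pretty-printed JSON look the same,
    # isolate the quorum_names array, then look for the quoted mon name.
    tr -d '[:space:]' \
        | grep -o '"quorum_names":\[[^]]*]' \
        | grep -q "\"$1\""
}

# Usage on a live cluster:
#   ceph quorum_status --format json | mon_in_quorum mon3 \
#       && echo "mon3 is in quorum"
```

The text matching is deliberately crude to avoid extra dependencies; a real
JSON parser would be more robust if one is available on the host.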

So I also upgraded mon2, with similar problems, but it eventually
managed to join the cluster and the quorum as well.

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
