Hi Behnam,

I would first recommend running a filesystem check on the monitor's disk to see if there are any inconsistencies.
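Something along these lines could work (a minimal sketch: /dev/sdX is a placeholder for the device backing the mon store, and the filesystem should be unmounted first; the flags below make the check read-only):

  # ext4: read-only check, reports inconsistencies without fixing them
  fsck.ext4 -n /dev/sdX

  # XFS: "no modify" mode, reports what xfs_repair would change
  xfs_repair -n /dev/sdX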
Is the disk the monitor is running on a spinning disk or an SSD? If it's an
SSD you should check the wear level stats through smartctl. Maybe trim
(discard) is enabled on the filesystem mount? (discard could cause
problems/corruption in combination with certain SSD firmwares.) Example
commands are below, after the quoted thread.

Caspar

2018-02-16 23:03 GMT+01:00 Behnam Loghmani <[email protected]>:

> I checked the disk that the monitor is on with smartctl and it didn't
> return any error and doesn't show any Current_Pending_Sector.
> Do you recommend any other disk checks to make sure that this disk has a
> problem, so that I can send the report to the provider to replace the disk?
>
> On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum <[email protected]>
> wrote:
>
>> The disk that the monitor is on...there isn't anything for you to
>> configure about a monitor WAL though, so I'm not sure how that enters into
>> it?
>>
>> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
>> [email protected]> wrote:
>>
>>> Thanks for your reply.
>>>
>>> Do you mean that's a problem with the disk I use for the WAL and DB?
>>>
>>> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum <[email protected]>
>>> wrote:
>>>
>>>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>>>>
>>>>> It is a testing cluster and I set it up 2 weeks ago. After some days,
>>>>> I saw that one of the three mons had stopped (out of quorum) and I
>>>>> can't start it anymore.
>>>>> I checked the mon service log and the output shows this error:
>>>>>
>>>>> """
>>>>> mon.XXXXXX@-1(probing) e4 preinit clean up potentially inconsistent
>>>>> store state
>>>>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>>>>> mismatch
>>>>
>>>> This bit is the important one. Your disk is bad and it's feeding back
>>>> corrupted data.
>>>>
>>>>> code = 2 Rocksdb transaction:
>>>>> 0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
>>>>> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
>>>>> MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/
>>>>> MonitorDBStore.h: In function 'void
>>>>> MonitorDBStore::clear(std::set<std::basic_string<char> >&)' thread
>>>>> 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/
>>>>> AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/
>>>>> MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
>>>>> 581: FAILED assert(r >= 0)
>>>>> """
>>>>>
>>>>> The only solution I found was to remove this mon from the quorum,
>>>>> remove all its mon data, and re-add it to the quorum again, after
>>>>> which Ceph goes back to a healthy status.
>>>>>
>>>>> But now, after some days, this mon has stopped again and I face the
>>>>> same problem.
>>>>>
>>>>> My cluster setup is:
>>>>> 4 osd hosts
>>>>> total 8 osds
>>>>> 3 mons
>>>>> 1 rgw
>>>>>
>>>>> This cluster was set up with ceph-volume lvm, with wal/db separation
>>>>> on logical volumes.
>>>>>
>>>>> Best regards,
>>>>> Behnam Loghmani
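As mentioned above, example commands for the wear level and discard checks (a sketch: /dev/sdX is a placeholder, wear attribute names vary by SSD vendor, and the mon store is assumed to live under /var/lib/ceph):

  # Full SMART attribute dump; look for vendor wear attributes such as
  # Wear_Leveling_Count, Media_Wearout_Indicator or Percent_Lifetime_Remain
  smartctl -a /dev/sdX

  # Queue a long self-test, then re-run "smartctl -a" when it finishes
  smartctl -t long /dev/sdX

  # Show the mount options of the filesystem holding the mon store;
  # look for "discard" in the OPTIONS column
  findmnt -T /var/lib/ceph -o TARGET,SOURCE,OPTIONS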
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
