Hi Behnam,

I would first recommend running a filesystem check on the monitor disk to
see if there are any inconsistencies.
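For example, a minimal sketch, assuming the mon store lives on its own XFS
filesystem (the mon id, device path, and mount point below are placeholders
for your setup):

```shell
# Stop the mon first so the store is quiescent (mon id is a placeholder).
systemctl stop ceph-mon@mon01

# Unmount and run a read-only check first; drop -n to actually repair.
# For ext4, the equivalent read-only check is: e2fsck -fn /dev/sdX1
umount /var/lib/ceph/mon
xfs_repair -n /dev/sdX1
```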

Is the disk the monitor is running on a spinning disk or an SSD?

If it is an SSD, you should check the wear-level stats through smartctl.
Is trim (discard) perhaps enabled on the filesystem mount? (discard can
cause problems/corruption in combination with certain SSD firmware.)
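Something along these lines, as a sketch with /dev/sdX as a placeholder
device (the exact wear attribute name varies by SSD vendor and firmware):

```shell
# Print SMART attributes and pick out the vendor-specific wear indicators.
smartctl -A /dev/sdX | grep -iE 'wear|percent_lifetime|media_wearout'

# List filesystems mounted with the discard option (util-linux findmnt).
findmnt -O discard

# Or check the mount table directly.
grep -w discard /proc/mounts
```

If discard shows up on the mon's filesystem, remounting without it and
relying on periodic fstrim instead is a common workaround.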

Caspar

2018-02-16 23:03 GMT+01:00 Behnam Loghmani <[email protected]>:

> I checked the disk that the monitor is on with smartctl and it didn't
> return any errors, and it doesn't have any Current_Pending_Sector.
> Do you recommend any disk checks to make sure that this disk has a
> problem, so that I can send the report to the provider to replace the disk?
>
> On Sat, Feb 17, 2018 at 1:09 AM, Gregory Farnum <[email protected]>
> wrote:
>
>> The disk that the monitor is on... there isn't anything for you to
>> configure about a monitor WAL, though, so I'm not sure how that enters
>> into it?
>>
>> On Fri, Feb 16, 2018 at 12:46 PM Behnam Loghmani <
>> [email protected]> wrote:
>>
>>> Thanks for your reply
>>>
>>> Do you mean that the problem is with the disk I use for WAL and DB?
>>>
>>> On Fri, Feb 16, 2018 at 11:33 PM, Gregory Farnum <[email protected]>
>>> wrote:
>>>
>>>>
>>>> On Fri, Feb 16, 2018 at 7:37 AM Behnam Loghmani <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I have a Ceph cluster version 12.2.2 on CentOS 7.
>>>>>
>>>>> It is a testing cluster that I set up 2 weeks ago.
>>>>> After a few days, I saw that one of the three mons had stopped (out of
>>>>> quorum) and I can't start it anymore.
>>>>> I checked the mon service log, and the output shows this error:
>>>>>
>>>>> """
>>>>> mon.XXXXXX@-1(probing) e4 preinit clean up potentially inconsistent
>>>>> store state
>>>>> rocksdb: submit_transaction_sync error: Corruption: block checksum
>>>>> mismatch
>>>>>
>>>>
>>>> This bit is the important one. Your disk is bad and it’s feeding back
>>>> corrupted data.
>>>>
>>>>
>>>>
>>>>
>>>>> code = 2 Rocksdb transaction:
>>>>>      0> 2018-02-16 17:37:07.041812 7f45a1e52e40 -1
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
>>>>> In function 'void MonitorDBStore::clear(std::set<std::basic_string<char> >&)'
>>>>> thread 7f45a1e52e40 time 2018-02-16 17:37:07.040846
>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/mon/MonitorDBStore.h:
>>>>> 581: FAILED assert(r >= 0)
>>>>> """
>>>>>
>>>>> The only solution I found was to remove this mon from the quorum,
>>>>> delete all of its mon data, and re-add it to the quorum; Ceph then
>>>>> returned to a healthy status.
>>>>>
>>>>> But now, after a few days, this mon has stopped and I face the same
>>>>> problem again.
>>>>>
>>>>> My cluster setup is:
>>>>> 4 osd hosts
>>>>> total 8 osds
>>>>> 3 mons
>>>>> 1 rgw
>>>>>
>>>>> This cluster was set up with ceph-volume lvm, with the wal/db
>>>>> separated onto logical volumes.
>>>>>
>>>>> Best regards,
>>>>> Behnam Loghmani
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> [email protected]
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>>
>>>
>