It is no more than a three-line script. You will also need leveldb's
Python bindings in your working directory:

```
#!/usr/bin/python2

# Attempt an in-place repair of the (possibly corrupted) LevelDB store.
import leveldb
leveldb.RepairDB('./omap')
```

I totally agree that we need more repair tools to be officially
available, and also tools that provide better insight into components
that are currently a "black box" for the operator, e.g. the journal.

On 24 October 2016 at 19:36, Dan Jakubiec <[email protected]> wrote:
> Thanks Kostis, great read.
>
> We also had a Ceph disaster back in August and a lot of this experience 
> looked familiar.  Sadly, in the end we were not able to recover our cluster 
> but glad to hear that you were successful.
>
> LevelDB corruptions were one of our big problems.  Your note below about 
> running RepairDB from Python is interesting.  At the time we were looking for 
> a Ceph tool to run LevelDB repairs in order to get our OSDs back up and 
> couldn't find one.  I felt like this is something that should be in the 
> standard toolkit.
>
> Would be great to see this added some day, but in the meantime I will 
> remember this option exists.  If you still have the Python script, perhaps 
> you could post it as an example?
>
> Thanks!
>
> -- Dan
>
>
>> On Oct 20, 2016, at 01:42, Kostis Fardelas <[email protected]> wrote:
>>
>> We pulled leveldb from upstream and fired leveldb.RepairDB against the
>> OSD omap directory using a simple Python script. Ultimately, that
>> didn't move things forward. We resorted to checking every object's
>> timestamp/md5sum/attributes on the crashed OSD against the replicas in
>> the cluster, and finally discarded the journal once we had concluded,
>> with as much confidence as possible, that we would not lose data.
>>
>> It would have been really useful at that point to have a tool to
>> inspect the contents of the crashed OSD's journal and limit the scope
>> of the verification process.
>>
>> On 20 October 2016 at 08:15, Goncalo Borges
>> <[email protected]> wrote:
>>> Hi Kostis...
>>> That is a tale from the dark side. Glad you recovered it, and that you
>>> were willing to doc it all up and share it. Thank you for that.
>>> Can I also ask which tool you used to recover the leveldb?
>>> Cheers
>>> Goncalo
>>> ________________________________________
>>> From: ceph-users [[email protected]] on behalf of Kostis 
>>> Fardelas [[email protected]]
>>> Sent: 20 October 2016 09:09
>>> To: ceph-users
>>> Subject: [ceph-users] Surviving a ceph cluster outage: the hard way
>>>
>>> Hello cephers,
>>> this is the blog post about the outage our Ceph cluster experienced
>>> some weeks ago, and about how we managed to revive the cluster and our
>>> clients' data.
>>>
>>> I hope it proves useful for anyone who finds himself/herself in a
>>> similar position. Thanks to everyone on the ceph-users and ceph-devel
>>> lists who contributed to our inquiries during troubleshooting.
>>>
>>> https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/
>>>
>>> Regards,
>>> Kostis
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>