Thanks for the reply. I have now had some more time to experiment with this.

I understand that the best approach is to let the cluster rebuild the entire
OSD, but I am currently using only one replica, and since 2 of my 3 machines
had problems I ended up in a bad situation. With OSDs down on 2 machines and
only one replica, I think I would certainly have lost data if I had rebuilt
them from scratch. Luckily, in my case no new data was being written to the
cluster at the time; I only use it as a NAS in my home lab.

It worked out fine for me this time, but anyone reading this should know that
it is not a recommended way to do things. I got confused because I was reusing
a logical volume as the journal and did not wipe it properly before I ran
"--mkjournal". Wiping it properly and then running "--mkjournal" again seems
to have solved the problem for me.
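
For anyone reading along, here is a rough sketch of that wipe-then-recreate
sequence. It only builds and prints the commands, it does not run them. The
OSD id, the logical volume path, and the volume size are hypothetical
placeholders, and the use of dd to zero the volume is my assumption about what
"wiping properly" means here. The OSD has to be stopped first, and anything
still sitting in the old journal is lost, so the warning above still applies.

```python
import shlex


def journal_rebuild_commands(osd_id, journal_dev, size_mib):
    """Build (not run) the commands to zero a journal volume and recreate
    the journal for one OSD.

    osd_id      -- numeric OSD id (placeholder in the example below)
    journal_dev -- path of the journal device or logical volume (placeholder)
    size_mib    -- size of that volume in MiB, so the whole volume is zeroed
    """
    if size_mib <= 0:
        raise ValueError("size_mib must be positive")
    return [
        # Zero the volume so no stale journal header or entries survive.
        ["dd", "if=/dev/zero", "of=" + journal_dev, "bs=1M",
         "count=" + str(size_mib), "oflag=direct"],
        # Create a fresh, empty journal for this OSD.
        ["ceph-osd", "-i", str(osd_id), "--mkjournal"],
    ]


if __name__ == "__main__":
    # Hypothetical values: OSD 3 with a 1 GiB journal LV.
    for cmd in journal_rebuild_commands(3, "/dev/vg0/journal-osd3", 1024):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands instead of executing them is deliberate: with one
replica there is no second copy to fall back on, so it seems safer to review
each line by hand before running it.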

My only outstanding issue now is one pg that remains inconsistent even after
an attempted repair; apart from that, everything seems to be fine. I haven't
dug into that much yet. With only one replica, I guess it is tricky to tell
which of the replicas is the broken one.
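
A small sketch of how the inconsistent pg could be picked out and a repair
queued. It assumes "ceph health detail" prints one line per affected pg in
the form "pg <pgid> is <state>, acting [...]"; that line shape is an
assumption, and the sample text in the usage part is made up for
illustration, not output from my cluster. "ceph pg repair <pgid>" is the
repair command I mean above.

```python
import re

# Assumed line shape: "pg 2.1f is active+clean+inconsistent, acting [1]"
PG_LINE = re.compile(r"^pg\s+(\d+\.[0-9a-f]+)\s+is\s+(\S+?),")


def inconsistent_pgs(health_detail_text):
    """Return the ids of pgs whose state includes 'inconsistent'."""
    pgs = []
    for line in health_detail_text.splitlines():
        match = PG_LINE.match(line.strip())
        if match and "inconsistent" in match.group(2).split("+"):
            pgs.append(match.group(1))
    return pgs


def repair_commands(pgids):
    """Build (not run) one 'ceph pg repair' command per pg id."""
    return [["ceph", "pg", "repair", pgid] for pgid in pgids]


if __name__ == "__main__":
    # Illustrative sample only, following the assumed format.
    sample = (
        "HEALTH_ERR 1 pgs inconsistent; 1 scrub errors\n"
        "pg 2.1f is active+clean+inconsistent, acting [1]\n"
        "1 scrub errors\n"
    )
    for cmd in repair_commands(inconsistent_pgs(sample)):
        print(" ".join(cmd))
```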

I will add a note to that ticket. It happened when the server lost power
while replicating, and I think that is what corrupted the two journals.
 
Cheers,
Claes



-----Original Message-----
From: Sage Weil [mailto:[email protected]] 
Sent: den 12 januari 2015 15:46
To: Sahlstrom, Claes
Cc: [email protected]
Subject: Re: [ceph-users] Replace corrupt journal

On Sun, 11 Jan 2015, Sahlstrom, Claes wrote:
> Hi,
> 
> I have a problem starting a couple of OSDs because the journal is
> corrupt. Is there any way to replace the journal and keep the rest of
> the OSD intact?

It is risky at best... I would not recommend it!  The safe route is to wipe the 
OSD and let the cluster repair.

>     -1> 2015-01-11 16:02:54.475138 7fb32df86900 -1 journal Unable to 
> read past sequence 8188178 but header indicates the journal has 
> committed up through 8188206, journal is corrupt
> 
>      0> 2015-01-11 16:02:54.479296 7fb32df86900 -1 os/FileJournal.cc: 
> In function 'bool FileJournal::read_entry(ceph::bufferlist&, uint64_t&, 
> bool*)'
> thread 7fb32df86900 time 2015-01-11 16:02:54.475276
> 
> os/FileJournal.cc: 1693: FAILED assert(0)

Do you mind making a note that you saw this on this ticket:

        http://tracker.ceph.com/issues/6003

We see it periodically in QA but have never been able to track it down.  
It could also be caused by a hardware issue, so any information about whether 
the journal device appears damaged would be helpful.

Thanks!
sage
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
