On 06/02/2014 12:41 PM, Felix Lee wrote:
Hi, Craig,
Many thanks for your reply.
The disk has been completely recovered. The filesystem error was caused
by a broken fiber connection (a cable issue); the disk/RAID itself is
healthy, so in our case there was no physical disk error, only
filesystem corruption.
The filesystem itself was recovered with xfs_repair, which found a
couple of lost files and put them into the lost+found directory. The
filesystem should be working fine now, but there may still be some
corrupted files that xfs_repair cannot detect; I can't tell...
On the other hand, I tried 'ceph-osd --mkjournal' to create a new
journal, but although the command completed successfully, the OSD still
complains with the same error, so I am out of options too. I will
completely remove that OSD and create a new one. (Thanks to great Ceph,
we won't lose any data anyway. :) )
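A minimal sketch of the removal half of that remove-and-recreate route.
It assumes the manual removal sequence from the Ceph documentation
(ceph osd out, ceph osd crush remove, ceph auth del, ceph osd rm). The
OSD id 12 and the helper names are hypothetical, and the old ceph-osd
daemon has to be stopped before the CRUSH step. By default the helper
only prints the commands:

```python
import subprocess


def osd_removal_commands(osd_id):
    """Return the manual OSD removal steps as argv lists, in order."""
    name = "osd.%d" % osd_id
    return [
        ["ceph", "osd", "out", str(osd_id)],       # stop mapping data to it
        ["ceph", "osd", "crush", "remove", name],  # drop it from the CRUSH map
        ["ceph", "auth", "del", name],             # delete its cephx key
        ["ceph", "osd", "rm", str(osd_id)],        # remove it from the OSD map
    ]


def remove_osd(osd_id, dry_run=True):
    """Print each step; run it only when dry_run is False."""
    for argv in osd_removal_commands(osd_id):
        print(" ".join(argv))
        if not dry_run:
            subprocess.check_call(argv)
```

Calling remove_osd(12) prints the four commands for review;
remove_osd(12, dry_run=False) would run them on a node that has an
admin keyring.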
Why even try to recover the XFS filesystem? Hardware failure is the
rule, not the exception.
Let Ceph handle the failure. Re-format the XFS filesystem, bring the OSD
up and let the Ceph recovery do its job.
Wido
As for the "D" state process, it still bothers me, but I understand
there is nothing we can do about it: if the process cannot be recovered
by recovering the disk and filesystem or by "kill -SIGHUP", the only
way out is to reboot the machine. That's how the Linux kernel works.
In any case, thanks again for your reply.
Best regards,
Felix Lee ~
On 2014-05-30 23:53, Craig Lewis wrote:
On 5/29/14 01:09 , Felix Lee wrote:
Dear experts,
Recently, a disk behind one of our OSDs failed and took the OSD down.
After I recovered the disk and filesystem, I noticed two problems:
1. Journal corruption, which prevents the OSD from starting:
2. I guess I can use ceph-osd with the "--mkjournal" option to fix the
journal corruption, but another thing bothers me: the previous OSD
daemon is stuck in "D" state, so it can't be terminated. Usually a
process leaves D state once the filesystem has recovered, so I am not
sure what is causing this, or whether I can ignore it without any bad
consequences.
In any case, I would be very grateful if you experts could shed some
light on this.
Our current ceph version is ceph-0.72.2-0.el6.x86_64
The filesystem backend is XFS on fiber direct-attached storage.
I can't speak to the specific errors you're seeing, but it looks like
you have a failing or corrupted disk.
Things I would investigate:
1. Is the disk itself failing? If this were a SATA disk, I'd check the
SMART stats on the disk. I haven't dealt with Fiber Channel disks
since before SMART was standardized, so I can't tell you how to do that.
2. Get rid of the old ceph-osd process. Reboot the node if you have
to. If things come up cleanly, then you're done.
3. Fsck the filesystem. If the FS is clean, then you probably
corrupted the OSD journal.
4. How quickly do you need this fixed? At this point, I'm out of
suggestions, so I'd remove the osd, zap it, and add it back in. If
you can wait, somebody might have a better suggestion.
Fiber Channel hardware is much more complicated than SATA and SAS.
There are a lot more parts involved, which leaves more room for bugs.
If you see this problem come back on the same disk, I'd replace the
disk. If you see this happen again to other disks, I would get your
Fiber Channel vendor involved. It wouldn't hurt to make sure you have
the latest firmware on the disks, enclosure, and FC adapter.
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on