On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
> On 03/07/2012 23:38, Tommi Virtanen wrote:
> > On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <yann.dup...@univ-nantes.fr> wrote:
> > > In the case I could repair, do you think the crashed FS as it is right
> > > now would be valuable to you, for future reference, since I saw you
> > > can't reproduce the problem? I can make an archive (or a btrfs dump?),
> > > but it will be quite big.
> >  
> >  
> > At this point, it's more about the upstream developers (of btrfs etc)
> > than us; we're on good terms with them but not experts on the on-disk
> > format(s). You might want to send an email to the relevant mailing
> > lists before wiping the disks.
>  
>  
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I
> was talking about Ceph. The underlying FS (btrfs in this case) of one node
> (and only one) PROBABLY crashed in the past, causing corruption in the ceph
> data on this node, and then the subsequent crash of the other nodes.
>  
> RIGHT now btrfs on this node is OK. I can access the filesystem without  
> errors.
>  
> For the moment, out of 8 nodes, 4 refuse to restart.
> One of the 4 nodes was the crashed node; the 3 others didn't have problems
> with the underlying fs as far as I can tell.
>  
> So I think the scenario is:
>  
> One node had a problem with btrfs, leading first to kernel trouble,
> probably corruption (on disk / in memory maybe?), and ultimately to a
> kernel oops. Before that final kernel oops, bad data was transmitted to
> other (sane) nodes, leading to ceph-osd crashes on those nodes.

I don't think that's actually possible: the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't 
produce that kind of corruption through the interfaces we have. :/
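If you want to double-check that it really is the LevelDB stores that have
gone bad, a full scan of each store should surface it. Here's a rough sketch
using the plyvel Python bindings (the store path is just a guess; point it at
wherever your osd keeps its leveldb directory):

    # rough sketch: scan an OSD's LevelDB store to surface corruption
    # the path below is a guess -- adjust to the osd's actual leveldb dir
    import plyvel

    path = "/var/lib/ceph/osd/ceph-0/current/omap"
    try:
        db = plyvel.DB(path, create_if_missing=False)
        count = 0
        for key, value in db:   # a full scan forces every block to be read
            count += 1
        print("ok, %d keys readable" % count)
        db.close()
    except plyvel.Error as err:
        print("error reading store (likely corruption): %s" % err)

LevelDB also ships a repair routine (plyvel exposes it as
plyvel.repair_db(path)); if you try that, take a copy of the store first.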
  
>  
> If you think this scenario is highly improbable in real life (that is,
> btrfs will probably be fixed for good, and then corruption can't happen),
> that's OK.
>  
> But I wonder if this scenario can be triggered by other problems, with bad
> data transmitted to other sane nodes (power outage, out-of-memory
> condition, disk full... for example).
>  
> That's why I offered you a crashed Ceph volume image (I shouldn't have
> talked about a crashed fs, sorry for the confusion).

I appreciate the offer, but I don't think this will help much. What has broken 
is disk state managed by somebody else (LevelDB), not our own logical state. 
If we could figure out how that state got broken that'd be good, but a "ceph 
image" won't really help in doing so.

I wonder if maybe there's a confounding factor here. Are all your nodes 
similar to each other, or are they running on different kinds of hardware? How 
did you do your Ceph upgrades? What does ceph -s display when the cluster is 
running as best it can?
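Something along these lines might make gathering that a bit easier (just a
sketch; the host names are made up, and it assumes you can ssh to each osd
node):

    # quick sketch: dump cluster status plus per-node kernel and ceph versions
    # the node names are placeholders -- substitute your real osd hosts
    import subprocess

    nodes = ["osd-node1", "osd-node2", "osd-node3", "osd-node4"]

    print(subprocess.check_output(["ceph", "-s"]).decode())

    for node in nodes:
        for cmd in ("uname -a", "ceph -v"):
            out = subprocess.check_output(["ssh", node, cmd]).decode().strip()
            print("%s: %s" % (node, out))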
-Greg
