Bob Peterson <rpete...@redhat.com> writes:

> Hi Daniel,

Hello,


> I took a look at that metadata you sent me, but I didn't find any evidence
> relating to the problem you posted. Either the corruption happened a long
> time prior to your saving of the metadata, or else the metadata was saved
> after an fsck.gfs2 fixed (or attempted to fix) the problem?

- when I first encountered the problem, I did an fsck on the filesystem
  with version 3.1.6 from Ubuntu.

- several days later, the same “dirty_inode: glock -5” messages
  started showing up on the same node as the first time.

- I did an fsck with 3.1.8 built from git

- a few days later, the same node had the “dirty_inode” messages again,
  so I shut down that node and then ran “gfs2_edit savemeta”.

All nodes have the same hardware and the same OS/kernel/pacemaker versions.

> One thing's for sure: I don't see any evidence of wild file system corruption;
> certainly nothing that can account for those errors.
>
> You said the problem seemed to revolve around a gfs2_grow operation,
> right?

Not exactly. I grew the fs live 6 months ago and ran into some
trouble; I did an fsck at that time and the fs ran fine for months.

Then we had the “dirty_inode” troubles starting on Feb 9.

> Can you make sure the lvm2 volume group has the clustered bit set?
> Please do the "vgs" command and see if that volume has "c" listed in its
> flags. If not, it could have caused problems for the gfs2_grow.

Yes, it has the clustered flag.
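
For anyone who wants to script that check, here is a minimal sketch,
assuming the clustered bit is the sixth character of the vg_attr field
(as described in the vgs man page) and that the input is the text printed
by "vgs --noheadings -o vg_name,vg_attr". The function names and the
sample VG names are made up for illustration.

```python
def is_clustered(vg_attr: str) -> bool:
    """Return True if the sixth vg_attr character is 'c' (clustered)."""
    return len(vg_attr) >= 6 and vg_attr[5] == "c"


def clustered_vgs(vgs_output: str) -> dict:
    """Map each VG name to True/False depending on its clustered bit.

    Expects the text printed by: vgs --noheadings -o vg_name,vg_attr
    """
    result = {}
    for line in vgs_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            name, attr = fields[0], fields[1]
            result[name] = is_clustered(attr)
    return result


# Hypothetical sample output; these VG names are not from my cluster.
sample = """
  vg_gfs2   wz--nc
  vg_local  wz--n-
"""
print(clustered_vgs(sample))
```

On a real node the sample string would be replaced by the captured
output of the vgs command above.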

> I've seen problems like this very rarely. Once was a legitimate bug in
> GFS2 that we fixed in RHEL5, but I assume your kernel is newer than
> that.

We have 3.13.0-78-generic from Ubuntu.


[...]

> My only working theory is this:
>
> This might be related to the transition between "unlinked" dinodes and
> "free". After a file is deleted, it goes to "unlinked" and has to be
> transitioned to "free". This sometimes goes wrong because of the way
> it needs to check what other nodes in the cluster are doing.
>
> Maybe: If you have three nodes, and a file was unlinked on node 1, then
> maybe the internode communication got confused and nodes 2 and 3 both
> tried to transition it from Unlinked to Free. That is only a theory, and
> there is absolutely no proof. However, I have a set of patches that are
> experimental, and not even in the upstream kernel yet (hopefully soon!)
> that try to tighten up and fix problems like this. It's much more common
> for multiple nodes to try to transition from Unlinked to Free, and they
> all fail, leaving the file in an "Unlinked" state.

Thanks for the explanation. I will try to re-add the down node to the
cluster and see what happens.

Regards.
-- 
Daniel Dehennin
Get my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF


-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
