Hello again. I'm sorry to insist on this matter, but I'm wondering whether this message went unnoticed or whether I said something absurd(?). I find it strange that nobody had anything to say.
> * I now know there are corrupted files, and I don't even know how many
> there are. I just know there are at least as many as 'file' detected.
> Some of the files whose inodes fsck.ocfs2 tried to clone fall within
> the time period mentioned above, which suggests something went wrong
> and the cluster wrote parts of different files to the same blocks.
> What could have caused this, and how do I avoid it happening again?
>
> * Should I turn on tracing for a particular bit? Which one?
>
> * How can I monitor OCFS2 health on a running cluster?

I also find it strange what mounted.ocfs2 reports:

[r...@server01 ~]# mounted.ocfs2 -f
Device                FS     Nodes
/dev/sdc1             ocfs2  Unknown, server01

The output is consistent with server02's.

I'd really like to hear your thoughts on this.
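In case it helps to pin down where that "Unknown" comes from, a cross-check along these lines might narrow it down (a rough sketch; I'm assuming the slotmap command is available in debugfs.ocfs2 1.4.3 and that the cluster is named "ocfs2" under configfs):

# what does the filesystem's own slot map say about which slots are in use?
debugfs.ocfs2 -R "slotmap" /dev/sdc1

# which nodes does o2cb actually know about on this box?
/etc/init.d/o2cb status
ls /sys/kernel/config/cluster/ocfs2/node/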
--
Nuno Tavares
DRI, Consultoria Informática
Telef: +351 936 184 086

Nuno Tavares wrote:
> Greetings all,
>
> I'm wondering if anyone can shed some light here.
>
> A few days ago a user reported problems dealing with a specific
> directory. After further investigation, I now suspect there was data
> corruption during a specific time period.
>
> Addressing the initial issue, I checked /var/log/messages just in case,
> and it had lots of messages like this:
>
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_mknod:240 ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_check_dir_entry:111 ERROR:
> bad entry in directory #8075816: directory entry across blocks -
> offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_prepare_dir_for_insert:1734
> ERROR: status = -2
> Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_mknod:240 ERROR: status = -2
> [...]
>
> Indeed, that inode was bound to the problematic directory:
>
> [r...@fsnode01 ~]# debugfs.ocfs2 -R "findpath <8075816>" /dev/sdc1
> 8075816 /storage/problematic/directory/
>
> So I brought the cluster down and requested a filesystem check, which
> dumped a lot of messages like this:
>
> Cluster 1135086 is claimed by the following inodes:
>  /storage/unrelated/file1
>  /storage/unrelated/file2
> [DUP_CLUSTERS_CLONE] Inode "/storage/unrelated/file1" may be cloned or
> deleted to break the claim it has on its clusters. Clone inode
> "/storage/unrelated/file1" to break claims on clusters it shares with
> other inodes? y
> pass1d: Invalid argument passed to OCFS2 library while reading inode to
> clone
>
> Note that last pass1d message. I've checked my tools, and although
> their versions don't match, they are the latest available:
>
> [r...@fsnode01 ~]# rpm -qa | grep ocfs
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.el5-1.4.4-1.el5
> ocfs2console-1.4.3-1.el5
>
> Notice the kernel modules are 1.4.4 and the tools are 1.4.3. Could this
> version mismatch cause the pass1d error? Does it have any consequence?
> I've checked again; those were the only versions available...
>
> I must say /storage/unrelated/* are all PDF files. However, some of
> them are damaged, and using 'file -bi' I've tracked some of those down
> to a time interval between 'Jan 18 09:47' and 'Jan 18 12:24'. I could
> only track these files because 'file' reported a damaged PDF header; I
> can't be sure the remaining ones are all OK, I can only say their
> headers are OK.
>
> Also worth mentioning: there are other files within that time interval
> that seem to be OK (again, I can't be sure). I can't be certain when
> this mess started or when the cluster recovered from it.
>
> I'm almost sure the files were OK when they were about to be stored on
> /storage. This investigation suggests they were damaged *during* their
> existence on /storage. I've now taken appropriate measures to prove
> this in the future.
>
> What is puzzling me is:
>
> * I now know there are corrupted files, and I don't even know how many
> there are. I just know there are at least as many as 'file' detected.
> Some of the files whose inodes fsck.ocfs2 tried to clone fall within
> the time period mentioned above, which suggests something went wrong
> and the cluster wrote parts of different files to the same blocks.
> What could have caused this, and how do I avoid it happening again?
>
> * Should I turn on tracing for a particular bit? Which one?
>
> * How can I monitor OCFS2 health on a running cluster?
>
> Regards,
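PS: For reference, the damaged PDFs were found with a check along these lines (a rough sketch, assuming all the files carry a .pdf extension; 'file' only looks at the header, so a clean result doesn't prove the rest of the file is intact):

# flag PDFs under /storage whose header 'file' no longer recognises as PDF
find /storage -type f -name '*.pdf' -print0 |
while IFS= read -r -d '' f; do
    file -bi "$f" | grep -q 'application/pdf' || echo "suspect: $f"
done

That only catches broken headers, of course, which is exactly why I can't tell how many files are really affected.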
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users