Greetings all, I'm wondering if anyone can shed some light here.
A few days ago a user reported problems with a specific directory. After further investigation, I now suspect there was data corruption during a specific time period.

Addressing the initial issue, I checked /var/log/messages just in case, and it had lots of messages like this:

  Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_check_dir_entry:111 ERROR: bad entry in directory #8075816: directory entry across blocks - offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
  Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_prepare_dir_for_insert:1734 ERROR: status = -2
  Jan 18 14:24:40 fsnode01 kernel: (4626,1):ocfs2_mknod:240 ERROR: status = -2
  Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_check_dir_entry:111 ERROR: bad entry in directory #8075816: directory entry across blocks - offset=0, inode=1164370764863544510, rec_len=57912, name_len=167
  Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_prepare_dir_for_insert:1734 ERROR: status = -2
  Jan 18 14:24:59 fsnode01 kernel: (4264,0):ocfs2_mknod:240 ERROR: status = -2
  [...]

Indeed, that inode was bound to the problematic directory:

  [r...@fsnode01 ~]# debugfs.ocfs2 -R "findpath <8075816>" /dev/sdc1
  8075816	/storage/problematic/directory/

So I brought the cluster down and requested a filesystem check, which dumped a lot of messages like this:

  Cluster 1135086 is claimed by the following inodes:
    /storage/unrelated/file1
    /storage/unrelated/file2
  [DUP_CLUSTERS_CLONE] Inode "/storage/unrelated/file1" may be cloned or deleted to break the claim it has on its clusters. Clone inode "/storage/unrelated/file1" to break claims on clusters it shares with other inodes? y
  pass1d: Invalid argument passed to OCFS2 library while reading inode to clone

Note that last (pass1d) message. I've checked my tools, and although their versions don't match, they are the latest versions available:

  [r...@fsnode01 ~]# rpm -qa | grep ocfs
  ocfs2-tools-1.4.3-1.el5
  ocfs2-2.6.18-164.el5-1.4.4-1.el5
  ocfs2console-1.4.3-1.el5

Notice the kernel modules are 1.4.4 while the tools are 1.4.3. Could this version mismatch cause the pass1d error? Does it have any consequences? I've checked again; those were the only versions available.

I should mention that /storage/unrelated/* are all PDF files. Some of them are damaged, and using 'file -bi' I've tracked some of them down to a time interval between 'Jan 18 09:47' and 'Jan 18 12:24'. I could only track those files because 'file' reported a damaged PDF header; I can't be sure the remaining ones are all OK, only that their headers are OK. Also worth mentioning is that there are other files from that time interval that seem to be fine (again, I can't be sure).

I can't be certain when this mess started or when the cluster recovered from it. I'm almost sure the files were OK when they were "about to be" stored on /storage; this investigation suggests they were damaged *during* their existence on /storage. I've now taken appropriate measures to be able to prove this in the future.

What is puzzling me is:

* I now know there are corrupted files, but I don't even know how many there are. I only know there are at least as many as 'file' detected. Some of the files whose inodes fsck.ocfs2 tried to clone fall within the above time period, which suggests the cluster was somehow writing parts of different files to the same blocks. What could have caused this, and how do I avoid it happening again? (A rough sketch of how I've been looking for suspect files follows below.)
* Should I turn on tracing for a particular bit? Which one?
* How can I monitor OCFS2 health on a running cluster?
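For reference, this is roughly the kind of scan I mean when I say I tracked the damaged PDFs down with 'file -bi', together with a loop over the inode numbers from the kernel log resolved with the same debugfs.ocfs2 "findpath" call shown above. It is only a rough sketch, not exactly what I ran: the year in the timestamps is assumed (the window itself is the 'Jan 18 09:47' to 'Jan 18 12:24' one mentioned above), /storage, the .pdf extension and /dev/sdc1 are from my setup, and I'm using reference files with 'find -newer' because the findutils shipped with EL5 has no -newermt.

  #!/bin/sh
  # Rough sketch: list PDFs under /storage modified inside the suspect
  # window whose header 'file' no longer recognises as a PDF.
  # The year (2010) in the timestamps is an assumption -- adjust as needed.
  START=/tmp/window.start
  END=/tmp/window.end
  touch -t 201001180947 "$START"   # Jan 18 09:47
  touch -t 201001181224 "$END"     # Jan 18 12:24

  find /storage -type f -name '*.pdf' -newer "$START" ! -newer "$END" -print |
  while read -r f; do
      mime=$(file -bi "$f")
      case "$mime" in
          application/pdf*) : ;;              # header looks OK (says nothing about the rest)
          *) echo "SUSPECT: $f ($mime)" ;;
      esac
  done

  # Resolve the inode numbers the kernel complained about into paths,
  # using the same read-only debugfs.ocfs2 invocation shown earlier.
  grep 'ocfs2_check_dir_entry' /var/log/messages |
      sed -n 's/.*bad entry in directory #\([0-9]*\):.*/\1/p' | sort -u |
  while read -r ino; do
      debugfs.ocfs2 -R "findpath <$ino>" /dev/sdc1
  done

Of course this only catches files whose headers are visibly broken, which is exactly why I can't tell how many more are silently corrupted.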
Regards,
--
Nuno Tavares
DRI, Consultoria Informática
Telef: +351 936 184 086