Hi,

On Thu, Mar 13, 2008 at 03:29:29AM -0600, Andreas Dilger wrote:
> On Mar 13, 2008 10:15 +0100, Frank Mietke wrote:
> > we're using Lustre-1.6.4.2 and now one of our OSS (comprising two OSTs)
> > shows the status "not healthy".
> >
> > dmesg tells the following:
> > ...
> > [3082673.456429] LustreError: 16561:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> >
> > I've found that it seems to be the error EROFS. The documentation states
> > that I have to restart Lustre services. Is it enough to umount / mount
> > both OSTs on this OSS or do I have to umount everything (MDS/OSS)?
> > Anything else to care about?
>
> You should investigate in your /var/log/messages why this happened. It
> is usually a sign of filesystem corruption or disk errors, so you would
> likely also need to run e2fsck before remounting the filesystem.

Okay, I've found the following in /var/log/messages, logged before the bulk
of the messages above appear. It seems that something went wrong with the
RAID. Any hints?

Mar 13 05:50:37 chic2e24 kernel: [3067020.190468] LustreError: 4574:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 116733: rc -2
Mar 13 05:50:37 chic2e24 kernel: [3067020.190907] LustreError: 4574:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 1 previous similar message
Mar 13 05:50:57 chic2e24 kernel: [3067040.964208] LustreError: 4598:0:(ldlm_resource.c:719:ldlm_resource_add()) lvbo_init failed for resource 10518: rc -2
Mar 13 05:50:57 chic2e24 kernel: [3067040.964652] LustreError: 4598:0:(ldlm_resource.c:719:ldlm_resource_add()) Skipped 2 previous similar messages
Mar 13 06:17:31 chic2e24 kernel: [3068633.701448] attempt to access beyond end of device
Mar 13 06:17:31 chic2e24 kernel: [3068633.701454] sda: rw=1, want=11287722456, limit=7796867072
Mar 13 06:17:31 chic2e24 kernel: [3068633.701555] attempt to access beyond end of device
Mar 13 06:17:31 chic2e24 kernel: [3068633.701558] sda: rw=1, want=25366292592, limit=7796867072
Mar 13 06:17:31 chic2e24 kernel: [3068633.701562] Buffer I/O error on device sda, logical block 3170786573
Mar 13 06:17:31 chic2e24 kernel: [3068633.701785] lost page write due to I/O error on sda
Mar 13 06:17:31 chic2e24 kernel: [3068633.702004] Aborting journal on device sda.
Mar 13 06:17:31 chic2e24 kernel: [3068633.702226] LustreError: 4493:0:(obd.h:1038:obd_transno_commit_cb()) chicfs-OST0010: transno 6510615555435490347 commit error: 2
Mar 13 06:17:31 chic2e24 kernel: [3068633.702933] LDISKFS-fs error (device sda) in ldiskfs_reserve_inode_write: Journal has aborted
Mar 13 06:17:31 chic2e24 kernel: [3068633.703587] Remounting filesystem read-only
Mar 13 06:17:31 chic2e24 kernel: [3068633.704001] journal commit I/O error
Mar 13 06:17:31 chic2e24 kernel: [3068633.704981] LDISKFS-fs error (device sda) in ldiskfs_dirty_inode: Journal has aborted
Mar 13 06:17:31 chic2e24 kernel: [3068633.705034] LustreError: 5887:0:(filter_io_26.c:767:filter_commitrw_write()) Failure to commit OST transaction (-5)?
Mar 13 06:17:31 chic2e24 kernel: [3068633.706134] LustreError: 4662:0:(fsfilt-ldiskfs.c:1318:fsfilt_ldiskfs_write_record()) can't start transaction for 37 blocks (128 bytes)
Mar 13 06:17:31 chic2e24 kernel: [3068633.706718] LustreError: 4662:0:(filter.c:139:filter_finish_transno()) wrote trans 6510615555435490348 for client 67e1aea3-f93a-affd-b39d-eefa306ae345 at #212: err = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.707570] LustreError: 4662:0:(filter_io_26.c:566:filter_direct_io()) can't close transaction: -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708153] LustreError: 4662:0:(fsfilt-ldiskfs.c:483:fsfilt_ldiskfs_commit_async()) error while stopping transaction: -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708735] LustreError: 4662:0:(filter_io_26.c:767:filter_commitrw_write()) Failure to commit OST transaction (-5)?
Mar 13 06:17:31 chic2e24 kernel: [3068633.708875] LustreError: 16324:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 530 credits: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708881] LustreError: 16324:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.708976] LustreError: 4776:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.709006] LustreError: 4742:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.711072] LustreError: 4493:0:(obd.h:1038:obd_transno_commit_cb()) chicfs-OST0010: transno 6510615555435490348 commit error: 2
Mar 13 06:17:31 chic2e24 kernel: [3068633.711100] LustreError: 16385:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 530 credits: rc = -30
Mar 13 06:17:31 chic2e24 kernel: [3068633.711105] LustreError: 16385:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 2 previous similar messages
Mar 13 06:17:31 chic2e24 kernel: [3068633.711110] LustreError: 16385:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30

Best Regards,
Frank

>
> Doing the unmount/mount of just the OSTs should be enough
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>

--
Dipl.-Inf. Frank Mietke      | Fakultätsrechen- und Informationszentrum
Tel.: 0371 - 531 - 35538     | Fak. für Informatik
Fax: 0371 - 531 8 35538      | TU-Chemnitz
Key-ID: 60F59599             | [EMAIL PROTECTED]
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
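
For anyone reading the log excerpt: the rc values (-2, -5, -30) look like negated Linux errno codes, and the "attempt to access beyond end of device" lines report a requested end position (want) against the device size (limit). Below is a minimal Python sketch for decoding both. It is an illustration only, not a Lustre tool: the function names describe_rc and overshoot are hypothetical, and it assumes the want/limit values are counted in 512-byte sectors, which is the usual unit for this kernel message but is not stated in the log itself.

```python
import errno
import os
import re

# Assumption: want/limit in the kernel message are 512-byte sectors.
SECTOR_BYTES = 512

# Matches the "want=..., limit=..." part of the kernel message.
BEYOND_END = re.compile(r"want=(\d+), limit=(\d+)")


def describe_rc(rc):
    """Map a return code such as -30 to its errno name and description."""
    code = abs(rc)  # the log shows both negated (-30) and plain (2) values
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def overshoot(line):
    """Return (want, limit, bytes_past_end) for a want/limit line, else None."""
    match = BEYOND_END.search(line)
    if match is None:
        return None
    want, limit = int(match.group(1)), int(match.group(2))
    return want, limit, (want - limit) * SECTOR_BYTES


if __name__ == "__main__":
    for rc in (-2, -5, -30):
        print(rc, describe_rc(rc))
    sample = "sda: rw=1, want=11287722456, limit=7796867072"
    print(overshoot(sample))
```

Under that sector assumption, both want values in the excerpt lie well beyond the end of sda, so the writes were addressed to blocks the device does not have. That would fit either corrupted filesystem metadata or a device that now reports a smaller size than the filesystem expects; the script cannot tell which.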
