Rob,
        What I see on the system is the following.
At about 07:58, one of the file servers reported an NMI and said it may be "dazed and confused", but that event did not cause a panic on the machine. About 90 minutes later, the pvfs2 client said

Feb 14 09:32:03 bglfen kernel: system.posix_acl_access
Feb 14 09:32:03 bglfen kernel: klogd 1.4.1, ---------- state change ----------
Feb 14 09:34:19 bglfen kernel: Badness in __remove_from_page_cache at mm/filemap.c:110
Feb 14 09:34:19 bglfen kernel: Call Trace:
Feb 14 09:34:19 bglfen kernel: [c000000119d93650] [c000000119d936e0] 0xc000000119d936e0 (unreliable)
Feb 14 09:34:19 bglfen kernel: [c000000119d936f0] [c00000000009b9d4] .remove_from_page_cache+0x78/0xbc
Feb 14 09:34:19 bglfen kernel: [c000000119d93780] [c0000000000a79fc] .truncate_complete_page+0x88/0x19c
Feb 14 09:34:19 bglfen kernel: [c000000119d93800] [c0000000000a7c2c] .truncate_inode_pages+0x11c/0x4bc
Feb 14 09:34:19 bglfen kernel: [c000000119d93950] [d000000000f9b148] .pvfs2_file_release+0x104/0x10c [pvfs2]
Feb 14 09:34:19 bglfen kernel: [c000000119d939e0] [c0000000000cde28] .__fput+0x204/0x26c
Feb 14 09:34:19 bglfen kernel: [c000000119d93a80] [c0000000000b429c] .exit_mmap+0x244/0x2c4
Feb 14 09:34:19 bglfen kernel: [c000000119d93b20] [c000000000060200] .mmput+0xb4/0x100
Feb 14 09:34:19 bglfen kernel: [c000000119d93bb0] [c000000000066868] .do_exit+0x378/0x1118
Feb 14 09:34:19 bglfen kernel: [c000000119d93c80] [c000000000067654] .do_group_exit+0x4c/0xf0
Feb 14 09:34:19 bglfen kernel: [c000000119d93d10] [c000000000010464] .ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:19 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:19 bglfen kernel:     LR = 0xff5664c

There were quite a few such messages with very similar call traces. After that, the call traces start pointing to other places in the kernel, e.g.,

Feb 14 09:34:20 bglfen kernel: Badness in page_remove_rmap at mm/objrmap.c:383
Feb 14 09:34:20 bglfen kernel: Call Trace:
Feb 14 09:34:20 bglfen kernel: [c000000074b23950] [c0000000000b1a70] .do_exit+0x378/0x1118
Feb 14 09:34:20 bglfen kernel: [c000000119d93c80] [c000000000067654] .unmap_vmas+0x67c/0x914 (unreliable)
Feb 14 09:34:20 bglfen kernel: [c000000074b23a80] [c0000000000b4114] .do_group_exit+0x4c/0xf0
Feb 14 09:34:20 bglfen kernel: [c000000119d93d10] [c000000000010464] .ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:20 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:20 bglfen kernel:     LR = 0xff5664c
Feb 14 09:34:20 bglfen kernel: Badness in __remove_from_page_cache at mm/filemap.c:110
Feb 14 09:34:20 bglfen kernel: Call Trace:
Feb 14 09:34:20 bglfen kernel: [c000000119d93650] [c00000000009b870] .exit_mmap+0xbc/0x2c4
Feb 14 09:34:20 bglfen kernel: [c000000074b23b20] [c000000000060200] .__remove_from_page_cache+0x74/0x160 (unreliable)
Feb 14 09:34:20 bglfen kernel: [c000000119d936f0] [c00000000009b9d4] .mmput+0xb4/0x100
Feb 14 09:34:20 bglfen kernel: [c000000074b23bb0] [c000000000066868] .do_exit+0x378/0x1118
Feb 14 09:34:20 bglfen kernel: [c000000074b23c80] [c000000000067654] .remove_from_page_cache+0x78/0xbc
Feb 14 09:34:20 bglfen kernel: [c000000119d93780] [c0000000000a79fc] .do_group_exit+0x4c/0xf0
Feb 14 09:34:20 bglfen kernel: [c000000074b23d10] [c000000000010464] .ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:20 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:20 bglfen kernel:     LR = 0xff5664c

Either before or after that, pvfs2-client.log contains these lines:

[E 23:50:20.804738] Object Type mismatch error: Bad file descriptor
[E 23:50:20.842452] getattr_object_getattr_failure : Bad file descriptor
[E 23:50:20.842488] pvfs2-client-core: caught signal 11
[E 23:51:20.800850] Object Type mismatch error: Bad file descriptor
[E 23:51:20.801025] getattr_object_getattr_failure : Bad file descriptor
[E 23:51:20.801054] pvfs2-client-core: caught signal 11
[E 23:52:20.800206] Object Type mismatch error: Bad file descriptor
[E 23:52:20.800354] getattr_object_getattr_failure : Bad file descriptor
[E 23:52:20.800382] pvfs2-client-core: caught signal 11
[E 23:53:20.800813] Object Type mismatch error: Bad file descriptor
[E 23:53:20.800966] getattr_object_getattr_failure : Bad file descriptor
[E 23:53:20.800992] pvfs2-client-core: caught signal 11
[E 23:54:20.800199] Object Type mismatch error: Bad file descriptor
[E 23:54:20.800353] getattr_object_getattr_failure : Bad file descriptor
[E 23:54:20.800379] pvfs2-client-core: caught signal 11
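
To see how regularly pvfs2-client-core is dying, the log can be scanned for the "caught signal 11" lines and the gaps between them measured. The following is only a rough sketch: the line format is assumed from the excerpt above, the function names are made up for illustration, and since the timestamps carry no date it does not handle a log that wraps past midnight.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   [E 23:50:20.842488] pvfs2-client-core: caught signal 11
LINE_RE = re.compile(r"^\[(?P<level>\w) (?P<ts>\d{2}:\d{2}:\d{2}\.\d{6})\] (?P<msg>.*)$")


def crash_times(lines, needle="caught signal 11"):
    """Return the timestamps of log lines whose message contains `needle`."""
    times = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and needle in m.group("msg"):
            times.append(datetime.strptime(m.group("ts"), "%H:%M:%S.%f"))
    return times


def intervals_seconds(times):
    """Seconds between consecutive events (assumes they fall on the same day)."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# The lines quoted above, used here as sample input.
sample = """\
[E 23:50:20.804738] Object Type mismatch error: Bad file descriptor
[E 23:50:20.842452] getattr_object_getattr_failure : Bad file descriptor
[E 23:50:20.842488] pvfs2-client-core: caught signal 11
[E 23:51:20.800850] Object Type mismatch error: Bad file descriptor
[E 23:51:20.801025] getattr_object_getattr_failure : Bad file descriptor
[E 23:51:20.801054] pvfs2-client-core: caught signal 11
[E 23:52:20.800206] Object Type mismatch error: Bad file descriptor
[E 23:52:20.800354] getattr_object_getattr_failure : Bad file descriptor
[E 23:52:20.800382] pvfs2-client-core: caught signal 11
[E 23:53:20.800813] Object Type mismatch error: Bad file descriptor
[E 23:53:20.800966] getattr_object_getattr_failure : Bad file descriptor
[E 23:53:20.800992] pvfs2-client-core: caught signal 11
[E 23:54:20.800199] Object Type mismatch error: Bad file descriptor
[E 23:54:20.800353] getattr_object_getattr_failure : Bad file descriptor
[E 23:54:20.800379] pvfs2-client-core: caught signal 11
""".splitlines()

times = crash_times(sample)
gaps = intervals_seconds(times)
print(len(times), "crashes; gaps in seconds:", [round(g, 3) for g in gaps])
```

On the excerpt above this finds five signal 11 events spaced almost exactly one minute apart. To run it on the real file, pass the lines of an opened pvfs2-client.log to crash_times() in place of the sample.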

--andrew

On Feb 14, 2006, at 5:09 PM, Robert Latham wrote:

On Tue, Feb 14, 2006 at 04:36:00PM -0500, Andrew Pochinsky wrote:
Hi,
        I see some strange behavior on a SLES9 client v 1.3.2 running on
power5. After some time the system log (dmesg) is full of messages like
the following:

Badness in page_remove_rmap at mm/objrmap.c:383
Call Trace:
[c000000072227950] [c0000000000b1a70] .unmap_vmas+0x67c/0x914 (unreliable)
[c000000072227a80] [c0000000000b4114] .exit_mmap+0xbc/0x2c4
[c000000072227b20] [c000000000060200] .mmput+0xb4/0x100
[c000000072227bb0] [c000000000066868] .do_exit+0x378/0x1118
[c000000072227c80] [c000000000067654] .do_group_exit+0x4c/0xf0
[c000000072227d10] [c000000000010464] .ret_from_syscall_1+0x0/0xa4
--- Exception: c00 at 0xfeb6718
    LR = 0xff5664c

The system shows very poor PVFS2 performance and pvfs2-client-core takes a lot of cycles. The file system is under moderate load at this time. I
don't know if the dmesg output is a cause or a result of the pvfs2 degradation.
Has anybody seen such behavior?

Hi Andrew

kernel comments say 'do_group_exit' is called by fatal signals or the
exit_group system call.   Can you check pvfs2-client.log?  If
something is causing pvfs2-client-core to die and get restarted often,
maybe it will show up in the log file?

==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B

