Rob,
What I see on the system is the following.
At about 07:58, one of the file servers reported an NMI and said it may
be "dazed and confused", but that event did not cause a panic on the
machine. About 90 minutes later, the pvfs2 client logged:
Feb 14 09:32:03 bglfen kernel: system.posix_acl_access
Feb 14 09:32:03 bglfen kernel: klogd 1.4.1, ---------- state change
----------
Feb 14 09:34:19 bglfen kernel: Badness in __remove_from_page_cache at
mm/filemap.c:110
Feb 14 09:34:19 bglfen kernel: Call Trace:
Feb 14 09:34:19 bglfen kernel: [c000000119d93650] [c000000119d936e0]
0xc000000119d936e0 (unreliable)
Feb 14 09:34:19 bglfen kernel: [c000000119d936f0] [c00000000009b9d4]
.remove_from_page_cache+0x78/0xbc
Feb 14 09:34:19 bglfen kernel: [c000000119d93780] [c0000000000a79fc]
.truncate_complete_page+0x88/0x19c
Feb 14 09:34:19 bglfen kernel: [c000000119d93800] [c0000000000a7c2c]
.truncate_inode_pages+0x11c/0x4bc
Feb 14 09:34:19 bglfen kernel: [c000000119d93950] [d000000000f9b148]
.pvfs2_file_release+0x104/0x10c [pvfs2]
Feb 14 09:34:19 bglfen kernel: [c000000119d939e0] [c0000000000cde28]
.__fput+0x204/0x26c
Feb 14 09:34:19 bglfen kernel: [c000000119d93a80] [c0000000000b429c]
.exit_mmap+0x244/0x2c4
Feb 14 09:34:19 bglfen kernel: [c000000119d93b20] [c000000000060200]
.mmput+0xb4/0x100
Feb 14 09:34:19 bglfen kernel: [c000000119d93bb0] [c000000000066868]
.do_exit+0x378/0x1118
Feb 14 09:34:19 bglfen kernel: [c000000119d93c80] [c000000000067654]
.do_group_exit+0x4c/0xf0
Feb 14 09:34:19 bglfen kernel: [c000000119d93d10] [c000000000010464]
.ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:19 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:19 bglfen kernel: LR = 0xff5664c
There were quite a few such messages with very similar call traces.
After that, the call traces start pointing to other places in the
kernel, e.g.,
Feb 14 09:34:20 bglfen kernel: Badness in page_remove_rmap at
mm/objrmap.c:383
Feb 14 09:34:20 bglfen kernel: Call Trace:
Feb 14 09:34:20 bglfen kernel: [c000000074b23950] [c0000000000b1a70]
.do_exit+0x378/0x1118
Feb 14 09:34:20 bglfen kernel: [c000000119d93c80] [c000000000067654]
.unmap_vmas+0x67c/0x914 (unreliable)
Feb 14 09:34:20 bglfen kernel: [c000000074b23a80] [c0000000000b4114]
.do_group_exit+0x4c/0xf0
Feb 14 09:34:20 bglfen kernel: [c000000119d93d10] [c000000000010464]
.ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:20 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:20 bglfen kernel: LR = 0xff5664c
Feb 14 09:34:20 bglfen kernel: Badness in __remove_from_page_cache at
mm/filemap.c:110
Feb 14 09:34:20 bglfen kernel: Call Trace:
Feb 14 09:34:20 bglfen kernel: [c000000119d93650] [c00000000009b870]
.exit_mmap+0xbc/0x2c4
Feb 14 09:34:20 bglfen kernel: [c000000074b23b20] [c000000000060200]
.__remove_from_page_cache+0x74/0x160 (unreliable)
Feb 14 09:34:20 bglfen kernel: [c000000119d936f0] [c00000000009b9d4]
.mmput+0xb4/0x100
Feb 14 09:34:20 bglfen kernel: [c000000074b23bb0] [c000000000066868]
.do_exit+0x378/0x1118
Feb 14 09:34:20 bglfen kernel: [c000000074b23c80] [c000000000067654]
.remove_from_page_cache+0x78/0xbc
Feb 14 09:34:20 bglfen kernel: [c000000119d93780] [c0000000000a79fc]
.do_group_exit+0x4c/0xf0
Feb 14 09:34:20 bglfen kernel: [c000000074b23d10] [c000000000010464]
.ret_from_syscall_1+0x0/0xa4
Feb 14 09:34:20 bglfen kernel: --- Exception: c00 at 0xfeb6718
Feb 14 09:34:20 bglfen kernel: LR = 0xff5664c
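To get a rough idea of how many of these warnings there are and where they come from, the "Badness in ... at file:line" headers can be tallied from the syslog file. This is only a minimal sketch: the function name tally_badness is made up for illustration, and it assumes each warning header sits on one line in the real log (the breaks above come from mail wrapping).

```python
import re
from collections import Counter

# Matches a warning header such as
# "Badness in page_remove_rmap at mm/objrmap.c:383"
BADNESS_RE = re.compile(r"Badness in (\S+) at (\S+):(\d+)")


def tally_badness(lines):
    """Count 'Badness' warnings per (function, source file, line number)."""
    counts = Counter()
    for line in lines:
        m = BADNESS_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2), int(m.group(3)))] += 1
    return counts


def report(path):
    """Print the tally for a syslog file, most frequent location first."""
    with open(path, errors="replace") as f:
        for (func, src, lineno), n in tally_badness(f).most_common():
            print(f"{n:6d}  {func} at {src}:{lineno}")
```

Calling report() with the path of the syslog file on bglfen (wherever klogd output ends up on that system) would show whether the filemap.c or the objrmap.c warning dominates.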
I cannot tell whether it was before or after the kernel messages above,
but pvfs2-client.log contains these lines:
[E 23:50:20.804738] Object Type mismatch error: Bad file descriptor
[E 23:50:20.842452] getattr_object_getattr_failure : Bad file descriptor
[E 23:50:20.842488] pvfs2-client-core: caught signal 11
[E 23:51:20.800850] Object Type mismatch error: Bad file descriptor
[E 23:51:20.801025] getattr_object_getattr_failure : Bad file descriptor
[E 23:51:20.801054] pvfs2-client-core: caught signal 11
[E 23:52:20.800206] Object Type mismatch error: Bad file descriptor
[E 23:52:20.800354] getattr_object_getattr_failure : Bad file descriptor
[E 23:52:20.800382] pvfs2-client-core: caught signal 11
[E 23:53:20.800813] Object Type mismatch error: Bad file descriptor
[E 23:53:20.800966] getattr_object_getattr_failure : Bad file descriptor
[E 23:53:20.800992] pvfs2-client-core: caught signal 11
[E 23:54:20.800199] Object Type mismatch error: Bad file descriptor
[E 23:54:20.800353] getattr_object_getattr_failure : Bad file descriptor
[E 23:54:20.800379] pvfs2-client-core: caught signal 11
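The spacing of the "caught signal" lines can be checked with a short script. This is a minimal sketch, assuming every line of pvfs2-client.log starts with a stamp of the form "[E HH:MM:SS.ffffff]" as in the excerpt above; the function name crash_intervals is made up for illustration, and because the stamp carries no date, a backwards jump is treated as a single rollover past midnight.

```python
import re
from datetime import datetime, timedelta

# "[E 23:50:20.842488] pvfs2-client-core: caught signal 11"
SIGNAL_RE = re.compile(
    r"^\[E (\d{2}:\d{2}:\d{2}\.\d{6})\] .*caught signal (\d+)"
)


def crash_intervals(lines):
    """Return the seconds between consecutive 'caught signal' lines."""
    times = []
    for line in lines:
        m = SIGNAL_RE.match(line)
        if m:
            times.append(datetime.strptime(m.group(1), "%H:%M:%S.%f"))
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            # The stamp has no date: assume one rollover past midnight.
            delta += timedelta(days=1)
        gaps.append(delta.total_seconds())
    return gaps
```

On the lines quoted above the gaps come out at about 60 seconds each, which suggests that pvfs2-client-core is crashing and coming back roughly once a minute.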
--andrew
On Feb 14, 2006, at 5:09 PM, Robert Latham wrote:
On Tue, Feb 14, 2006 at 04:36:00PM -0500, Andrew Pochinsky wrote:
Hi,
I see some strange behavior on a SLES9 client v 1.3.2 running on
POWER5. After some time, the system log (dmesg) is full of messages
like the following:
Badness in page_remove_rmap at mm/objrmap.c:383
Call Trace:
[c000000072227950] [c0000000000b1a70] .unmap_vmas+0x67c/0x914 (unreliable)
[c000000072227a80] [c0000000000b4114] .exit_mmap+0xbc/0x2c4
[c000000072227b20] [c000000000060200] .mmput+0xb4/0x100
[c000000072227bb0] [c000000000066868] .do_exit+0x378/0x1118
[c000000072227c80] [c000000000067654] .do_group_exit+0x4c/0xf0
[c000000072227d10] [c000000000010464] .ret_from_syscall_1+0x0/0xa4
--- Exception: c00 at 0xfeb6718
LR = 0xff5664c
The system shows very poor PVFS2 performance, and pvfs2-client-core
takes a lot of cycles. The file system is under moderate load at this
time. I don't know whether the dmesg messages are a cause or a result
of the pvfs2 degradation.
Has anybody seen such behavior?
Hi Andrew,
Kernel comments say 'do_group_exit' is called by fatal signals or the
exit_group system call. Can you check pvfs2-client.log? If
something is causing pvfs2-client-core to die and get restarted often,
maybe it will show up in the log file?
==rob
--
Rob Latham
Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-users mailing list
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users