Dear All,

We are experiencing frequent OSS crashes.
We have 4 OSSs, each serving 6 OSTs to 600 clients. We observe random OSS crashes every 1-2 days. The console output captured during a crash is below. Does it look familiar to any of you? We have seen the same crashes with Lustre 1.6.2.



Nov 18 15:17:21 storage08 heartbeat: [25566]: info: Checking status of STONITH

Nov 18 15:17:21 storage08 heartbeat: [24250]: info: Exiting STONITH-stat process

Kernel BUG at mballoc:3352

invalid operand: 0000 [1] SMP

CPU 0

Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) sg(U) ksocklnd(U) lnet(U) libcfs(U) cxgb3(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5(U) ipv6(U) autofs4(U) i2c_nforce2(U) i2c_amd756(U) i2c_isa(U) i2c_amd8111(U) i2c_i801(U) i2c_core(U) mptctl(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) qla2400(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) ext3(U) jbd(U) xfs(U) tg3(U) s2io(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) mptsas(U) mptscsi(U) mptbase(U) megaraid_sas(U) e1000(U) bnx2(U) sd_mod(U)

Pid: 9070, comm: ll_ost_io_151 Tainted: GF 2.6.9-55.0.9.EL_lustre.1.6.3smp

RIP: 0010:[<ffffffffa05e2923>] <ffffffffa05e2923> {:ldiskfs:ldiskfs_mb_generate_from_pa+179}

RSP: 0018:00000100c9721268  EFLAGS: 00010297

RAX: 0000000000002177 RBX: 0000000000000000 RCX: 00000100c9721288

RDX: 0000000000000000 RSI: 0000000000002178 RDI: 0000010077ce42b0

RBP: 0000010077ce4290 R08: 00000100c9721280 R09: 01ff80000007c008

R10: 0000080000000000 R11: ffffffffffffffff R12: 0000010077ce42b0

R13: 000001007fb09000 R14: 0000000000000000 R15: 00000100ad763c28

FS: 0000002a95565b00(0000) GS:ffffffff804a6700(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 0000002a984c80e8 CR3: 0000000000101000 CR4: 00000000000006e0

Process ll_ost_io_151 (pid: 9070, threadinfo 00000100c9720000, task 00000100c96f1800)

Stack: 0000000000001000 0000000000002177 00000100b5196400 0000000000002178

0000000000000000 0000000000002177 00000100a56d6ee0 0000000000000000

       0000000000002177 0000000000000000

Call Trace:<ffffffffa05e310a>{:ldiskfs:ldiskfs_mb_init_cache+1898}

       <ffffffffa05e3340>{:ldiskfs:ldiskfs_mb_load_buddy+304}

       <ffffffffa05e96e2>{:ldiskfs:ldiskfs_mb_free_blocks+626}

       <ffffffffa0180920>{:jbd:journal_get_write_access+48}

<ffffffff801589d9>{find_get_page+65} <ffffffff801798e7>{__find_get_block_slow+62}

<ffffffff8017a097>{__find_get_block+162} <ffffffffa0180920>{:jbd:journal_get_write_access+48}

       <ffffffffa05c9933>{:ldiskfs:ldiskfs_free_blocks+163}

       <ffffffffa05e165a>{:ldiskfs:ldiskfs_remove_blocks+282}

       <ffffffffa05e0ff4>{:ldiskfs:ldiskfs_ext_remove_space+1508}

       <ffffffffa05ce27c>{:ldiskfs:ldiskfs_mark_inode_dirty+76}

       <ffffffffa05e1f80>{:ldiskfs:ldiskfs_ext_truncate+368}

<ffffffffa05cfcb5>{:ldiskfs:ldiskfs_truncate+309} <ffffffff80167df9>{unmap_mapping_range+339}

       <ffffffffa05ce11a>{:ldiskfs:ldiskfs_mark_iloc_dirty+1034}

<ffffffff80167ea4>{vmtruncate+162} <ffffffff80191c88>{inode_setattr+41}

<ffffffffa05cf5bc>{:ldiskfs:ldiskfs_setattr+444} <ffffffffa062ae72>{:fsfilt_ldiskfs:fsfilt_ldiskfs_setattr+386}

       <ffffffffa064af7b>{:obdfilter:filter_destroy+3131}

<ffffffffa0456da0>{:ptlrpc:ldlm_completion_ast+0} <ffffffff802f069d>{tcp_rcv_established+2099}

       <ffffffffa047bd83>{:ptlrpc:lustre_msg_add_version+83}

       <ffffffffa047d205>{:ptlrpc:lustre_msg_check_version+69}

<ffffffffa061a25d>{:ost:ost_handle+6397} <ffffffff802dfc76>{ip_rcv+1046}

<ffffffff802c6861>{netif_receive_skb+791} <ffffffffa031a9ba>{:cxgb3:lro_flush_session+154}

       <ffffffffa035fb58>{:lnet:lnet_match_blocked_msg+920}

       <ffffffffa0485b4c>{:ptlrpc:ptlrpc_server_handle_request+3036}

<ffffffffa033cbae>{:libcfs:lcw_update_time+30} <ffffffff8013f448>{__mod_timer+293}

<ffffffffa04881d8>{:ptlrpc:ptlrpc_main+2504} <ffffffff80133566>{default_wake_function+0}

<ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0}

<ffffffff80110de3>{child_rip+8} <ffffffffa0487810>{:ptlrpc:ptlrpc_main+0}

       <ffffffff80110ddb>{child_rip+0}


Code: 0f 0b d2 bb 5e a0 ff ff ff ff 18 0d 90 8b 4c 24 20 8d 34 0b

RIP <ffffffffa05e2923>{:ldiskfs:ldiskfs_mb_generate_from_pa+179} RSP <00000100c9721268>

 <0>Kernel panic - not syncing: Oops


Best regards

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
