We managed to trigger another intriguing Lustre failure by simply
rebooting our two OSS's (which have two OST's each), and then mounting
the OST's on both OSS's in parallel (i.e. mount ost1 ; mount ost2 on
both OSS's at the same time). The filesystem was mounted by 191
clients, of which twenty-ish had been recently active. The interconnect
is standard TCP over GigE.
Both OSS's kept locking up intermittently after this, and in the end
we had to do a hard reboot. Taking it slow instead (mounting one OST,
waiting, mounting the next one, waiting again, and so on until all
four were mounted) resulted in a working Lustre backend.
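For illustration, the slow one-at-a-time mount could be scripted roughly
like this on each OSS. This is only a sketch: the device paths, mount
points and the 60 second pause are made-up placeholders (we did not time
our waits), and it assumes the OST's are mounted with
"mount -t lustre <device> <mountpoint>". It also only serializes the
mounts on one OSS; staggering the two OSS's against each other still has
to be done by hand.

```python
import subprocess
import time


def mount_osts_sequentially(osts, settle_seconds=60,
                            run=subprocess.check_call, sleep=time.sleep):
    """Mount each (device, mountpoint) pair in turn, pausing in between.

    `run` and `sleep` are injectable so the ordering can be checked
    without touching any real devices.
    """
    for index, (device, mountpoint) in enumerate(osts):
        # check_call raises CalledProcessError if mount fails, which
        # stops the loop before the next OST is attempted.
        run(["mount", "-t", "lustre", device, mountpoint])
        # No need to wait after the last OST.
        if index < len(osts) - 1:
            sleep(settle_seconds)


# Example call (hypothetical devices and mount points):
# mount_osts_sequentially([("/dev/sdb", "/mnt/ost1"),
#                          ("/dev/sdc", "/mnt/ost2")])
```

A fixed pause is the crudest possible choice; waiting until the log
messages about client reconnects and evictions die down would probably
be a better signal, but that is not something we have tried.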
The OSS's are dual CPU amd64 boxes.
I tried finding a corresponding issue in bugzilla, but I failed
miserably as usual. Thoughts and suggestions are welcome...
We got tons of soft lockups and call traces; here is the first one
from each OSS (with some added interesting log lines from the first OSS):
-------------------8<------------------------
[ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid
6802: it was inactive for 100s
[ 1385.082596] LustreError: 6877:0:(ldlm_lockd.c:214:waiting_locks_callback())
### lock callback timer expired: evicting client [EMAIL PROTECTED] nid [EMAIL
PROTECTED] ns: filter-hpfs-OST0000_UUID lock:
ffff810052c08780/0x3a2961c77b2b19c1 lrc: 2/0,0 mode: PW/PW res: 1416400/0 rrc:
2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags:
10020 remote: 0x171c36570b6bb69d expref: 82 pid: 6401
[ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@
type 4, status -5 [EMAIL PROTECTED] x38/t0 o104->@NET_0x2000082ef4e80_UUID:15
lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
[ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped
1158333 previous similar messages
[ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request())
@@@ network error (sent at 1190712998, 0s ago) [EMAIL PROTECTED] x177/t0
o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
[ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request())
Skipped 1959868 previous similar messages
[ 1395.921898] BUG: soft lockup detected on CPU#0!
[ 1395.926481]
[ 1395.926482] Call Trace:
[ 1395.930474] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1395.936758] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1395.942820] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1395.949332] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1395.955751] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1395.961811] <EOI> [<ffffffff803153b9>] number+0x109/0x270
[ 1395.967399] [<ffffffff80315ad1>] vsnprintf+0x5b1/0x630
[ 1395.972707] [<ffffffff883ad5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1395.979488] [<ffffffff8022b8e3>] __wake_up_common+0x43/0x80
[ 1395.985252] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.992056] [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.998859] [<ffffffff884d1962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1396.004963] [<ffffffff8853b7a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1396.011651] [<ffffffff8853ccbb>] :ksocklnd:ksocknal_send+0x2eb/0x330
[ 1396.018188] [<ffffffff884d5601>] :ptlrpc:request_out_callback+0xe1/0x1f0
[ 1396.025056] [<ffffffff883e0b5b>] :lnet:lnet_enq_event_locked+0x9b/0xf0
[ 1396.031747] [<ffffffff883e0f69>] :lnet:lnet_finalize+0x119/0x260
[ 1396.037914] [<ffffffff883e7ba4>] :lnet:lnet_send+0x9c4/0x9e0
[ 1396.043728] [<ffffffff883e3437>] :lnet:lnet_prep_send+0x67/0xb0
[ 1396.049807] [<ffffffff883e8963>] :lnet:LNetPut+0x6a3/0x740
[ 1396.055445] [<ffffffff883e2c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1396.061370] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.067643] [<ffffffff884cc9ce>] :ptlrpc:ptl_send_rpc+0xc7e/0xf70
[ 1396.073914] [<ffffffff884c555a>] :ptlrpc:ptlrpc_check_set+0x66a/0xb90
[ 1396.080527] [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
[ 1396.087674] [<ffffffff884c5c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1396.094191] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1396.099560] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.105652] [<ffffffff884980fb>]
:ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1396.113405] [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1396.120369] [<ffffffff884ac5db>]
:ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1396.127684] [<ffffffff884ad156>]
:ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1396.134986] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1396.139923] [<ffffffff80287ebe>] transfer_objects+0x4e/0x80
[ 1396.145663] [<ffffffff8849c22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1396.152382] [<ffffffff88499928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1396.159615] [<ffffffff8849a366>]
:ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1396.167454] [<ffffffff884b0d70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1396.173975] [<ffffffff884ad5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1396.181110] [<ffffffff886ceeec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1396.187913] [<ffffffff884af280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1396.194624] [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.200888] [<ffffffff884d09b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1396.207939] [<ffffffff884d3e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1396.214636] [<ffffffff8869d4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1396.220607] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1396.225979] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1396.231617] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1396.237559] [<ffffffff8843bd05>] :obdclass:class_handle2object+0xd5/0x160
[ 1396.244535] [<ffffffff884cd1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1396.251506] [<ffffffff884d1a65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1396.257870] [<ffffffff884d7bcd>]
:ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1396.265626] [<ffffffff884d951e>] :ptlrpc:ptlrpc_start_thread+0x95e/0xa00
[ 1396.272492] [<ffffffff80416ae0>] thread_return+0x0/0x100
[ 1396.277940] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1396.283579] [<ffffffff883adab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1396.289997] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1396.295627] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1396.300925] [<ffffffff884d9e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1396.307083] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.313161] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1396.318197] [<ffffffff884d95c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1396.324193] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------
-------------------8<------------------------
[ 1362.254118] BUG: soft lockup detected on CPU#1!
[ 1362.258808]
[ 1362.258809] Call Trace:
[ 1362.262940] <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1362.269408] [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1362.275742] [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1362.282462] [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1362.289117] [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1362.295395] <EOI> [<ffffffff8031588b>] vsnprintf+0x36b/0x630
[ 1362.301519] [<ffffffff883ab5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1362.308574] [<ffffffff885397a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1362.315501] [<ffffffff884cf962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1362.321764] [<ffffffff883e0c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1362.327918] [<ffffffff884ce89f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
[ 1362.334858] [<ffffffff884eeb3e>] :ptlrpc:ptlrpc_lprocfs_rpc_sent+0x1e/0x120
[ 1362.342266] [<ffffffff884ca9e1>] :ptlrpc:ptl_send_rpc+0xc91/0xf70
[ 1362.348687] [<ffffffff884c25ed>]
:ptlrpc:ptlrpc_expire_one_request+0xad/0x390
[ 1362.356283] [<ffffffff884bd37e>] :ptlrpc:ptlrpc_import_delay_req+0x22e/0x240
[ 1362.363771] [<ffffffff884c311a>] :ptlrpc:ptlrpc_check_set+0x22a/0xb90
[ 1362.370626] [<ffffffff884c3c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1362.377440] [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1362.382975] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.389284] [<ffffffff884960fb>]
:ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1362.397318] [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1362.404585] [<ffffffff884aa5db>]
:ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1362.412120] [<ffffffff884ab156>]
:ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1362.419684] [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1362.424810] [<ffffffff8849a22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1362.431730] [<ffffffff88497928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1362.439217] [<ffffffff88498366>]
:ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1362.447388] [<ffffffff884aed70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1362.454153] [<ffffffff884ab5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1362.461525] [<ffffffff886cceec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1362.468519] [<ffffffff884ad280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1362.475411] [<ffffffff884c910b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1362.481894] [<ffffffff884ce9b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1362.489175] [<ffffffff884d1e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1362.496054] [<ffffffff8869b4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1362.502251] [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1362.507825] [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1362.513661] [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1362.519778] [<ffffffff88439d05>] :obdclass:class_handle2object+0xd5/0x160
[ 1362.526982] [<ffffffff884cb1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1362.534215] [<ffffffff884cfa65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1362.540699] [<ffffffff884d5bcd>]
:ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1362.548745] [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1362.554566] [<ffffffff883abab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1362.561220] [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1362.566962] [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1362.572462] [<ffffffff884d7e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1362.578796] [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.585055] [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1362.590266] [<ffffffff884d75c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1362.596491] [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------
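In case anyone wants to compare the two traces without eyeballing them,
here is a small sketch that pulls the (module, symbol) pairs out of a
trace in the format shown above. The function names are my own
invention, and the regex is only written against the frame layout in
these two excerpts (including frames that got wrapped onto a second
line), so treat it as a starting point rather than a general parser.

```python
import re
from collections import Counter

# Matches frames such as
#   [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
#   [<ffffffff884980fb>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
# \s+ also spans a newline, so frames wrapped after the address match.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?::(\w+):)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frames(trace_text):
    """Return (module, symbol) pairs in trace order.

    Symbols without a :module: prefix are labelled "kernel".
    """
    return [(m.group(1) or "kernel", m.group(2))
            for m in FRAME_RE.finditer(trace_text)]


def frames_per_module(trace_text):
    """Count how many frames each module contributes to a trace."""
    return Counter(module for module, _ in frames(trace_text))


def shared_frames(trace_a, trace_b):
    """Frames of trace_a (in order) that also appear in trace_b."""
    in_b = set(frames(trace_b))
    return [frame for frame in frames(trace_a) if frame in in_b]
```

Feeding it the two excerpts above (saved to files) and printing
shared_frames() gives the frames the two OSS's have in common.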
/Nikke
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | [EMAIL PROTECTED]
---------------------------------------------------------------------------
FUNDAMENTALISM is never having to open your mind.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=