We managed to trigger another intriguing Lustre failure by simply 
rebooting our two OSS's (which have two OST's each), and then mounting 
the OST's on both OSS's in parallel (i.e. mount ost1 ; mount ost2 on 
both OSS's at the same time). The filesystem was mounted by 191 
clients, of which twenty-ish had been recently active. The interconnect 
is standard TCP over GigE.

Both OSS's kept locking up intermittently after this, and it all 
ended with us having to do a hard reboot. Taking it slow and mounting 
one OST, waiting, mounting another, and waiting again until all four 
were mounted resulted in a working Lustre backend.
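
For illustration, the slow approach could be scripted roughly as below. This is only a sketch under assumptions: the device paths, mount points and the 60-second pause are hypothetical placeholders (we simply waited by hand between mounts), and the only real command used is the ordinary `mount -t lustre <device> <mountpoint>`.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple


def mount_osts_staggered(
    osts: Iterable[Tuple[str, str]],
    delay_s: float = 60.0,  # assumed pause; the right value is unknown
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    sleep: Callable[[float], object] = time.sleep,
) -> List[List[str]]:
    """Mount each (device, mountpoint) pair in order, pausing between mounts.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without touching the system.
    """
    pairs = list(osts)
    issued = []
    for i, (device, mountpoint) in enumerate(pairs):
        cmd = ["mount", "-t", "lustre", device, mountpoint]
        run(cmd)  # raises on failure when the default runner is used
        issued.append(cmd)
        if i < len(pairs) - 1:
            sleep(delay_s)  # give the OST time to settle before the next one
    return issued


if __name__ == "__main__":
    # Dry run with hypothetical devices: print the commands, skip the waits.
    mount_osts_staggered(
        [("/dev/sdb1", "/mnt/ost1"), ("/dev/sdc1", "/mnt/ost2")],
        run=lambda cmd: print(" ".join(cmd)),
        sleep=lambda s: None,
    )
```

Note that this only staggers the mounts on a single OSS. To mirror what we did by hand, the two servers would also have to take turns instead of running it at the same time.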

The OSS's are dual CPU amd64 boxes.

I tried finding a corresponding issue in bugzilla, but I failed 
miserably as usual. Thoughts and suggestions are welcome...

We got tons of soft lockups and call traces; here is the first one 
from each OSS (with some added interesting log lines from the first OSS):
-------------------8<------------------------
[ 1376.614228] Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6802: it was inactive for 100s

[ 1385.082596] LustreError: 6877:0:(ldlm_lockd.c:214:waiting_locks_callback()) ### lock callback timer expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED]  ns: filter-hpfs-OST0000_UUID lock: ffff810052c08780/0x3a2961c77b2b19c1 lrc: 2/0,0 mode: PW/PW res: 1416400/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 10020 remote: 0x171c36570b6bb69d expref: 82 pid: 6401
[ 1388.238268] LustreError: 6877:0:(events.c:55:request_out_callback()) @@@ type 4, status -5  [EMAIL PROTECTED] x38/t0 o104->@NET_0x2000082ef4e80_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
[ 1388.256244] LustreError: 6877:0:(events.c:55:request_out_callback()) Skipped 1158333 previous similar messages
[ 1394.804079] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) @@@ network error (sent at 1190712998, 0s ago)  [EMAIL PROTECTED] x177/t0 o104->@NET_0x2000082ef4e8e_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
[ 1394.825111] LustreError: 6731:0:(client.c:961:ptlrpc_expire_one_request()) Skipped 1959868 previous similar messages
[ 1395.921898] BUG: soft lockup detected on CPU#0!
[ 1395.926481] 
[ 1395.926482] Call Trace:
[ 1395.930474]  <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1395.936758]  [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1395.942820]  [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1395.949332]  [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1395.955751]  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1395.961811]  <EOI> [<ffffffff803153b9>] number+0x109/0x270
[ 1395.967399]  [<ffffffff80315ad1>] vsnprintf+0x5b1/0x630
[ 1395.972707]  [<ffffffff883ad5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1395.979488]  [<ffffffff8022b8e3>] __wake_up_common+0x43/0x80
[ 1395.985252]  [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.992056]  [<ffffffff884d5520>] :ptlrpc:request_out_callback+0x0/0x1f0
[ 1395.998859]  [<ffffffff884d1962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1396.004963]  [<ffffffff8853b7a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1396.011651]  [<ffffffff8853ccbb>] :ksocklnd:ksocknal_send+0x2eb/0x330
[ 1396.018188]  [<ffffffff884d5601>] :ptlrpc:request_out_callback+0xe1/0x1f0
[ 1396.025056]  [<ffffffff883e0b5b>] :lnet:lnet_enq_event_locked+0x9b/0xf0
[ 1396.031747]  [<ffffffff883e0f69>] :lnet:lnet_finalize+0x119/0x260
[ 1396.037914]  [<ffffffff883e7ba4>] :lnet:lnet_send+0x9c4/0x9e0
[ 1396.043728]  [<ffffffff883e3437>] :lnet:lnet_prep_send+0x67/0xb0
[ 1396.049807]  [<ffffffff883e8963>] :lnet:LNetPut+0x6a3/0x740
[ 1396.055445]  [<ffffffff883e2c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1396.061370]  [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.067643]  [<ffffffff884cc9ce>] :ptlrpc:ptl_send_rpc+0xc7e/0xf70
[ 1396.073914]  [<ffffffff884c555a>] :ptlrpc:ptlrpc_check_set+0x66a/0xb90
[ 1396.080527]  [<ffffffff884bf4a9>] :ptlrpc:ptlrpc_set_next_timeout+0x19/0x170
[ 1396.087674]  [<ffffffff884c5c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1396.094191]  [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1396.099560]  [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.105652]  [<ffffffff884980fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1396.113405]  [<ffffffff88499133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1396.120369]  [<ffffffff884ac5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1396.127684]  [<ffffffff884ad156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1396.134986]  [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1396.139923]  [<ffffffff80287ebe>] transfer_objects+0x4e/0x80
[ 1396.145663]  [<ffffffff8849c22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1396.152382]  [<ffffffff88499928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1396.159615]  [<ffffffff8849a366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1396.167454]  [<ffffffff884b0d70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1396.173975]  [<ffffffff884ad5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1396.181110]  [<ffffffff886ceeec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1396.187913]  [<ffffffff884af280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1396.194624]  [<ffffffff884cb10b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1396.200888]  [<ffffffff884d09b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1396.207939]  [<ffffffff884d3e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1396.214636]  [<ffffffff8869d4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1396.220607]  [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1396.225979]  [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1396.231617]  [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1396.237559]  [<ffffffff8843bd05>] :obdclass:class_handle2object+0xd5/0x160
[ 1396.244535]  [<ffffffff884cd1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1396.251506]  [<ffffffff884d1a65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1396.257870]  [<ffffffff884d7bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1396.265626]  [<ffffffff884d951e>] :ptlrpc:ptlrpc_start_thread+0x95e/0xa00
[ 1396.272492]  [<ffffffff80416ae0>] thread_return+0x0/0x100
[ 1396.277940]  [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1396.283579]  [<ffffffff883adab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1396.289997]  [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1396.295627]  [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1396.300925]  [<ffffffff884d9e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1396.307083]  [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1396.313161]  [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1396.318197]  [<ffffffff884d95c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1396.324193]  [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------

-------------------8<------------------------
[ 1362.254118] BUG: soft lockup detected on CPU#1!
[ 1362.258808] 
[ 1362.258809] Call Trace:
[ 1362.262940]  <IRQ> [<ffffffff80263e7c>] softlockup_tick+0xfc/0x120
[ 1362.269408]  [<ffffffff8023f1c7>] update_process_times+0x57/0x90
[ 1362.275742]  [<ffffffff8021a3e3>] smp_local_timer_interrupt+0x23/0x50
[ 1362.282462]  [<ffffffff8021acf1>] smp_apic_timer_interrupt+0x41/0x50
[ 1362.289117]  [<ffffffff8020a936>] apic_timer_interrupt+0x66/0x6c
[ 1362.295395]  <EOI> [<ffffffff8031588b>] vsnprintf+0x36b/0x630
[ 1362.301519]  [<ffffffff883ab5d4>] :libcfs:libcfs_debug_vmsg2+0x494/0x8e0
[ 1362.308574]  [<ffffffff885397a7>] :ksocklnd:ksocknal_free_tx+0x77/0x100
[ 1362.315501]  [<ffffffff884cf962>] :ptlrpc:_debug_req+0x3e2/0x420
[ 1362.321764]  [<ffffffff883e0c69>] :lnet:LNetMDBind+0x2b9/0x3a0
[ 1362.327918]  [<ffffffff884ce89f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
[ 1362.334858]  [<ffffffff884eeb3e>] :ptlrpc:ptlrpc_lprocfs_rpc_sent+0x1e/0x120
[ 1362.342266]  [<ffffffff884ca9e1>] :ptlrpc:ptl_send_rpc+0xc91/0xf70
[ 1362.348687]  [<ffffffff884c25ed>] :ptlrpc:ptlrpc_expire_one_request+0xad/0x390
[ 1362.356283]  [<ffffffff884bd37e>] :ptlrpc:ptlrpc_import_delay_req+0x22e/0x240
[ 1362.363771]  [<ffffffff884c311a>] :ptlrpc:ptlrpc_check_set+0x22a/0xb90
[ 1362.370626]  [<ffffffff884c3c30>] :ptlrpc:ptlrpc_set_wait+0x1b0/0x510
[ 1362.377440]  [<ffffffff804186b9>] _spin_lock_bh+0x9/0x20
[ 1362.382975]  [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.389284]  [<ffffffff884960fb>] :ptlrpc:ldlm_send_and_maybe_create_set+0x1b/0x200
[ 1362.397318]  [<ffffffff88497133>] :ptlrpc:ldlm_run_bl_ast_work+0x1f3/0x290
[ 1362.404585]  [<ffffffff884aa5db>] :ptlrpc:ldlm_extent_compat_queue+0x5ab/0x720
[ 1362.412120]  [<ffffffff884ab156>] :ptlrpc:ldlm_process_extent_lock+0x536/0x670
[ 1362.419684]  [<ffffffff80315c81>] sprintf+0x51/0x60
[ 1362.424810]  [<ffffffff8849a22f>] :ptlrpc:ldlm_lock_enqueue+0x4ff/0x5b0
[ 1362.431730]  [<ffffffff88497928>] :ptlrpc:ldlm_lock_remove_from_lru+0x78/0xf0
[ 1362.439217]  [<ffffffff88498366>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x46/0xa0
[ 1362.447388]  [<ffffffff884aed70>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
[ 1362.454153]  [<ffffffff884ab5d3>] :ptlrpc:ldlm_cli_enqueue_local+0x323/0x510
[ 1362.461525]  [<ffffffff886cceec>] :obdfilter:filter_destroy+0x5dc/0x1ed0
[ 1362.468519]  [<ffffffff884ad280>] :ptlrpc:ldlm_completion_ast+0x0/0x6e0
[ 1362.475411]  [<ffffffff884c910b>] :ptlrpc:ptl_send_buf+0x4ab/0x650
[ 1362.481894]  [<ffffffff884ce9b1>] :ptlrpc:lustre_msg_add_version+0x61/0x140
[ 1362.489175]  [<ffffffff884d1e48>] :ptlrpc:lustre_pack_reply+0xa28/0xb60
[ 1362.496054]  [<ffffffff8869b4a1>] :ost:ost_handle+0x1db1/0x579c
[ 1362.502251]  [<ffffffff8022cacc>] task_rq_lock+0x4c/0x90
[ 1362.507825]  [<ffffffff8022b4f6>] __activate_task+0x36/0x50
[ 1362.513661]  [<ffffffff8022e1b5>] wake_up_new_task+0x295/0x2b0
[ 1362.519778]  [<ffffffff88439d05>] :obdclass:class_handle2object+0xd5/0x160
[ 1362.526982]  [<ffffffff884cb1f0>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x70
[ 1362.534215]  [<ffffffff884cfa65>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
[ 1362.540699]  [<ffffffff884d5bcd>] :ptlrpc:ptlrpc_server_handle_request+0xc0d/0x13b0
[ 1362.548745]  [<ffffffff8020df8e>] do_gettimeofday+0x5e/0xb0
[ 1362.554566]  [<ffffffff883abab6>] :libcfs:lcw_update_time+0x16/0x100
[ 1362.561220]  [<ffffffff8023f2c9>] lock_timer_base+0x29/0x60
[ 1362.566962]  [<ffffffff8023f7b0>] __mod_timer+0xc0/0xf0
[ 1362.572462]  [<ffffffff884d7e2b>] :ptlrpc:ptlrpc_main+0x86b/0x9f0
[ 1362.578796]  [<ffffffff8022f450>] default_wake_function+0x0/0x10
[ 1362.585055]  [<ffffffff8020ac4c>] child_rip+0xa/0x12
[ 1362.590266]  [<ffffffff884d75c0>] :ptlrpc:ptlrpc_main+0x0/0x9f0
[ 1362.596491]  [<ffffffff8020ac42>] child_rip+0x0/0x12
-------------------8<------------------------
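
For anyone wanting to size up logs like these quickly, here is a rough Python sketch that counts soft lockups per CPU and adds up the "Skipped N previous similar messages" totals per function. The regular expressions are assumptions derived only from the lines quoted above and may not match other kernel or Lustre versions. They tolerate the mail-wrapped line breaks.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# Patterns inferred from the excerpts in this mail only (assumption).
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
SKIPPED_RE = re.compile(
    r"\(\w+\.c:\d+:(\w+)\(\)\)\s+Skipped\s+(\d+)\s+previous\s+similar\s+messages"
)


def summarize_log(text: str) -> Tuple[Counter, Dict[str, int]]:
    """Return (soft lockups per CPU, suppressed message count per function)."""
    lockups = Counter(int(cpu) for cpu in LOCKUP_RE.findall(text))
    skipped: Dict[str, int] = {}
    for func, count in SKIPPED_RE.findall(text):
        skipped[func] = skipped.get(func, 0) + int(count)
    return lockups, skipped


if __name__ == "__main__":
    import sys

    # Reads a kernel log from stdin, e.g. the output of dmesg.
    lockups, skipped = summarize_log(sys.stdin.read())
    for cpu, n in sorted(lockups.items()):
        print(f"CPU#{cpu}: {n} soft lockup(s)")
    for func, n in sorted(skipped.items()):
        print(f"{func}: {n} suppressed messages")
```

Run on the first excerpt above, it would report over a million suppressed messages each for request_out_callback and ptlrpc_expire_one_request, which gives a feel for how heavily the servers were logging while locked up.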


/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se     |    [EMAIL PROTECTED]
---------------------------------------------------------------------------
  FUNDAMENTALISM is never having to open your mind.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
