Hi Andreas,

Thanks. Disabling debug on the clients made a noticeable difference for small-file operations.

There are quite a lot of Lustre-related errors in the logs, but the file system is working fine.


Now I am seeing many of the following error messages and stack dumps in the logs, as well as many files
created as /tmp/lustre.log.-----

   Could disabling debug on a running client cause this?

Thanks very much


Balagopal



LustreError: 3396:0:(osc_create.c:82:osc_interpret_create()) @@@ unknown rc -11 from async create: failing oscc [EMAIL PROTECTED] x12281982/t0 o5->[EMAIL PROTECTED]@tcp:28 lens 336/336 ref 1 fl Complete:R/2/0 rc -11/-11
Lustre: scratch-OST0000-osc: Connection to service scratch-OST0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 3397:0:(client.c:574:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -16 [EMAIL PROTECTED] x12282014/t0 o8->[EMAIL PROTECTED]@tcp:28 lens 304/328 ref 1 fl Rpc:R/0/0 rc 0/-16
Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 3563: it was inactive for 100s
Lustre: 0:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 3563
ll_mdt_17     S 0000000000000000     0  3563      1          3564  3562 (L-TLB)
0000010074ec3238 0000000000000046 0000010077ff8800 0000010000000073
     0000010074ec31c8 00000000801331f6 0000010001029aa0 0000000100000008
     000001007ce15030 000000000000009d
Call Trace:<ffffffffa049e528>{:osc:oscc_has_objects+259} <ffffffffa049e9a6>{:osc:oscc_precreate+580} <ffffffff801331a5>{default_wake_function+0} <ffffffffa049f86b>{:osc:osc_create+3297} <ffffffffa0364a76>{:lov:qos_prep_create+3728} <ffffffffa035bbe4>{:lov:lov_prep_create_set+541} <ffffffffa0348d59>{:lov:lov_create+6516} <ffffffffa03dd4af>{:ldiskfs:ldiskfs_xattr_ibody_get+403}
     <ffffffffa0429f3d>{:fsfilt_ldiskfs:fsfilt_ldiskfs_get_md+101}
<ffffffffa04415c1>{:mds:mds_get_md+105} <ffffffffa046155b>{:mds:mds_create_objects+3818} <ffffffffa04421c2>{:mds:mds_pack_md+409} <ffffffffa04631aa>{:mds:mds_finish_open+704} <ffffffffa0465625>{:mds:mds_open+6486} <ffffffffa0459c91>{:mds:mds_reint_rec+373} <ffffffffa0443e26>{:mds:mds_reint+637} <ffffffffa044c22a>{:mds:mds_intent_policy+890}
     <ffffffffa0278eee>{:ptlrpc:ldlm_resource_putref+356}
<ffffffffa0274ca2>{:ptlrpc:ldlm_lock_create+1375} <ffffffffa0275d85>{:ptlrpc:ldlm_lock_enqueue+208}
     <ffffffffa02887d7>{:ptlrpc:ldlm_handle_enqueue+2524}
     <ffffffffa02871ba>{:ptlrpc:ldlm_server_blocking_ast+0}
     <ffffffffa028772d>{:ptlrpc:ldlm_server_completion_ast+0}
<ffffffffa04483a9>{:mds:mds_handle+15254} <ffffffffa01b3f69>{:libcfs:libcfs_debug_vmsg2+1713}
     <ffffffff801e9ce7>{vsnprintf+1406} <ffffffff801e9dca>{snprintf+131}
     <ffffffffa02a1e49>{:ptlrpc:ptlrpc_server_handle_request+2514}
     <ffffffffa02a1e57>{:ptlrpc:ptlrpc_server_handle_request+2528}
<ffffffff8013f100>{__mod_timer+293} <ffffffffa02a3d1b>{:ptlrpc:ptlrpc_main+2018} <ffffffff801331a5>{default_wake_function+0} <ffffffffa02a297a>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa02a297a>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff80110e23>{child_rip+8} <ffffffffa02a3539>{:ptlrpc:ptlrpc_main+0} <ffffffff80110e1b>{child_rip+0}
LustreError: dumping log to /tmp/lustre-log.1182448666.3563
Lustre: scratch-OST0000-osc: Connection restored to service scratch-OST0000 using nid [EMAIL PROTECTED]
Lustre: scratch-OST0000: received MDS connection from [EMAIL PROTECTED]
Lustre: 3650:0:(recov_thread.c:565:llog_repl_connect()) llcd 000001001d47d000:000001007453e180 not empty
Lustre: MDS scratch-MDT0000: scratch-OST0000_UUID now active, resetting orphans
LustreError: 3479:0:(llog_server.c:435:llog_origin_handle_cancel()) cancel 2 llog-records failed: -22
LustreError: 3705:0:(recov_thread.c:396:log_commit_thread()) commit 000001001d47d000:000001007453e180 drop 2 cookies: rc -22
LustreError: 3563:0:(service.c:668:ptlrpc_server_handle_request()) request 13132122 opc 101 from [EMAIL PROTECTED] processed in 101s trans 38707883 rc 301/301
Lustre: 3563:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 3563 disabled after 101.9196s
LustreError: 0:0:(ldlm_lockd.c:213:waiting_locks_callback()) ### lock callback timer expired: evicting client [EMAIL PROTECTED] nid [EMAIL PROTECTED] ns: filter-scratch-OST0000_UUID lock: 0000010070c69680/0x24c4cddb1a7d1a3a lrc: 2/0,0 mode: PR/PR res: 5587403/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 10020 remote: 0x442628bd10ee4706 expref: 577 pid: 3637
Lustre: 3568:0:(mds_reint.c:124:mds_finish_transno()) commit transaction for disconnected client ae77a3e4-6d9d-217a-6033-fefb69622a73: rc 0
LustreError: 3687:0:(lustre_fsfilt.h:205:fsfilt_start_log()) scratch-OST0000: slow journal start 62s
LustreError: 3669:0:(ldlm_lib.c:1363:target_send_reply_msg()) @@@ processing error (-107) [EMAIL PROTECTED] x13248558/t0 o6-><?>@<?>:-1 lens 336/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 3669:0:(ldlm_lib.c:1363:target_send_reply_msg()) Skipped 2 previous similar messages
LustreError: 3683:0:(filter.c:1139:filter_parent_lock()) scratch-OST0000: slow parent lock 62s
LustreError: 4950:0:(mds_open.c:1461:mds_close()) @@@ no handle for file close ino 29157575: cookie 0x24c4cddb1a00178a [EMAIL PROTECTED] x13280339/t0 o35->[EMAIL PROTECTED]:-1 lens 296/392 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 4950:0:(mds_open.c:1461:mds_close()) Skipped 1 previous similar message
Lustre: 6370:0:(lustre_fsfilt.h:205:fsfilt_start_log()) scratch-OST0000: slow journal start 36s
LustreError: 6423:0:(lustre_fsfilt.h:205:fsfilt_start_log()) scratch-OST0000: slow journal start 80s
LustreError: 6423:0:(lustre_fsfilt.h:205:fsfilt_start_log()) Skipped 79 previous similar messages
LustreError: 4883:0:(lustre_fsfilt.h:240:fsfilt_brw_start_log()) scratch-OST0000: slow journal start 65s
LustreError: 4909:0:(filter_io_26.c:680:filter_commitrw_write()) scratch-OST0000: slow brw_start 65s
LustreError: 4909:0:(filter_io_26.c:680:filter_commitrw_write()) Skipped 7 previous similar messages
LustreError: 4883:0:(lustre_fsfilt.h:240:fsfilt_brw_start_log()) Skipped 11 previous similar messages
LustreError: 4877:0:(filter_io_26.c:732:filter_commitrw_write()) scratch-OST0000: slow direct_io 65s
LustreError: 4877:0:(filter_io_26.c:732:filter_commitrw_write()) Skipped 2 previous similar messages
LustreError: 4887:0:(lustre_fsfilt.h:240:fsfilt_brw_start_log()) scratch-OST0000: slow journal start 66s
LustreError: 4887:0:(filter_io_26.c:680:filter_commitrw_write()) scratch-OST0000: slow brw_start 66s
LustreError: 4887:0:(filter_io_26.c:680:filter_commitrw_write()) Skipped 4 previous similar messages
LustreError: 4909:0:(filter_io_26.c:745:filter_commitrw_write()) scratch-OST0000: slow commitrw com
Andreas Dilger wrote:
On Jun 21, 2007  10:20 -0300, Balagopal Pillai wrote:
I am using Lustre 1.6.0.1 with one OST and 20 clients in an HPC cluster.
The OST/MDT/MGS node has a 16-channel 3ware 9650 controller using RAID-6.

*Now the problem: Lustre performance dips a lot when it comes to small files. Please see the following "fileop -f 5" test comparing NFS and Lustre -

We have had reports of read performance problems with 3ware in the past.

Lustre -
Fileop: File size is 1, Output is in Ops/sec. (A=Avg, B=Best, W=Worst)
 .      mkdir  rmdir  create   read  write  close  stat  access  chmod  readdir  link  unlink  delete  Total_files
A  5     1654    691     132  14228    719   4874  1987   32737   1718     2506  1262    1340    1608          125


NFS -
Fileop:  File size is 1,  Output is in Ops/sec. (A=Avg, B=Best, W=Worst)
 .      mkdir  rmdir  create    read   write  close  stat  access  chmod  readdir  link  unlink  delete  Total_files
A  5      177    594     459  380747  137392   2282  1219  444312    502     1274   306     513     464          125

Could you please recommend any tunables to get a bit more performance out of Lustre with lots of small files? Lots of small files was a problem in GFS too, though it was still better than NFS.

Try setting the debug level on the clients to zero: "sysctl -w lnet.debug=0".
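As a minimal sketch, to do this on all 20 clients at once (assuming pdsh is installed and a hypothetical hostfile "clients" lists them; a plain ssh loop works just as well):

    # turn off Lustre/LNET debug logging on every client right away
    pdsh -w ^clients 'sysctl -w lnet.debug=0'
    # to keep it off across reboots, append the setting to /etc/sysctl.conf on each client
    # (the lnet module must already be loaded when sysctl.conf is applied for this to stick)
    pdsh -w ^clients 'echo "lnet.debug = 0" >> /etc/sysctl.conf'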

*Also, the read performance of Lustre seems to be a little behind NFS. In the new setup I moved /opt, which holds all the software for users, to Lustre. But software like Matlab, Splus, etc. takes almost a minute to come up. The second time it is very fast, maybe due to caching. So I am thinking of putting /opt
back on NFS. Is it possible to boost the read performance of Lustre a bit?

Turning off debug is the easiest.
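For instance, a trivial check to confirm on a client that the setting took effect:

    sysctl lnet.debug    # should report "lnet.debug = 0" once debug is disabled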

*One last question. In the older Lustre setup (version 1.4.5), I have 5 SCSI drives, each one an OST, making up a single volume. The volume became full, but df still reported that there was 27GB free. There doesn't seem
to be an "lfs df" option in that version of Lustre, so I couldn't see the
individual utilization of each of the 5 OSTs. Is this a striping problem?

In the absence of "lfs df" use /proc to find per-OST usage on the client:

grep '[0-9]' /proc/fs/lustre/osc/*/kbytes*
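As a rough sketch, that can be turned into a per-OST utilization report (assuming the usual kbytestotal/kbytesfree files are present under each osc directory, as they are in 1.4 and 1.6):

    # print % used for each OST as seen from this client
    for d in /proc/fs/lustre/osc/*; do
        [ -f "$d/kbytestotal" ] || continue   # skip entries that are not OSC device dirs
        total=$(cat "$d/kbytestotal"); free=$(cat "$d/kbytesfree")
        echo "$(basename "$d"): $(( (total - free) * 100 / total ))% used"
    done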

Lustre 1.6 should be better at balancing the OST usage than 1.4.
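If one full OST does turn out to be the issue, it may also be worth checking the default stripe count on the busiest directories; a sketch using the positional "lfs setstripe" arguments from that era (the path is only an example, and the syntax should be double-checked against "lfs help" on your version):

    lfs getstripe /mnt/lustre/somedir          # show current striping for the directory and its files
    lfs setstripe /mnt/lustre/somedir 0 -1 -1  # default stripe size, any starting OST, stripe new files over all OSTs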

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
