Hello,

we see the messages below on our OSTs under heavy _read_ load
(for example, when 60+ jobs try to read data at approximately the same time).
The OSTs freeze, and even console output drops to a few bytes per minute.
After some time the OSTs recover.

How should we interpret these messages to get a clue where to look further?

1) Is the underlying RAID too slow to handle the number of requests?
1a) Is the MDS too slow to keep up?
2) Is it related to network issues (network too slow or too busy)?
3) Would this be cured by upgrading to 1.8.x?

Status -5 looks to us like I/O errors.
~ # grep  '5' /usr/include/*/*errno*
/usr/include/asm-generic/errno-base.h:#define   EIO              5      /* I/O error */
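
The same mapping can be cross-checked without the kernel headers. This is only a minimal sketch using Python's standard errno module; the helper name describe_status is made up for illustration, and it assumes the status in the log is a negated Linux errno value.

```python
import errno
import os

def describe_status(status):
    """Map a negative kernel-style status (e.g. -5) to its errno name and text."""
    code = abs(status)  # assumption: the log reports -errno
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# On Linux, status -5 should come back as EIO with the platform's I/O error text.
print(describe_status(-5))
```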

But together with 3ware support we are fairly sure we have replaced all
suspect disks, and data transfer looks OK when the arrays are not used by
Lustre (i.e. verifying reaches 30-90 MB/s per disk).

OSTs with 2 GB RAM on the 16-port controller, 4 GB RAM on the 24-port controller
RAID 6 (write once, read often archive system)
lustre-1.6.6
vanilla-kernel 2.6.22.19
3ware 9650se (16 and 24port) latest 9.5.3 Version
Seagate 31000340NS disks, HITACHI 1TB disks

Thanks and Regards
Heiko


Dec  4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c9e000
Dec  4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076b78000
Dec  4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810071e88000
Dec  4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810070c80000
Dec  4 12:42:13 sadosrd24 LustreError: 4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc ffff810076a80000
Dec  4 12:42:56 sadosrd24 LustreError: 4744:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  r...@ffff81007efa7e00 x7869690/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926976 ref 1 fl Interpret:/0/0 rc 0/0
Dec  4 12:42:56 sadosrd24 Lustre: 4744:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry
Dec  4 12:42:57 sadosrd24 LustreError: 4754:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  r...@ffff81007dedb400 x7869691/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926977 ref 1 fl Interpret:/0/0 rc 0/0
Dec  4 12:42:57 sadosrd24 Lustre: 4754:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry
Dec  4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  r...@ffff81007e8de000 x7869700/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926980 ref 1 fl Interpret:/0/0 rc 0/0
Dec  4 12:43:00 sadosrd24 LustreError: 4757:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages
Dec  4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry
Dec  4 12:43:00 sadosrd24 Lustre: 4757:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages
Dec  4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 94+0s  r...@ffff81006de99850 x7869735/t0 o3->eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid:0/0 lens 384/336 e 0 to 0 dl 1259926996 ref 1 fl Interpret:/0/0 rc 0/0
Dec  4 12:43:16 sadosrd24 LustreError: 4746:0:(ost_handler.c:882:ost_brw_read()) Skipped 2 previous similar messages
Dec  4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) scia-OST001d: ignoring bulk IO comm error with eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff...@net_0x20000c0a8106f_uuid id 12345-192.168.16....@tcp - client will retry
Dec  4 12:43:16 sadosrd24 Lustre: 4746:0:(ost_handler.c:939:ost_brw_read()) Skipped 2 previous similar messages
Dec  4 12:44:13 sadosrd24 Lustre: 4728:0:(ldlm_lib.c:538:target_handle_reconnect()) scia-OST001d: eb2e7e64-c1d9-d1f6-8f9d-1ba9629ff4c0 reconnecting
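
To get an overview of which code paths report during such a freeze, the excerpt above could be tallied per reporting function. The following is only a rough sketch: the helper name tally_by_function and the regular expression are my own assumptions about the "(file.c:line:function())" part of each entry, not anything provided by Lustre, and "Skipped N previous similar messages" entries are counted once each.

```python
import re
from collections import Counter

# Assumed shape of the source location, e.g. "(events.c:372:server_bulk_callback())"
LOCATION = re.compile(r"\((\w+\.c):(\d+):(\w+)\(\)\)")

def tally_by_function(log_text):
    """Count log entries per reporting function name."""
    # Search the text as a whole so that entries wrapped across lines still match.
    return Counter(m.group(3) for m in LOCATION.finditer(log_text))

sample = (
    "Dec  4 12:42:13 sadosrd24 LustreError: \n"
    "4650:0:(events.c:372:server_bulk_callback()) event type 4, status -5, desc \n"
    "ffff810070c9e000\n"
    "Dec  4 12:42:56 sadosrd24 Lustre: 4744:0:(ost_handler.c:939:ost_brw_read()) \n"
    "scia-OST001d: ignoring bulk IO comm error with \n"
)

for function, count in tally_by_function(sample).most_common():
    print(function, count)
```

Run against a full syslog file, this would show whether the bulk callback errors or the read timeouts dominate in a given time window.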

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
